Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jeff King
On Mon, Jun 12, 2017 at 01:04:09PM -0700, Jonathan Tan wrote:

> > I'm not sure I know what you mean by config variables which are static,
> > are you referring to the in-memory options which are populated by
> > querying the config?  Those I wouldn't want to see placed in a
> > 'repository object'.
> 
> Yes. I agree that I wouldn't want to see them placed in a repository
> object, but after reading Peff's e-mail, I was thinking of what happens
> if a file repeatedly invokes a config-sensitive function in another
> file. For example:
> 
>  a.c
>   for (i = 0; i < 100; i++) b_output(repo, x);
> 
>  b.c
>   void b_output(struct repository *repo, int x)
>   {
>/* print the configured "b.prefix" followed by x */
>   }
> 
> We probably wouldn't want to parse the repo's configset every time
> b_output() is invoked, but then, where to store our parsed "b.prefix"?
> The only alternatives I see is to have a static hashmap in b.c (keyed by
> repo, as described above), which would work if such a situation is rare
> (or can be made rare), but if it is common, maybe we have no choice but
> to put it in struct repository.

I think besides optimization, we often parse the textual config into
variables and then _modify_ those variables based on other input (most
often command-line arguments provided by the user, but sometimes other
circumstances).

You can move the resolution to the point-of-use instead of
point-of-setting, but that's going to be a big change to how most of the
code already works.

-Peff


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jeff King
On Sun, Jun 11, 2017 at 10:24:12PM -0700, Stefan Beller wrote:

> On Fri, Jun 9, 2017 at 11:07 PM, Jeff King  wrote:
> > On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote:
> >
> >> Before I get into the details, I have some questions:
> >>
> >> 1. I am concerned that "struct repository" will end up growing without
> >> bounds as we store more and more repo-specific concerns in it. Could it
> >> be restricted to just the fields populated by repo_init()?
> >> repo_read_index() will then return the index itself, instead of using
> >> "struct repository" as a cache. This means that code using
> >> repo_read_index() will need to maintain its own variable holding the
> >> returned index, but that is likely a positive - it's better for code to
> >> just pass around the specific thing needed between functions anyway, as
> >> opposed to passing a giant "struct repository" (which partially defeats
> >> the purpose of eliminating the usage of globals).
> >
> > I think the repository object has to become a kitchen sink of sorts,
> > because we have tons of global variables representing repo-wide config.
> 
> AFAICT we want to operate on struct 'the_repo' and struct 'the_cmd_options'
> eventually. In our use case of submodules the submodules would ignore the
> settings of the main repo, but still accept guidance of the_cmd_config or
> 'the_config.
> 
> > So I have a feeling that we're always going to need some
> > big object to hold all that context when doing multi-repo operations in
> > a single process.
> 
> Well not just one big struct, but two. (or more?)

Right, I think you could have a separate kitchen-sink struct that isn't
the "repo" one. But now you have to pass both of those around, which is
going to get cumbersome. Almost every function is going to end up
passing around the context struct.

I almost think it would be easier to shove them all of the context into
a big global struct and "push" and "pop" contexts from a stack of
structs. That gets you the in-process benefits, though of course it's
absolutely horrible if you ever want to multi-thread across two contexts.

-Peff


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jonathan Tan
On Mon, 12 Jun 2017 12:11:21 -0700
Brandon Williams  wrote:

> On 06/12, Jonathan Tan wrote:
> > On Sat, 10 Jun 2017 02:07:12 -0400
> > Jeff King  wrote:
> > 
> > > I do agree that "pass just what the sub-function needs" is a good rule
> > > of thumb. But the reason that these are globals in the first place is
> > > that there are a ton of them, and they are used at the lowest levels of
> > > call chains. So I have a feeling that we're always going to need some
> > > big object to hold all that context when doing multi-repo operations in
> > > a single process.
> > 
> > From my experience with the codebase, it seems that most of these config
> > variables are static (file-local). This means that the lowest levels of
> > call chains could probably get away with storing per-repo configs in a
> > static hashmap or associative array keyed by repo (if they cannot just
> > pass the config around).
> > 
> > Having said that, if it did come to the hashmap, I probably would prefer
> > just putting the config in the repo object too. So maybe that is the way
> > to go.
> 
> This is how the config is already handled.  A config_set is just a
> wrapper around a fancy hashmap.  Callers query using a string as a key
> and are returned the value for that config option.  I say fancy because
> it does stuff to handle multiple values, etc.

The hashmap I meant is one wrapping the one you describe:

  repo -> configset

Or equivalently:

  repo -> (string -> value(s))

> I'm not sure I know what you mean by config variables which are static,
> are you referring to the in-memory options which are populated by
> querying the config?  Those I wouldn't want to see placed in a
> 'repository object'.

Yes. I agree that I wouldn't want to see them placed in a repository
object, but after reading Peff's e-mail, I was thinking of what happens
if a file repeatedly invokes a config-sensitive function in another
file. For example:

 a.c
  for (i = 0; i < 100; i++) b_output(repo, x);

 b.c
  void b_output(struct repository *repo, int x)
  {
   /* print the configured "b.prefix" followed by x */
  }

We probably wouldn't want to parse the repo's configset every time
b_output() is invoked, but then, where to store our parsed "b.prefix"?
The only alternatives I see is to have a static hashmap in b.c (keyed by
repo, as described above), which would work if such a situation is rare
(or can be made rare), but if it is common, maybe we have no choice but
to put it in struct repository.


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Brandon Williams
On 06/12, Jonathan Tan wrote:
> On Sat, 10 Jun 2017 02:07:12 -0400
> Jeff King  wrote:
> 
> > I do agree that "pass just what the sub-function needs" is a good rule
> > of thumb. But the reason that these are globals in the first place is
> > that there are a ton of them, and they are used at the lowest levels of
> > call chains. So I have a feeling that we're always going to need some
> > big object to hold all that context when doing multi-repo operations in
> > a single process.
> 
> From my experience with the codebase, it seems that most of these config
> variables are static (file-local). This means that the lowest levels of
> call chains could probably get away with storing per-repo configs in a
> static hashmap or associative array keyed by repo (if they cannot just
> pass the config around).
> 
> Having said that, if it did come to the hashmap, I probably would prefer
> just putting the config in the repo object too. So maybe that is the way
> to go.

This is how the config is already handled.  A config_set is just a
wrapper around a fancy hashmap.  Callers query using a string as a key
and are returned the value for that config option.  I say fancy because
it does stuff to handle multiple values, etc.

I'm not sure I know what you mean by config variables which are static,
are you referring to the in-memory options which are populated by
querying the config?  Those I wouldn't want to see placed in a
'repository object'.

-- 
Brandon Williams


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jonathan Tan
On Sat, 10 Jun 2017 17:43:29 -0700
Brandon Williams  wrote:

> I disagree with a few points of what jonathan said (mostly about
> removing the config from the repo object, as I like the idea of nothing
> knowing about a 'config_set' object) and I think this problem could be
> solved in a couple ways.

Ah...is the plan to eventually delete the git_configset_.* functions and
only have repo_config_get_.* functions? If yes, I would prefer the
in-between state to be "git_config_set_get_int(repo->configset, ...)"
instead of "repo_config_get_int(repo, ...)" to avoid the parallel set of
functions (which will make it more troublesome, for example, to add
support for a new data type) but I can see the benefits of having the
repo_config_get_.* functions too (conciseness and not having to make one
final find-and-replace when we finally complete the migration, at least)
so I don't feel too strongly about this.

> I don't think that the in-memory global variable 'quotepath' (or
> whatever its called) should live in the repository object (I mean it's
> already contained in the config) but rather 'quotepath' is specific to
> how ls-files handles its output.  So really what should happen is you
> pass a pair of objects to the ls-files machinery (or any other command's
> machinery) (1) the repository object being operated on and (2) an
> options struct which can be configured based on the repository.  So when
> recursing you can do something like the following:
> 
>   repo_init(submodule, path_to_submodule);
>   configure_opts(sub_opts, submodule, super_opts)
>   ls_files(submodule, sub_opts);
> 
> This eliminates bloating 'struct repository' and would allow you to have
> things configured differently in submodules (which is crazy if you ask
> me, but people do crazy things).

This does sound good to me.


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jonathan Tan
On Sat, 10 Jun 2017 02:07:12 -0400
Jeff King  wrote:

> I do agree that "pass just what the sub-function needs" is a good rule
> of thumb. But the reason that these are globals in the first place is
> that there are a ton of them, and they are used at the lowest levels of
> call chains. So I have a feeling that we're always going to need some
> big object to hold all that context when doing multi-repo operations in
> a single process.

From my experience with the codebase, it seems that most of these config
variables are static (file-local). This means that the lowest levels of
call chains could probably get away with storing per-repo configs in a
static hashmap or associative array keyed by repo (if they cannot just
pass the config around).

Having said that, if it did come to the hashmap, I probably would prefer
just putting the config in the repo object too. So maybe that is the way
to go.


Re: [PATCH v2 00/32] repository object

2017-06-11 Thread Stefan Beller
On Fri, Jun 9, 2017 at 11:07 PM, Jeff King  wrote:
> On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote:
>
>> Before I get into the details, I have some questions:
>>
>> 1. I am concerned that "struct repository" will end up growing without
>> bounds as we store more and more repo-specific concerns in it. Could it
>> be restricted to just the fields populated by repo_init()?
>> repo_read_index() will then return the index itself, instead of using
>> "struct repository" as a cache. This means that code using
>> repo_read_index() will need to maintain its own variable holding the
>> returned index, but that is likely a positive - it's better for code to
>> just pass around the specific thing needed between functions anyway, as
>> opposed to passing a giant "struct repository" (which partially defeats
>> the purpose of eliminating the usage of globals).
>
> I think the repository object has to become a kitchen sink of sorts,
> because we have tons of global variables representing repo-wide config.

AFAICT we want to operate on struct 'the_repo' and struct 'the_cmd_options'
eventually. In our use case of submodules the submodules would ignore the
settings of the main repo, but still accept guidance of the_cmd_config or
'the_config.

> So I have a feeling that we're always going to need some
> big object to hold all that context when doing multi-repo operations in
> a single process.

Well not just one big struct, but two. (or more?)


Re: [PATCH v2 00/32] repository object

2017-06-10 Thread Brandon Williams
On 06/10, Jeff King wrote:
> On Sat, Jun 10, 2017 at 02:07:12AM -0400, Jeff King wrote:
> 
> > I think the repository object has to become a kitchen sink of sorts,
> > because we have tons of global variables representing repo-wide config.
> > ls-files doesn't respect a lot of config, but what should, e.g.:
> > 
> >   git config core.quotepath true
> >   git -C submodule config core.quotepath false
> >   git ls-files --recurse-submodules
> >
> > [...]
> >
> > [1] I wanted to see how Brandon's series behaved for this quotepath
> > case, but unfortunately I couldn't get it to work in even a simple
> > case.  :(
> > 
> >   $ git ls-files --recurse-submodules
> >   fatal: index file corrupt
> 
> Ah, this was just hitting the bug mentioned later in the thread. With
> that fix, I can see that it does indeed behave differently than the
> current code:
> 
>   git config core.quotepath true
>   git -C submodule config core.quotepath false
>   (cd submodule &&
>echo hello >buenos_días &&
>git add .
>   )
>   git ls-files --recurse-submodules
> 
> shows:
> 
>   submodule/buenos_días
> 
> before the patch series, and:
> 
>   "submodule/buenos_d\303\255as"
> 
> after.
> 
> Like I said, I doubt this is a bug that anybody cares much about, but
> it's hard to know what other repo-specific global-variable usage is
> lurking in low-level code.

I disagree with a few points of what jonathan said (mostly about
removing the config from the repo object, as I like the idea of nothing
knowing about a 'config_set' object) and I think this problem could be
solved in a couple ways.

I don't think that the in-memory global variable 'quotepath' (or
whatever its called) should live in the repository object (I mean it's
already contained in the config) but rather 'quotepath' is specific to
how ls-files handles its output.  So really what should happen is you
pass a pair of objects to the ls-files machinery (or any other command's
machinery) (1) the repository object being operated on and (2) an
options struct which can be configured based on the repository.  So when
recursing you can do something like the following:

  repo_init(submodule, path_to_submodule);
  configure_opts(sub_opts, submodule, super_opts)
  ls_files(submodule, sub_opts);

This eliminates bloating 'struct repository' and would allow you to have
things configured differently in submodules (which is crazy if you ask
me, but people do crazy things).

-- 
Brandon Williams


Re: [PATCH v2 00/32] repository object

2017-06-10 Thread Brandon Williams
On 06/10, Jeff King wrote:
> On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote:
> 
> > Before I get into the details, I have some questions:
> > 
> > 1. I am concerned that "struct repository" will end up growing without
> > bounds as we store more and more repo-specific concerns in it. Could it
> > be restricted to just the fields populated by repo_init()?
> > repo_read_index() will then return the index itself, instead of using
> > "struct repository" as a cache. This means that code using
> > repo_read_index() will need to maintain its own variable holding the
> > returned index, but that is likely a positive - it's better for code to
> > just pass around the specific thing needed between functions anyway, as
> > opposed to passing a giant "struct repository" (which partially defeats
> > the purpose of eliminating the usage of globals).
> 
> I think the repository object has to become a kitchen sink of sorts,
> because we have tons of global variables representing repo-wide config.
> ls-files doesn't respect a lot of config, but what should, e.g.:
> 
>   git config core.quotepath true
>   git -C submodule config core.quotepath false
>   git ls-files --recurse-submodules
> 
> do?  Right now, with a separate process, we respect the submodule
> version of the config. But in a single process[1] we'd need one copy of
> the quote_path_fully variable for each repo object. It's tempting for
> this case to say that core.quotepath from the super-project should just
> take precedence, as that's where the command is issued from (and why the
> heck would anybody have per-repo settings for this anyway?). But I
> suspect as we get into more complicated commands that there are likely
> to be config variables that are important to match to each repo.
> 
> I do agree that "pass just what the sub-function needs" is a good rule
> of thumb. But the reason that these are globals in the first place is
> that there are a ton of them, and they are used at the lowest levels of
> call chains. So I have a feeling that we're always going to need some
> big object to hold all that context when doing multi-repo operations in
> a single process.
> 
> For config, in theory that could be a big "config_set" object, but
> that's not quite how we treat our config. We usually parse it once into
> actual variables. So really you end up with a big parsed-config object
> that gets passed around, I'd think.
> 
> -Peff
> 
> [1] I wanted to see how Brandon's series behaved for this quotepath
> case, but unfortunately I couldn't get it to work in even a simple
> case.  :(
> 
>   $ git ls-files --recurse-submodules
>   fatal: index file corrupt

Yeah sorry about that...I commented on patch 32 indicating that I made a
small mistake and forgot a '> 0' when checking the index.  I made a last
minute change before sending v2 out and forgot to run tests again (I'm
terrible i know!).

-- 
Brandon Williams


Re: [PATCH v2 00/32] repository object

2017-06-10 Thread Jeff King
On Sat, Jun 10, 2017 at 02:07:12AM -0400, Jeff King wrote:

> I think the repository object has to become a kitchen sink of sorts,
> because we have tons of global variables representing repo-wide config.
> ls-files doesn't respect a lot of config, but what should, e.g.:
> 
>   git config core.quotepath true
>   git -C submodule config core.quotepath false
>   git ls-files --recurse-submodules
>
> [...]
>
> [1] I wanted to see how Brandon's series behaved for this quotepath
> case, but unfortunately I couldn't get it to work in even a simple
> case.  :(
> 
>   $ git ls-files --recurse-submodules
>   fatal: index file corrupt

Ah, this was just hitting the bug mentioned later in the thread. With
that fix, I can see that it does indeed behave differently than the
current code:

  git config core.quotepath true
  git -C submodule config core.quotepath false
  (cd submodule &&
   echo hello >buenos_días &&
   git add .
  )
  git ls-files --recurse-submodules

shows:

  submodule/buenos_días

before the patch series, and:

  "submodule/buenos_d\303\255as"

after.

Like I said, I doubt this is a bug that anybody cares much about, but
it's hard to know what other repo-specific global-variable usage is
lurking in low-level code.

-Peff


Re: [PATCH v2 00/32] repository object

2017-06-10 Thread Jeff King
On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote:

> Before I get into the details, I have some questions:
> 
> 1. I am concerned that "struct repository" will end up growing without
> bounds as we store more and more repo-specific concerns in it. Could it
> be restricted to just the fields populated by repo_init()?
> repo_read_index() will then return the index itself, instead of using
> "struct repository" as a cache. This means that code using
> repo_read_index() will need to maintain its own variable holding the
> returned index, but that is likely a positive - it's better for code to
> just pass around the specific thing needed between functions anyway, as
> opposed to passing a giant "struct repository" (which partially defeats
> the purpose of eliminating the usage of globals).

I think the repository object has to become a kitchen sink of sorts,
because we have tons of global variables representing repo-wide config.
ls-files doesn't respect a lot of config, but what should, e.g.:

  git config core.quotepath true
  git -C submodule config core.quotepath false
  git ls-files --recurse-submodules

do?  Right now, with a separate process, we respect the submodule
version of the config. But in a single process[1] we'd need one copy of
the quote_path_fully variable for each repo object. It's tempting for
this case to say that core.quotepath from the super-project should just
take precedence, as that's where the command is issued from (and why the
heck would anybody have per-repo settings for this anyway?). But I
suspect as we get into more complicated commands that there are likely
to be config variables that are important to match to each repo.

I do agree that "pass just what the sub-function needs" is a good rule
of thumb. But the reason that these are globals in the first place is
that there are a ton of them, and they are used at the lowest levels of
call chains. So I have a feeling that we're always going to need some
big object to hold all that context when doing multi-repo operations in
a single process.

For config, in theory that could be a big "config_set" object, but
that's not quite how we treat our config. We usually parse it once into
actual variables. So really you end up with a big parsed-config object
that gets passed around, I'd think.

-Peff

[1] I wanted to see how Brandon's series behaved for this quotepath
case, but unfortunately I couldn't get it to work in even a simple
case.  :(

  $ git ls-files --recurse-submodules
  fatal: index file corrupt


Re: [PATCH v2 00/32] repository object

2017-06-09 Thread Jonathan Tan
On Thu,  8 Jun 2017 16:40:28 -0700
Brandon Williams  wrote:

> When I sent out my RFC series there seemed to be a lot of interest but I
> haven't seen many people jump to review this series.  Despite lack of review I
> wanted to get out another version which includes some changes to fix things
> that were bugging me about the series.  Hopfully this v2 will prod some more
> people to take a look.
> 
> The meat of the series is really from 04-15 and patch 32 which converts
> ls-files to recurse using a repository object.  So if you're pressed for time
> you can focus on those patches.

Thanks - I can see in patch 32 that one immediate benefit is that
ls-files can now recurse into submodules without needing to invoke an
extra process just to change the GIT_DIR.

Before I get into the details, I have some questions:

1. I am concerned that "struct repository" will end up growing without
bounds as we store more and more repo-specific concerns in it. Could it
be restricted to just the fields populated by repo_init()?
repo_read_index() will then return the index itself, instead of using
"struct repository" as a cache. This means that code using
repo_read_index() will need to maintain its own variable holding the
returned index, but that is likely a positive - it's better for code to
just pass around the specific thing needed between functions anyway, as
opposed to passing a giant "struct repository" (which partially defeats
the purpose of eliminating the usage of globals).

2. If we do the above, another potential benefit is that (I think) the
repo_config_get_.* functions would no longer be necessary, since we
would have a function that generates a "struct config_set" given a repo,
and then we just use the git_configset_get_.* functions on it. This
would also reduce the urgency of extracting the config methods into its
own header file, and reduce the size of this patch set.

(I was also going to suggest that you remove the "convert" patches, but
that is not possible - ls-files uses get_cached_convert_stats_ascii()
from convert.h despite not #include-ing it.)


[PATCH v2 00/32] repository object

2017-06-08 Thread Brandon Williams
When I sent out my RFC series there seemed to be a lot of interest but I
haven't seen many people jump to review this series.  Despite lack of review I
wanted to get out another version which includes some changes to fix things
that were bugging me about the series.  Hopfully this v2 will prod some more
people to take a look.

The meat of the series is really from 04-15 and patch 32 which converts
ls-files to recurse using a repository object.  So if you're pressed for time
you can focus on those patches.

You can find this series at: 
https://github.com/bmwill/git/tree/repository-object

What's different in v2:
* 'struct repo' has been renamed 'struct repository'.  Really just an aesthetic
  change and it looks more "official" without having a shortened name.
* Better commit messages on a few of the patches which were soarly lacking.
* I found that 'git_config_iter' was being exported without an implementation
  so I added a patch to remove it.
* repo_init is a bit cleaner and is written in such a way to not call 'die()',
  instead it returns a non-zero return value upon failure.

Thanks for your help!

Brandon Williams (32):
  config: create config.h
  config: remove git_config_iter
  config: don't include config.h by default
  config: don't implicitly use gitdir
  setup: don't perform lazy initialization of repository state
  environment: remove namespace_len variable
  repository: introduce the repository object
  environment: place key repository state in the_repository
  environment: store worktree in the_repository
  setup: add comment indicating a hack
  config: read config from a repository object
  repository: add index_state to struct repo
  submodule-config: store the_submodule_cache in the_repository
  submodule: add repo_read_gitmodules
  submodule: convert is_submodule_initialized to work on a repository
  convert: convert get_cached_convert_stats_ascii to take an index
  convert: convert crlf_to_git to take an index
  convert: convert convert_to_git_filter_fd to take an index
  convert: convert convert_to_git to take an index
  convert: convert renormalize_buffer to take an index
  tree: convert read_tree to take an index parameter
  ls-files: convert overlay_tree_on_cache to take an index
  ls-files: convert write_eolinfo to take an index
  ls-files: convert show_killed_files to take an index
  ls-files: convert show_other_files to take an index
  ls-files: convert show_ru_info to take an index
  ls-files: convert ce_excluded to take an index
  ls-files: convert prune_cache to take an index
  ls-files: convert show_files to take an index
  ls-files: factor out debug info into a function
  ls-files: factor out tag calculation
  ls-files: use repository object

 Makefile   |   1 +
 advice.c   |   1 +
 alias.c|   1 +
 apply.c|   3 +-
 archive-tar.c  |   1 +
 archive-zip.c  |   1 +
 archive.c  |   1 +
 attr.c |   1 +
 bisect.c   |   1 +
 branch.c   |   1 +
 builtin/add.c  |   1 +
 builtin/am.c   |   1 +
 builtin/blame.c|   3 +-
 builtin/branch.c   |   1 +
 builtin/cat-file.c |   1 +
 builtin/check-attr.c   |   1 +
 builtin/check-ignore.c |   1 +
 builtin/check-mailmap.c|   1 +
 builtin/checkout-index.c   |   1 +
 builtin/checkout.c |   1 +
 builtin/clean.c|   1 +
 builtin/clone.c|   1 +
 builtin/column.c   |   1 +
 builtin/commit-tree.c  |   1 +
 builtin/commit.c   |   4 +-
 builtin/config.c   |   3 +
 builtin/count-objects.c|   1 +
 builtin/describe.c |   1 +
 builtin/diff-files.c   |   1 +
 builtin/diff-index.c   |   1 +
 builtin/diff-tree.c|   1 +
 builtin/diff.c |   1 +
 builtin/difftool.c |   1 +
 builtin/fast-export.c  |   1 +
 builtin/fetch.c|   1 +
 builtin/fmt-merge-msg.c|   1 +
 builtin/for-each-ref.c |   1 +
 builtin/fsck.c |   1 +
 builtin/gc.c   |   1 +
 builtin/grep.c |   4 +-
 builtin/hash-object.c  |   1 +
 builtin/help.c |   1 +
 builtin/index-pack.c   |   1 +
 builtin/init-db.c  |   1 +
 builtin/log.c  |   1 +
 builtin/ls-files.c | 330 -
 builtin/ls-tree.c