Re: [PATCH v2 00/32] repository object
On Mon, Jun 12, 2017 at 01:04:09PM -0700, Jonathan Tan wrote: > > I'm not sure I know what you mean by config variables which are static, > > are you referring to the in-memory options which are populated by > > querying the config? Those I wouldn't want to see placed in a > > 'repository object'. > > Yes. I agree that I wouldn't want to see them placed in a repository > object, but after reading Peff's e-mail, I was thinking of what happens > if a file repeatedly invokes a config-sensitive function in another > file. For example: > > a.c > for (i = 0; i < 100; i++) b_output(repo, x); > > b.c > void b_output(struct repository *repo, int x) > { >/* print the configured "b.prefix" followed by x */ > } > > We probably wouldn't want to parse the repo's configset every time > b_output() is invoked, but then, where to store our parsed "b.prefix"? > The only alternatives I see is to have a static hashmap in b.c (keyed by > repo, as described above), which would work if such a situation is rare > (or can be made rare), but if it is common, maybe we have no choice but > to put it in struct repository. I think besides optimization, we often parse the textual config into variables and then _modify_ those variables based on other input (most often command-line arguments provided by the user, but sometimes other circumstances). You can move the resolution to the point-of-use instead of point-of-setting, but that's going to be a big change to how most of the code already works. -Peff
Re: [PATCH v2 00/32] repository object
On Sun, Jun 11, 2017 at 10:24:12PM -0700, Stefan Beller wrote: > On Fri, Jun 9, 2017 at 11:07 PM, Jeff Kingwrote: > > On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote: > > > >> Before I get into the details, I have some questions: > >> > >> 1. I am concerned that "struct repository" will end up growing without > >> bounds as we store more and more repo-specific concerns in it. Could it > >> be restricted to just the fields populated by repo_init()? > >> repo_read_index() will then return the index itself, instead of using > >> "struct repository" as a cache. This means that code using > >> repo_read_index() will need to maintain its own variable holding the > >> returned index, but that is likely a positive - it's better for code to > >> just pass around the specific thing needed between functions anyway, as > >> opposed to passing a giant "struct repository" (which partially defeats > >> the purpose of eliminating the usage of globals). > > > > I think the repository object has to become a kitchen sink of sorts, > > because we have tons of global variables representing repo-wide config. > > AFAICT we want to operate on struct 'the_repo' and struct 'the_cmd_options' > eventually. In our use case of submodules the submodules would ignore the > settings of the main repo, but still accept guidance of the_cmd_config or > 'the_config. > > > So I have a feeling that we're always going to need some > > big object to hold all that context when doing multi-repo operations in > > a single process. > > Well not just one big struct, but two. (or more?) Right, I think you could have a separate kitchen-sink struct that isn't the "repo" one. But now you have to pass both of those around, which is going to get cumbersome. Almost every function is going to end up passing around the context struct. I almost think it would be easier to shove them all of the context into a big global struct and "push" and "pop" contexts from a stack of structs. That gets you the in-process benefits, though of course it's absolutely horrible if you ever want to multi-thread across two contexts. -Peff
Re: [PATCH v2 00/32] repository object
On Mon, 12 Jun 2017 12:11:21 -0700 Brandon Williamswrote: > On 06/12, Jonathan Tan wrote: > > On Sat, 10 Jun 2017 02:07:12 -0400 > > Jeff King wrote: > > > > > I do agree that "pass just what the sub-function needs" is a good rule > > > of thumb. But the reason that these are globals in the first place is > > > that there are a ton of them, and they are used at the lowest levels of > > > call chains. So I have a feeling that we're always going to need some > > > big object to hold all that context when doing multi-repo operations in > > > a single process. > > > > From my experience with the codebase, it seems that most of these config > > variables are static (file-local). This means that the lowest levels of > > call chains could probably get away with storing per-repo configs in a > > static hashmap or associative array keyed by repo (if they cannot just > > pass the config around). > > > > Having said that, if it did come to the hashmap, I probably would prefer > > just putting the config in the repo object too. So maybe that is the way > > to go. > > This is how the config is already handled. A config_set is just a > wrapper around a fancy hashmap. Callers query using a string as a key > and are returned the value for that config option. I say fancy because > it does stuff to handle multiple values, etc. The hashmap I meant is one wrapping the one you describe: repo -> configset Or equivalently: repo -> (string -> value(s)) > I'm not sure I know what you mean by config variables which are static, > are you referring to the in-memory options which are populated by > querying the config? Those I wouldn't want to see placed in a > 'repository object'. Yes. I agree that I wouldn't want to see them placed in a repository object, but after reading Peff's e-mail, I was thinking of what happens if a file repeatedly invokes a config-sensitive function in another file. For example: a.c for (i = 0; i < 100; i++) b_output(repo, x); b.c void b_output(struct repository *repo, int x) { /* print the configured "b.prefix" followed by x */ } We probably wouldn't want to parse the repo's configset every time b_output() is invoked, but then, where to store our parsed "b.prefix"? The only alternatives I see is to have a static hashmap in b.c (keyed by repo, as described above), which would work if such a situation is rare (or can be made rare), but if it is common, maybe we have no choice but to put it in struct repository.
Re: [PATCH v2 00/32] repository object
On 06/12, Jonathan Tan wrote: > On Sat, 10 Jun 2017 02:07:12 -0400 > Jeff Kingwrote: > > > I do agree that "pass just what the sub-function needs" is a good rule > > of thumb. But the reason that these are globals in the first place is > > that there are a ton of them, and they are used at the lowest levels of > > call chains. So I have a feeling that we're always going to need some > > big object to hold all that context when doing multi-repo operations in > > a single process. > > From my experience with the codebase, it seems that most of these config > variables are static (file-local). This means that the lowest levels of > call chains could probably get away with storing per-repo configs in a > static hashmap or associative array keyed by repo (if they cannot just > pass the config around). > > Having said that, if it did come to the hashmap, I probably would prefer > just putting the config in the repo object too. So maybe that is the way > to go. This is how the config is already handled. A config_set is just a wrapper around a fancy hashmap. Callers query using a string as a key and are returned the value for that config option. I say fancy because it does stuff to handle multiple values, etc. I'm not sure I know what you mean by config variables which are static, are you referring to the in-memory options which are populated by querying the config? Those I wouldn't want to see placed in a 'repository object'. -- Brandon Williams
Re: [PATCH v2 00/32] repository object
On Sat, 10 Jun 2017 17:43:29 -0700 Brandon Williamswrote: > I disagree with a few points of what jonathan said (mostly about > removing the config from the repo object, as I like the idea of nothing > knowing about a 'config_set' object) and I think this problem could be > solved in a couple ways. Ah...is the plan to eventually delete the git_configset_.* functions and only have repo_config_get_.* functions? If yes, I would prefer the in-between state to be "git_config_set_get_int(repo->configset, ...)" instead of "repo_config_get_int(repo, ...)" to avoid the parallel set of functions (which will make it more troublesome, for example, to add support for a new data type) but I can see the benefits of having the repo_config_get_.* functions too (conciseness and not having to make one final find-and-replace when we finally complete the migration, at least) so I don't feel too strongly about this. > I don't think that the in-memory global variable 'quotepath' (or > whatever its called) should live in the repository object (I mean it's > already contained in the config) but rather 'quotepath' is specific to > how ls-files handles its output. So really what should happen is you > pass a pair of objects to the ls-files machinery (or any other command's > machinery) (1) the repository object being operated on and (2) an > options struct which can be configured based on the repository. So when > recursing you can do something like the following: > > repo_init(submodule, path_to_submodule); > configure_opts(sub_opts, submodule, super_opts) > ls_files(submodule, sub_opts); > > This eliminates bloating 'struct repository' and would allow you to have > things configured differently in submodules (which is crazy if you ask > me, but people do crazy things). This does sound good to me.
Re: [PATCH v2 00/32] repository object
On Sat, 10 Jun 2017 02:07:12 -0400 Jeff Kingwrote: > I do agree that "pass just what the sub-function needs" is a good rule > of thumb. But the reason that these are globals in the first place is > that there are a ton of them, and they are used at the lowest levels of > call chains. So I have a feeling that we're always going to need some > big object to hold all that context when doing multi-repo operations in > a single process. From my experience with the codebase, it seems that most of these config variables are static (file-local). This means that the lowest levels of call chains could probably get away with storing per-repo configs in a static hashmap or associative array keyed by repo (if they cannot just pass the config around). Having said that, if it did come to the hashmap, I probably would prefer just putting the config in the repo object too. So maybe that is the way to go.
Re: [PATCH v2 00/32] repository object
On Fri, Jun 9, 2017 at 11:07 PM, Jeff Kingwrote: > On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote: > >> Before I get into the details, I have some questions: >> >> 1. I am concerned that "struct repository" will end up growing without >> bounds as we store more and more repo-specific concerns in it. Could it >> be restricted to just the fields populated by repo_init()? >> repo_read_index() will then return the index itself, instead of using >> "struct repository" as a cache. This means that code using >> repo_read_index() will need to maintain its own variable holding the >> returned index, but that is likely a positive - it's better for code to >> just pass around the specific thing needed between functions anyway, as >> opposed to passing a giant "struct repository" (which partially defeats >> the purpose of eliminating the usage of globals). > > I think the repository object has to become a kitchen sink of sorts, > because we have tons of global variables representing repo-wide config. AFAICT we want to operate on struct 'the_repo' and struct 'the_cmd_options' eventually. In our use case of submodules the submodules would ignore the settings of the main repo, but still accept guidance of the_cmd_config or 'the_config. > So I have a feeling that we're always going to need some > big object to hold all that context when doing multi-repo operations in > a single process. Well not just one big struct, but two. (or more?)
Re: [PATCH v2 00/32] repository object
On 06/10, Jeff King wrote: > On Sat, Jun 10, 2017 at 02:07:12AM -0400, Jeff King wrote: > > > I think the repository object has to become a kitchen sink of sorts, > > because we have tons of global variables representing repo-wide config. > > ls-files doesn't respect a lot of config, but what should, e.g.: > > > > git config core.quotepath true > > git -C submodule config core.quotepath false > > git ls-files --recurse-submodules > > > > [...] > > > > [1] I wanted to see how Brandon's series behaved for this quotepath > > case, but unfortunately I couldn't get it to work in even a simple > > case. :( > > > > $ git ls-files --recurse-submodules > > fatal: index file corrupt > > Ah, this was just hitting the bug mentioned later in the thread. With > that fix, I can see that it does indeed behave differently than the > current code: > > git config core.quotepath true > git -C submodule config core.quotepath false > (cd submodule && >echo hello >buenos_días && >git add . > ) > git ls-files --recurse-submodules > > shows: > > submodule/buenos_días > > before the patch series, and: > > "submodule/buenos_d\303\255as" > > after. > > Like I said, I doubt this is a bug that anybody cares much about, but > it's hard to know what other repo-specific global-variable usage is > lurking in low-level code. I disagree with a few points of what jonathan said (mostly about removing the config from the repo object, as I like the idea of nothing knowing about a 'config_set' object) and I think this problem could be solved in a couple ways. I don't think that the in-memory global variable 'quotepath' (or whatever its called) should live in the repository object (I mean it's already contained in the config) but rather 'quotepath' is specific to how ls-files handles its output. So really what should happen is you pass a pair of objects to the ls-files machinery (or any other command's machinery) (1) the repository object being operated on and (2) an options struct which can be configured based on the repository. So when recursing you can do something like the following: repo_init(submodule, path_to_submodule); configure_opts(sub_opts, submodule, super_opts) ls_files(submodule, sub_opts); This eliminates bloating 'struct repository' and would allow you to have things configured differently in submodules (which is crazy if you ask me, but people do crazy things). -- Brandon Williams
Re: [PATCH v2 00/32] repository object
On 06/10, Jeff King wrote: > On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote: > > > Before I get into the details, I have some questions: > > > > 1. I am concerned that "struct repository" will end up growing without > > bounds as we store more and more repo-specific concerns in it. Could it > > be restricted to just the fields populated by repo_init()? > > repo_read_index() will then return the index itself, instead of using > > "struct repository" as a cache. This means that code using > > repo_read_index() will need to maintain its own variable holding the > > returned index, but that is likely a positive - it's better for code to > > just pass around the specific thing needed between functions anyway, as > > opposed to passing a giant "struct repository" (which partially defeats > > the purpose of eliminating the usage of globals). > > I think the repository object has to become a kitchen sink of sorts, > because we have tons of global variables representing repo-wide config. > ls-files doesn't respect a lot of config, but what should, e.g.: > > git config core.quotepath true > git -C submodule config core.quotepath false > git ls-files --recurse-submodules > > do? Right now, with a separate process, we respect the submodule > version of the config. But in a single process[1] we'd need one copy of > the quote_path_fully variable for each repo object. It's tempting for > this case to say that core.quotepath from the super-project should just > take precedence, as that's where the command is issued from (and why the > heck would anybody have per-repo settings for this anyway?). But I > suspect as we get into more complicated commands that there are likely > to be config variables that are important to match to each repo. > > I do agree that "pass just what the sub-function needs" is a good rule > of thumb. But the reason that these are globals in the first place is > that there are a ton of them, and they are used at the lowest levels of > call chains. So I have a feeling that we're always going to need some > big object to hold all that context when doing multi-repo operations in > a single process. > > For config, in theory that could be a big "config_set" object, but > that's not quite how we treat our config. We usually parse it once into > actual variables. So really you end up with a big parsed-config object > that gets passed around, I'd think. > > -Peff > > [1] I wanted to see how Brandon's series behaved for this quotepath > case, but unfortunately I couldn't get it to work in even a simple > case. :( > > $ git ls-files --recurse-submodules > fatal: index file corrupt Yeah sorry about that...I commented on patch 32 indicating that I made a small mistake and forgot a '> 0' when checking the index. I made a last minute change before sending v2 out and forgot to run tests again (I'm terrible i know!). -- Brandon Williams
Re: [PATCH v2 00/32] repository object
On Sat, Jun 10, 2017 at 02:07:12AM -0400, Jeff King wrote: > I think the repository object has to become a kitchen sink of sorts, > because we have tons of global variables representing repo-wide config. > ls-files doesn't respect a lot of config, but what should, e.g.: > > git config core.quotepath true > git -C submodule config core.quotepath false > git ls-files --recurse-submodules > > [...] > > [1] I wanted to see how Brandon's series behaved for this quotepath > case, but unfortunately I couldn't get it to work in even a simple > case. :( > > $ git ls-files --recurse-submodules > fatal: index file corrupt Ah, this was just hitting the bug mentioned later in the thread. With that fix, I can see that it does indeed behave differently than the current code: git config core.quotepath true git -C submodule config core.quotepath false (cd submodule && echo hello >buenos_días && git add . ) git ls-files --recurse-submodules shows: submodule/buenos_días before the patch series, and: "submodule/buenos_d\303\255as" after. Like I said, I doubt this is a bug that anybody cares much about, but it's hard to know what other repo-specific global-variable usage is lurking in low-level code. -Peff
Re: [PATCH v2 00/32] repository object
On Fri, Jun 09, 2017 at 05:40:34PM -0700, Jonathan Tan wrote: > Before I get into the details, I have some questions: > > 1. I am concerned that "struct repository" will end up growing without > bounds as we store more and more repo-specific concerns in it. Could it > be restricted to just the fields populated by repo_init()? > repo_read_index() will then return the index itself, instead of using > "struct repository" as a cache. This means that code using > repo_read_index() will need to maintain its own variable holding the > returned index, but that is likely a positive - it's better for code to > just pass around the specific thing needed between functions anyway, as > opposed to passing a giant "struct repository" (which partially defeats > the purpose of eliminating the usage of globals). I think the repository object has to become a kitchen sink of sorts, because we have tons of global variables representing repo-wide config. ls-files doesn't respect a lot of config, but what should, e.g.: git config core.quotepath true git -C submodule config core.quotepath false git ls-files --recurse-submodules do? Right now, with a separate process, we respect the submodule version of the config. But in a single process[1] we'd need one copy of the quote_path_fully variable for each repo object. It's tempting for this case to say that core.quotepath from the super-project should just take precedence, as that's where the command is issued from (and why the heck would anybody have per-repo settings for this anyway?). But I suspect as we get into more complicated commands that there are likely to be config variables that are important to match to each repo. I do agree that "pass just what the sub-function needs" is a good rule of thumb. But the reason that these are globals in the first place is that there are a ton of them, and they are used at the lowest levels of call chains. So I have a feeling that we're always going to need some big object to hold all that context when doing multi-repo operations in a single process. For config, in theory that could be a big "config_set" object, but that's not quite how we treat our config. We usually parse it once into actual variables. So really you end up with a big parsed-config object that gets passed around, I'd think. -Peff [1] I wanted to see how Brandon's series behaved for this quotepath case, but unfortunately I couldn't get it to work in even a simple case. :( $ git ls-files --recurse-submodules fatal: index file corrupt
Re: [PATCH v2 00/32] repository object
On Thu, 8 Jun 2017 16:40:28 -0700 Brandon Williamswrote: > When I sent out my RFC series there seemed to be a lot of interest but I > haven't seen many people jump to review this series. Despite lack of review I > wanted to get out another version which includes some changes to fix things > that were bugging me about the series. Hopfully this v2 will prod some more > people to take a look. > > The meat of the series is really from 04-15 and patch 32 which converts > ls-files to recurse using a repository object. So if you're pressed for time > you can focus on those patches. Thanks - I can see in patch 32 that one immediate benefit is that ls-files can now recurse into submodules without needing to invoke an extra process just to change the GIT_DIR. Before I get into the details, I have some questions: 1. I am concerned that "struct repository" will end up growing without bounds as we store more and more repo-specific concerns in it. Could it be restricted to just the fields populated by repo_init()? repo_read_index() will then return the index itself, instead of using "struct repository" as a cache. This means that code using repo_read_index() will need to maintain its own variable holding the returned index, but that is likely a positive - it's better for code to just pass around the specific thing needed between functions anyway, as opposed to passing a giant "struct repository" (which partially defeats the purpose of eliminating the usage of globals). 2. If we do the above, another potential benefit is that (I think) the repo_config_get_.* functions would no longer be necessary, since we would have a function that generates a "struct config_set" given a repo, and then we just use the git_configset_get_.* functions on it. This would also reduce the urgency of extracting the config methods into its own header file, and reduce the size of this patch set. (I was also going to suggest that you remove the "convert" patches, but that is not possible - ls-files uses get_cached_convert_stats_ascii() from convert.h despite not #include-ing it.)
[PATCH v2 00/32] repository object
When I sent out my RFC series there seemed to be a lot of interest but I haven't seen many people jump to review this series. Despite lack of review I wanted to get out another version which includes some changes to fix things that were bugging me about the series. Hopfully this v2 will prod some more people to take a look. The meat of the series is really from 04-15 and patch 32 which converts ls-files to recurse using a repository object. So if you're pressed for time you can focus on those patches. You can find this series at: https://github.com/bmwill/git/tree/repository-object What's different in v2: * 'struct repo' has been renamed 'struct repository'. Really just an aesthetic change and it looks more "official" without having a shortened name. * Better commit messages on a few of the patches which were soarly lacking. * I found that 'git_config_iter' was being exported without an implementation so I added a patch to remove it. * repo_init is a bit cleaner and is written in such a way to not call 'die()', instead it returns a non-zero return value upon failure. Thanks for your help! Brandon Williams (32): config: create config.h config: remove git_config_iter config: don't include config.h by default config: don't implicitly use gitdir setup: don't perform lazy initialization of repository state environment: remove namespace_len variable repository: introduce the repository object environment: place key repository state in the_repository environment: store worktree in the_repository setup: add comment indicating a hack config: read config from a repository object repository: add index_state to struct repo submodule-config: store the_submodule_cache in the_repository submodule: add repo_read_gitmodules submodule: convert is_submodule_initialized to work on a repository convert: convert get_cached_convert_stats_ascii to take an index convert: convert crlf_to_git to take an index convert: convert convert_to_git_filter_fd to take an index convert: convert convert_to_git to take an index convert: convert renormalize_buffer to take an index tree: convert read_tree to take an index parameter ls-files: convert overlay_tree_on_cache to take an index ls-files: convert write_eolinfo to take an index ls-files: convert show_killed_files to take an index ls-files: convert show_other_files to take an index ls-files: convert show_ru_info to take an index ls-files: convert ce_excluded to take an index ls-files: convert prune_cache to take an index ls-files: convert show_files to take an index ls-files: factor out debug info into a function ls-files: factor out tag calculation ls-files: use repository object Makefile | 1 + advice.c | 1 + alias.c| 1 + apply.c| 3 +- archive-tar.c | 1 + archive-zip.c | 1 + archive.c | 1 + attr.c | 1 + bisect.c | 1 + branch.c | 1 + builtin/add.c | 1 + builtin/am.c | 1 + builtin/blame.c| 3 +- builtin/branch.c | 1 + builtin/cat-file.c | 1 + builtin/check-attr.c | 1 + builtin/check-ignore.c | 1 + builtin/check-mailmap.c| 1 + builtin/checkout-index.c | 1 + builtin/checkout.c | 1 + builtin/clean.c| 1 + builtin/clone.c| 1 + builtin/column.c | 1 + builtin/commit-tree.c | 1 + builtin/commit.c | 4 +- builtin/config.c | 3 + builtin/count-objects.c| 1 + builtin/describe.c | 1 + builtin/diff-files.c | 1 + builtin/diff-index.c | 1 + builtin/diff-tree.c| 1 + builtin/diff.c | 1 + builtin/difftool.c | 1 + builtin/fast-export.c | 1 + builtin/fetch.c| 1 + builtin/fmt-merge-msg.c| 1 + builtin/for-each-ref.c | 1 + builtin/fsck.c | 1 + builtin/gc.c | 1 + builtin/grep.c | 4 +- builtin/hash-object.c | 1 + builtin/help.c | 1 + builtin/index-pack.c | 1 + builtin/init-db.c | 1 + builtin/log.c | 1 + builtin/ls-files.c | 330 - builtin/ls-tree.c