Re: What's cooking in git.git (Apr 2019, #03; Tue, 16)
On 4/16/2019 9:19 AM, Junio C Hamano wrote: Here are the topics that have been cooking. Commits prefixed with '-' are only in 'pu' (proposed updates) while commits prefixed with '+' are in 'next'. The ones marked with '.' do not appear in any of the integration branches, but I am still holding onto them. Yet another batch of ~30 topics have graduated to 'master', and 'next' has also gained ~25 topics. We may want to start merging down fixes to 'maint' for a 2.21.1. You can find the changes described here in the integration branches of the repositories listed at http://git-blame.blogspot.com/p/git-public-repositories.html -- * bp/post-index-change-hook (2019-02-15) 1 commit (merged to 'next' on 2019-03-11 at cb96d1d7c4) + read-cache: add post-index-change hook Originally merged to 'next' on 2019-02-23 A new hook "post-index-change" is called when the on-disk index file changes, which can help e.g. a virtualized working tree implementation. Will cook in 'next'. Anything in particular this is waiting for? I'm unaware of any requests for a re-roll. If something is needed, please let me know.
RE: regression in new built-in stash + fsmonitor (was Re: [PATCH v13 11/27] stash: convert apply to builtin)
> It fixes not just this issue, but now the whole test suite passes with > GIT_TEST_FSMONITOR, i.e. this test that's been failing for ~2 years also > works now: > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpubli > c- > inbox.org%2Fgit%2F87k1vwn9qe.fsf%40evledraar.gmail.com%2F&data= > 02%7C01%7CBen.Peart%40microsoft.com%7C758f6272cb9741291c0208d6a8 > d650b4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6368820357 > 67290383&sdata=H4J7xguBJVxanMfn4y0I0HsXBNYcrMS9IrwHn3aWwO > U%3D&reserved=0 > > >> In the meantime I did a build with "next" (so stash-in-C) using the > >> standard test mode and these: > >> > >> (cd t && GIT_TEST_GETTEXT_POISON=true /usr/bin/prove > $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all > GIT_SKIP_TESTS="t3404.8 t3420.36" /usr/bin/prove > $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_SPLIT_INDEX=true /usr/bin/prove > $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_FULL_IN_PACK_ARRAY=true GIT_TEST_OE_SIZE=10 > /usr/bin/prove $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_COMMIT_GRAPH=true /usr/bin/prove > $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_MULTI_PACK_INDEX=true /usr/bin/prove > $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_STASH_USE_BUILTIN=false /usr/bin/prove > $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> (cd t && GIT_TEST_CHECK_COLLISIONS=false /usr/bin/prove > >> $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh) > >> > >> Only this specific test failed. > > > > Well, good! > > > > Thank you for getting the ball rolling! Awesome find and fix. Thanks for chasing this down and CC'ing me so I am aware. > > Dscho
Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4
On 2/14/2019 5:14 AM, Duy Nguyen wrote: On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason wrote: Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the same most of the time. ctime should often be the same (or differs just slightly). And sometimes mtime is the same as well. st_ino is also always zero on Windows. We're storing a lot of duplicate values. Index v5 handles this This looks really promising. I was going to reply to Junio. But it turns out I underestimated "varint" encoding overhead and it increases read time too much. I might get back and try some optimization when I'm bored, but until then this is yet another failed experiment. As a result of this, v5 reduces file size from 30% (git.git) to 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file size is reduced by 63%! A 8.4MB index file is _almost_ acceptable. Just for kicks, I tried this out on a couple of repos I have handy. files version index size %savings 200k2 25,033,758 0.00% 3 25,033,758 0.00% 4 15,269,923 39.00% 5 9,759,844 61.01% 3m 2 446,123,848 0.00% 3 446,123,848 0.00% 4 249,631,640 44.04% 5 82,147,981 81.59% The 81% savings is very impressive. I didn't measure performance but not writing out an extra 167MB to disk has to help. I'm definitely also interested in your 'sparse index' format ideas as in our 3M repos, there are typically only a few thousand that don't have the skip-worktree bit set. I'm not sure if that is the same 'sparse' you had in mind but it would sure be nice! I've also contemplated multi-threading the index write code path. My thought was in the primary thread to allocate a buffer and when it is full have a background thread compute the SHA and write it to disk while the primary thread fills the next buffer. I'm not sure how much it will buy us as I don't know the relative cost of computing the SHA/writing to disk vs filling the buffer. I've suspected the filling the buffer thread would end up blocked on the background thread most of the time which is why I haven't tried it yet. Of course we trade off storage with cpu. We now need to spend more cycles writing or even reading (but still plenty fast compared to zlib). For reading, I'm counting on multi thread to hide away all this even if it becomes significant. This would be a bigger change, but have we/you ever done a POC experiment to see how much of this time is eaten up by zlib that wouldn't be eaten up with some of the newer "fast but good enough" compression algorithms, e.g. Snappy and Zstandard? I'm quite sure I tried zlib at some point, the only lasting impression I have is "not good enough". Other algorithms might improve a bit, perhaps on the uncompress/read side, but I find it unlikely we could reasonably compress like a hundred megabytes in a few dozen milliseconds (a quick google says Snappy compresses 250MB/s, so about 400ms per 100MB, too long). Splitting the files and compressing in parallel might help. But I will probably focus on "sparse index" approach before going that direction.
RE: [PATCH v2] read-cache: add post-indexchanged hook
> -Original Message- > From: Junio C Hamano On Behalf Of Junio C Hamano > Sent: Friday, February 15, 2019 12:50 PM > To: Ben Peart > Cc: Ramsay Jones ; git@vger.kernel.org; > Ben Peart ; Kevin Willford > ; sand...@crustytoothpaste.net > Subject: Re: [PATCH v2] read-cache: add post-indexchanged hook > > Ben Peart writes: > > > On 2/14/2019 3:33 PM, Junio C Hamano wrote: > >> Ramsay Jones writes: > >> > >>> On 14/02/2019 14:42, Ben Peart wrote: > >>>> From: Ben Peart > >>>> > >>>> Add a post-indexchanged hook that is invoked after the index is > >>>> written in > >>> > >>> s/post-indexchanged/post-index-changed/ > >> > >> Good. I wasn't paying close attention to the previous round, but is > >> that the only name-related bikeshedding? I somehow feel that without > >> s/changed/change/ the name does not roll well on my tongue and does > >> not sit well together with existing ones like post-receive (which is > >> not post-received). I dunno. > >> > >> Will queue. Thanks. > > > > Would you like me to submit another version with the above spelling > > corrections in the commit message or is it easier to fix it up > > yourself? > > I've already done s/indexchanged/index-changed/ before queuing (there > was only one IIRC in the log message), and also the 'optimize' typofix. > > I didn't do anything about dropping 'd' at the end, as I haven't heard any > feedback on that from anybody yet. > I'm ok with either. post-index-changed sounded clearer to me but you're right, none of the other hooks use the post tense. I've submitted one with 'post-index-change' - feel free to keep/user either. > >>>> do_write_locked_index(). > >>>> > >>>> This hook is meant primarily for notification, and cannot affect > >>>> the outcome of git commands that trigger the index write. > >>>> > >>>> The hook is passed a flag to indicate whether the working directory > >>>> was updated or not and a flag indicating if a skip-worktree bit > >>>> could have changed. These flags enable the hook to optmize its > >>>> response to the > >>> > >>> s/optmize/optimize/ > >>> > >>> ATB, > >>> Ramsay Jones
[PATCH v3] read-cache: add post-index-change hook
From: Ben Peart Add a post-index-change hook that is invoked after the index is written in do_write_locked_index(). This hook is meant primarily for notification, and cannot affect the outcome of git commands that trigger the index write. The hook is passed a flag to indicate whether the working directory was updated or not and a flag indicating if a skip-worktree bit could have changed. These flags enable the hook to optimize its response to the index change notification. Signed-off-by: Ben Peart --- Notes: Base Ref: v2.21.0-rc0 Web-Diff: https://github.com/benpeart/git/commit/27001af8db Checkout: git fetch https://github.com/benpeart/git post-index-changed-v3 && git checkout 27001af8db ### Interdiff (v2..v3): diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 94b4dadf30..bfb0be3659 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -492,7 +492,7 @@ This hook is invoked by `git-p4 submit`. It takes no parameters and nothing from standard input. Exiting with non-zero status from this script prevent `git-p4 submit` from launching. Run `git-p4 submit --help` for details. -post-index-changed +post-index-change ~ This hook is invoked when the index is written in read-cache.c diff --git a/read-cache.c b/read-cache.c index b6ead7bf8f..862bdf383d 100644 --- a/read-cache.c +++ b/read-cache.c @@ -3004,7 +3004,7 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l else ret = close_lock_file_gently(lock); - run_hook_le(NULL, "post-index-changed", + run_hook_le(NULL, "post-index-change", istate->updated_workdir ? "1" : "0", istate->updated_skipworktree ? "1" : "0", NULL); istate->updated_workdir = 0; diff --git a/t/t7113-post-index-changed-hook.sh b/t/t7113-post-index-change-hook.sh similarity index 95% rename from t/t7113-post-index-changed-hook.sh rename to t/t7113-post-index-change-hook.sh index 6231b88fca..f011ad7eec 100755 --- a/t/t7113-post-index-changed-hook.sh +++ b/t/t7113-post-index-change-hook.sh @@ -1,6 +1,6 @@ #!/bin/sh -test_description='post index changed hook' +test_description='post index change hook' . ./test-lib.sh @@ -14,7 +14,7 @@ test_expect_success 'setup' ' test_expect_success 'test status, add, commit, others trigger hook without flags set' ' mkdir -p .git/hooks && - write_script .git/hooks/post-index-changed <<-\EOF && + write_script .git/hooks/post-index-change <<-\EOF && if test "$1" -eq 1; then echo "Invalid combination of flags passed to hook; updated_workdir is set." >testfailure exit 1 @@ -59,7 +59,7 @@ test_expect_success 'test status, add, commit, others trigger hook without flags ' test_expect_success 'test checkout and reset trigger the hook' ' - write_script .git/hooks/post-index-changed <<-\EOF && + write_script .git/hooks/post-index-change <<-\EOF && if test "$1" -eq 1 && test "$2" -eq 1; then echo "Invalid combination of flags passed to hook; updated_workdir and updated_skipworktree are both set." >testfailure exit 1 @@ -102,7 +102,7 @@ test_expect_success 'test checkout and reset trigger the hook' ' ' test_expect_success 'test reset --mixed and update-index triggers the hook' ' - write_script .git/hooks/post-index-changed <<-\EOF && + write_script .git/hooks/post-index-change <<-\EOF && if test "$1" -eq 1 && test "$2" -eq 1; then echo "Invalid combination of flags passed to hook; updated_workdir and updated_skipworktree are both set." >testfailure exit 1 ### Patches Documentation/githooks.txt| 18 builtin/reset.c | 1 + builtin/update-index.c| 2 + cache.h | 4 +- read-cache.c | 14 ++- t/t7113-post-index-change-hook.sh | 144 ++ unpack-trees.c| 2 + 7 files changed, 182 insertions(+), 3 deletions(-) create mode 100755 t/t7113-post-index-change-hook.sh diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..bfb0be3659 100644 --- a/Documentation/githoo
Re: [PATCH v2] read-cache: add post-indexchanged hook
On 2/14/2019 3:33 PM, Junio C Hamano wrote: Ramsay Jones writes: On 14/02/2019 14:42, Ben Peart wrote: From: Ben Peart Add a post-indexchanged hook that is invoked after the index is written in s/post-indexchanged/post-index-changed/ Good. I wasn't paying close attention to the previous round, but is that the only name-related bikeshedding? I somehow feel that without s/changed/change/ the name does not roll well on my tongue and does not sit well together with existing ones like post-receive (which is not post-received). I dunno. Will queue. Thanks. Would you like me to submit another version with the above spelling corrections in the commit message or is it easier to fix it up yourself? do_write_locked_index(). This hook is meant primarily for notification, and cannot affect the outcome of git commands that trigger the index write. The hook is passed a flag to indicate whether the working directory was updated or not and a flag indicating if a skip-worktree bit could have changed. These flags enable the hook to optmize its response to the s/optmize/optimize/ ATB, Ramsay Jones
[PATCH v2] read-cache: add post-indexchanged hook
From: Ben Peart Add a post-indexchanged hook that is invoked after the index is written in do_write_locked_index(). This hook is meant primarily for notification, and cannot affect the outcome of git commands that trigger the index write. The hook is passed a flag to indicate whether the working directory was updated or not and a flag indicating if a skip-worktree bit could have changed. These flags enable the hook to optmize its response to the index changed notification. Signed-off-by: Ben Peart --- Notes: Base Ref: v2.21.0-rc0 Web-Diff: https://github.com/benpeart/git/commit/03b96ccbd5 Checkout: git fetch https://github.com/benpeart/git post-index-changed-v2 && git checkout 03b96ccbd5 ### Interdiff (v1..v2): diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 9349cd8900..94b4dadf30 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -492,7 +492,7 @@ This hook is invoked by `git-p4 submit`. It takes no parameters and nothing from standard input. Exiting with non-zero status from this script prevent `git-p4 submit` from launching. Run `git-p4 submit --help` for details. -post-indexchanged +post-index-changed ~ This hook is invoked when the index is written in read-cache.c diff --git a/read-cache.c b/read-cache.c index 0fcfa8a075..b6ead7bf8f 100644 --- a/read-cache.c +++ b/read-cache.c @@ -3004,7 +3004,7 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l else ret = close_lock_file_gently(lock); - run_hook_le(NULL, "post-indexchanged", + run_hook_le(NULL, "post-index-changed", istate->updated_workdir ? "1" : "0", istate->updated_skipworktree ? "1" : "0", NULL); istate->updated_workdir = 0; diff --git a/t/t7113-post-index-changed-hook.sh b/t/t7113-post-index-changed-hook.sh index 5aeb726e37..6231b88fca 100755 --- a/t/t7113-post-index-changed-hook.sh +++ b/t/t7113-post-index-changed-hook.sh @@ -14,7 +14,7 @@ test_expect_success 'setup' ' test_expect_success 'test status, add, commit, others trigger hook without flags set' ' mkdir -p .git/hooks && - write_script .git/hooks/post-indexchanged <<-\EOF && + write_script .git/hooks/post-index-changed <<-\EOF && if test "$1" -eq 1; then echo "Invalid combination of flags passed to hook; updated_workdir is set." >testfailure exit 1 @@ -59,7 +59,7 @@ test_expect_success 'test status, add, commit, others trigger hook without flags ' test_expect_success 'test checkout and reset trigger the hook' ' - write_script .git/hooks/post-indexchanged <<-\EOF && + write_script .git/hooks/post-index-changed <<-\EOF && if test "$1" -eq 1 && test "$2" -eq 1; then echo "Invalid combination of flags passed to hook; updated_workdir and updated_skipworktree are both set." >testfailure exit 1 @@ -102,7 +102,7 @@ test_expect_success 'test checkout and reset trigger the hook' ' ' test_expect_success 'test reset --mixed and update-index triggers the hook' ' - write_script .git/hooks/post-indexchanged <<-\EOF && + write_script .git/hooks/post-index-changed <<-\EOF && if test "$1" -eq 1 && test "$2" -eq 1; then echo "Invalid combination of flags passed to hook; updated_workdir and updated_skipworktree are both set." >testfailure exit 1 ### Patches Documentation/githooks.txt | 18 builtin/reset.c| 1 + builtin/update-index.c | 2 + cache.h| 4 +- read-cache.c | 14 ++- t/t7113-post-index-changed-hook.sh | 144 + unpack-trees.c | 2 + 7 files changed, 182 insertions(+), 3 deletions(-) create mode 100755 t/t7113-post-index-changed-hook.sh diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..94b4dadf30 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -492,6 +492,24 @@ This hook is invoked by `git-p4 submit`. It takes no parameters and nothing from standard input. Exiting with non-zero status from this script prevent `git-p4 submit` from launching. Run `git-p4 submit --help` for details. +post-inde
Re: [PATCH v1 1/3] read-cache: add post-indexchanged hook
On 2/8/2019 6:53 PM, brian m. carlson wrote: On Fri, Feb 08, 2019 at 02:51:13PM -0500, Ben Peart wrote: From: Ben Peart Add a post-indexchanged hook that is invoked after the index is written in do_write_locked_index(). This hook is meant primarily for notification, and cannot affect the outcome of git commands that trigger the index write. Signed-off-by: Ben Peart First, I think the tests should be merged into this commit. That's what we typically do. Happy to. In fact, I'd be happy to add the documentation as well and have a single commit. That's what _I'd_ typically do for something small like this. :) I'm also going to bikeshed slightly and suggest "post-index-changed", since we normally use dashes between words in our hook names. I can do that as well to help make it more consistent. diff --git a/cache.h b/cache.h index 27fe635f62..46eb862d3e 100644 --- a/cache.h +++ b/cache.h @@ -338,7 +338,9 @@ struct index_state { struct cache_time timestamp; unsigned name_hash_initialized : 1, initialized : 1, -drop_cache_tree : 1; +drop_cache_tree : 1, +updated_workdir : 1, +updated_skipworktree : 1; How important is it that we expose whether the skip-worktree bit is changed? I can understand if we expose the workdir is updated, since that's a thing a general user of this hook is likely to be interested in. However, I'm not sure that for a general-purpose hook, the skip-worktree bit is interesting. In our use case, we needed the skip-worktree flag because if something clears the skip-worktree bit on a file, we need to start paying attention to it in the work directory. This flag tells us that may have happened and enables us to not have to do the extra work for other index changed events that don't change the index without updating the working directory. Initially this was just to deal with 'reset --mixed' as it behaves differently with regards to updating the index and working directory than most other commands. However, the update-index command can also arbitrarily clear the skip-worktree bit so we renamed the flag to be more generic. diff --git a/read-cache.c b/read-cache.c index 0e0c93edc9..0fcfa8a075 100644 --- a/read-cache.c +++ b/read-cache.c @@ -17,6 +17,7 @@ #include "commit.h" #include "blob.h" #include "resolve-undo.h" +#include "run-command.h" #include "strbuf.h" #include "varint.h" #include "split-index.h" @@ -2999,8 +3000,17 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l if (ret) return ret; if (flags & COMMIT_LOCK) - return commit_locked_index(lock); - return close_lock_file_gently(lock); + ret = commit_locked_index(lock); + else + ret = close_lock_file_gently(lock); + + run_hook_le(NULL, "post-indexchanged", + istate->updated_workdir ? "1" : "0", + istate->updated_skipworktree ? "1" : "0", NULL); I have, in general, some concerns about this API. First, I think we need to consider that if we're going to expose various bits of information, we might in the future want to expose more such bits. If so, adding integer parameters is not likely to be a good way to do this. It's hard to remember and if a binary is used as the hook, it may not always handle additional arguments gracefully like shell scripts tend to. Binaries deal with a variable number of arguments all the time via `int argc, const char **argv` so this isn't a problem (we actually use a binary for this hook already). If we're not going to expose the skip-worktree bit, then I suppose one argument is fine. Otherwise, it might be better to expose key-value pairs on stdin instead, or something like that. I'm not sure what else we may want to add in the future; this is all we've needed for our uses. For now, I'd suggest we keep it simple and just pass them as command line parameters like we do with the other hooks. It's easy to add additional arguments in the future and if we ever get to where that is unwieldy, we can address it then (YAGNI). Finally, I have questions about performance. What's the overhead of determining whether the hook exists in this code path when there isn't one? Since the index is frequently used, and can be written out as an optimization by some commands, it would be nice to keep overhead low if the hook isn't present. If you ever hit this code path, we've just updated the index which means we read the index file, did an lstat() on every file in the repo plus various refs, config files, etc, and then wrote out a new index file. Adding one more test for a hooks existence doesn't have any measurable impact. Thank you for the feedback!
[PATCH v1 2/3] read-cache: add test for post-indexchanged hook
From: Ben Peart Test the new post-indexchanged hook and ensure it is triggered and passes the correct flags for various git commands. Signed-off-by: Ben Peart --- t/t7113-post-index-changed-hook.sh | 144 + 1 file changed, 144 insertions(+) create mode 100755 t/t7113-post-index-changed-hook.sh diff --git a/t/t7113-post-index-changed-hook.sh b/t/t7113-post-index-changed-hook.sh new file mode 100755 index 00..5aeb726e37 --- /dev/null +++ b/t/t7113-post-index-changed-hook.sh @@ -0,0 +1,144 @@ +#!/bin/sh + +test_description='post index changed hook' + +. ./test-lib.sh + +test_expect_success 'setup' ' + mkdir -p dir1 && + touch dir1/file1.txt && + echo testing >dir1/file2.txt && + git add . && + git commit -m "initial" +' + +test_expect_success 'test status, add, commit, others trigger hook without flags set' ' + mkdir -p .git/hooks && + write_script .git/hooks/post-indexchanged <<-\EOF && + if test "$1" -eq 1; then + echo "Invalid combination of flags passed to hook; updated_workdir is set." >testfailure + exit 1 + fi + if test "$2" -eq 1; then + echo "Invalid combination of flags passed to hook; updated_skipworktree is set." >testfailure + exit 1 + fi + if test -f ".git/index.lock"; then + echo ".git/index.lock exists" >testfailure + exit 3 + fi + if ! test -f ".git/index"; then + echo ".git/index does not exist" >testfailure + exit 3 + fi + echo "success" >testsuccess + EOF + mkdir -p dir2 && + touch dir2/file1.txt && + touch dir2/file2.txt && + : force index to be dirty && + test-tool chmtime +60 dir1/file1.txt && + git status && + test_path_is_file testsuccess && rm -f testsuccess && + test_path_is_missing testfailure && + git add . && + test_path_is_file testsuccess && rm -f testsuccess && + test_path_is_missing testfailure && + git commit -m "second" && + test_path_is_file testsuccess && rm -f testsuccess && + test_path_is_missing testfailure && + git checkout -- dir1/file1.txt && + test_path_is_file testsuccess && rm -f testsuccess && + test_path_is_missing testfailure && + git update-index && + test_path_is_missing testsuccess && + test_path_is_missing testfailure && + git reset --soft && + test_path_is_missing testsuccess && + test_path_is_missing testfailure +' + +test_expect_success 'test checkout and reset trigger the hook' ' + write_script .git/hooks/post-indexchanged <<-\EOF && + if test "$1" -eq 1 && test "$2" -eq 1; then + echo "Invalid combination of flags passed to hook; updated_workdir and updated_skipworktree are both set." >testfailure + exit 1 + fi + if test "$1" -eq 0 && test "$2" -eq 0; then + echo "Invalid combination of flags passed to hook; neither updated_workdir or updated_skipworktree are set." >testfailure + exit 2 + fi + if test "$1" -eq 1; then + if test -f ".git/index.lock"; then + echo "updated_workdir set but .git/index.lock exists" >testfailure + exit 3 + fi + if ! test -f ".git/index"; then + echo "updated_workdir set but .git/index does not exist" >testfailure + exit 3 + fi + else + echo "update_workdir should be set for checkout" >testfailure + exit 4 + fi + echo "success" >testsuccess + EOF + : force index to be dirty && + test-tool chmtime +60 dir1/file1.txt && + git checkout master && + test_path_is_file testsuccess && rm -f testsuccess && + test_path_is_missing testfailure && +
[PATCH v1 0/3] Add post-indexchanged hook
From: Ben Peart Add a post-indexchanged hook that is invoked after the index is written in do_write_locked_index(). This hook is meant primarily for notification, and cannot affect the outcome of git commands that trigger the index write. VFSForGit uses this hook to be notified when a git command has made a change that could impact the virtual files projected in the working directory. I'm submitting this in an effort to further minimize the set of differences between the VFSForGit fork and git.git in the hope that we can someday not need a separate fork at all. Base Ref: v2.21.0-rc0 Web-Diff: https://github.com/benpeart/git/commit/639e57486a Checkout: git fetch https://github.com/benpeart/git post-index-changed-v1 && git checkout 639e57486a Ben Peart (2): read-cache: add post-indexchanged hook read-cache: add test for post-indexchanged hook Kevin Willford (1): read-cache: Add documentation for the post-indexchanged hook Documentation/githooks.txt | 18 builtin/reset.c| 1 + builtin/update-index.c | 2 + cache.h| 4 +- read-cache.c | 14 ++- t/t7113-post-index-changed-hook.sh | 144 + unpack-trees.c | 2 + 7 files changed, 182 insertions(+), 3 deletions(-) create mode 100755 t/t7113-post-index-changed-hook.sh base-commit: d62dad7a7dca3f6a65162bf0e52cdf6927958e78 -- 2.20.1.windows.1
[PATCH v1 1/3] read-cache: add post-indexchanged hook
From: Ben Peart Add a post-indexchanged hook that is invoked after the index is written in do_write_locked_index(). This hook is meant primarily for notification, and cannot affect the outcome of git commands that trigger the index write. Signed-off-by: Ben Peart --- builtin/reset.c| 1 + builtin/update-index.c | 2 ++ cache.h| 4 +++- read-cache.c | 14 -- unpack-trees.c | 2 ++ 5 files changed, 20 insertions(+), 3 deletions(-) diff --git a/builtin/reset.c b/builtin/reset.c index 4d18a461fa..e173afcaac 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -380,6 +380,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; + the_index.updated_skipworktree = 1; if (!quiet && get_git_work_tree()) { uint64_t t_begin, t_delta_in_ms; diff --git a/builtin/update-index.c b/builtin/update-index.c index 02ace602b9..cf731640fa 100644 --- a/builtin/update-index.c +++ b/builtin/update-index.c @@ -1071,6 +1071,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) if (entries < 0) die("cache corrupted"); + the_index.updated_skipworktree = 1; + /* * Custom copy of parse_options() because we want to handle * filename arguments as they come. diff --git a/cache.h b/cache.h index 27fe635f62..46eb862d3e 100644 --- a/cache.h +++ b/cache.h @@ -338,7 +338,9 @@ struct index_state { struct cache_time timestamp; unsigned name_hash_initialized : 1, initialized : 1, -drop_cache_tree : 1; +drop_cache_tree : 1, +updated_workdir : 1, +updated_skipworktree : 1; struct hashmap name_hash; struct hashmap dir_hash; struct object_id oid; diff --git a/read-cache.c b/read-cache.c index 0e0c93edc9..0fcfa8a075 100644 --- a/read-cache.c +++ b/read-cache.c @@ -17,6 +17,7 @@ #include "commit.h" #include "blob.h" #include "resolve-undo.h" +#include "run-command.h" #include "strbuf.h" #include "varint.h" #include "split-index.h" @@ -2999,8 +3000,17 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l if (ret) return ret; if (flags & COMMIT_LOCK) - return commit_locked_index(lock); - return close_lock_file_gently(lock); + ret = commit_locked_index(lock); + else + ret = close_lock_file_gently(lock); + + run_hook_le(NULL, "post-indexchanged", + istate->updated_workdir ? "1" : "0", + istate->updated_skipworktree ? "1" : "0", NULL); + istate->updated_workdir = 0; + istate->updated_skipworktree = 0; + + return ret; } static int write_split_index(struct index_state *istate, diff --git a/unpack-trees.c b/unpack-trees.c index 3563daae1a..8665a4a7c0 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -1637,6 +1637,8 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options WRITE_TREE_SILENT | WRITE_TREE_REPAIR); } + + o->result.updated_workdir = 1; discard_index(o->dst_index); *o->dst_index = o->result; } else { -- 2.20.1.windows.1
[PATCH v1 3/3] read-cache: Add documentation for the post-indexchanged hook
From: Kevin Willford Document the new post-indexchanged hook with information on when it is called as well as the flags passed and what each of them mean. Signed-off-by: Kevin Willford Signed-off-by: Ben Peart --- Documentation/githooks.txt | 18 ++ 1 file changed, 18 insertions(+) diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..9349cd8900 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -492,6 +492,24 @@ This hook is invoked by `git-p4 submit`. It takes no parameters and nothing from standard input. Exiting with non-zero status from this script prevent `git-p4 submit` from launching. Run `git-p4 submit --help` for details. +post-indexchanged +~ + +This hook is invoked when the index is written in read-cache.c +do_write_locked_index. + +The first parameter passed to the hook is the indicator for the +working directory being updated. "1" meaning working directory +was updated or "0" when the working directory was not updated. + +The second parameter passed to the hook is the indicator for whether +or not the index was updated and the skip-worktree bit could have +changed. "1" meaning skip-worktree bits could have been updated +and "0" meaning they were not. + +Only one parameter should be set to "1" when the hook runs. The hook +running passing "1", "1" should not be possible. + GIT --- Part of the linkgit:git[1] suite -- 2.20.1.windows.1
Re: [PATCH v2] teach git to support a virtual (partially populated) work directory
Ping. Any thoughts, comments, feedback, suggestions? On 12/13/2018 2:41 PM, Ben Peart wrote: From: Ben Peart To make git perform well on the very largest repos, we must make git operations O(modified) instead of O(size of repo). This takes advantage of the fact that the number of files a developer has modified (especially in very large repos) is typically a tiny fraction of the overall repo size. We accomplished this by utilizing the existing internal logic for the skip worktree bit and excludes to tell git to ignore all files and folders other than those that have been modified. This logic is driven by an external process that monitors writes to the repo and communicates the list of files and folders with changes to git via the virtual work directory hook in this patch. The external process maintains a list of files and folders that have been modified. When git runs, it requests the list of files and folders that have been modified via the virtual work directory hook. Git then sets/clears the skip-worktree bit on the cache entries and builds a hashmap of the modified files/folders that is used by the excludes logic to avoid scanning the entire repo looking for changes and untracked files. With this system, we have been able to make local git command performance on extremely large repos (millions of files, 1/2 million folders) entirely manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second commit, etc). On index load, clear/set the skip worktree bits based on the virtual work directory data. Use virtual work directory data to update skip-worktree bit in unpack-trees. Use virtual work directory data to exclude files and folders not explicitly requested. Signed-off-by: Ben Peart --- Notes: Base Ref: v2.20.0 Web-Diff: https://github.com/benpeart/git/commit/acc00a41af Checkout: git fetch https://github.com/benpeart/git virtual-workdir-v2 && git checkout acc00a41af ### Patches Documentation/config/core.txt | 9 + Documentation/githooks.txt| 23 ++ Makefile | 1 + cache.h | 1 + config.c | 32 ++- config.h | 1 + dir.c | 26 ++- environment.c | 1 + read-cache.c | 2 + t/t1092-virtualworkdir.sh | 390 ++ unpack-trees.c| 23 +- virtualworkdir.c | 314 +++ virtualworkdir.h | 25 +++ 13 files changed, 840 insertions(+), 8 deletions(-) create mode 100755 t/t1092-virtualworkdir.sh create mode 100644 virtualworkdir.c create mode 100644 virtualworkdir.h diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index d0e6635fe0..49b7699a4e 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -68,6 +68,15 @@ core.fsmonitor:: avoiding unnecessary processing of files that have not changed. See the "fsmonitor-watchman" section of linkgit:githooks[5]. +core.virtualWorkDir:: + Please regard this as an experimental feature. + If set to true, utilize the virtual-work-dir hook to identify all + files and directories that are present in the working directory. + Git will only track and update files listed in the virtual work + directory. Using the virtual work directory will supersede the + sparse-checkout settings which will be ignored. + See the "virtual-work-dir" section of linkgit:githooks[6]. + core.trustctime:: If false, the ctime differences between the index and the working tree are ignored; useful when the inode change time diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..9888d504b4 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -485,6 +485,29 @@ The exit status determines whether git will use the data from the hook to limit its search. On error, it will fall back to verifying all files and folders. +virtual-work-dir + + +Please regard this as an experimental feature. + +The "Virtual Work Directory" hook allows populating the working directory +sparsely. The virtual work directory data is typically automatically +generated by an external process. Git will limit what files it checks for +changes as well as which directories are checked for untracked files based +on the path names given. Git will also only update those files listed in the +virtual work directory. + +The hook is invoked when the configuration option core.virtualWorkDir is +set to true. The hook takes one argument, a version (currently 1). + +The hook should output to stdout the list of all files in the working +directory that git should track. The paths are relative to the root +of the working directory and are separated by a single NUL. Full paths +(
[PATCH v3 1/2] checkout: add test demonstrating regression with checkout -b on initial commit
From: Ben Peart Commit fa655d8411 (checkout: optimize "git checkout -b ", 2018-08-16) introduced an unintentional change in behavior for 'checkout -b' after doing 'clone --no-checkout'. Add a test to demonstrate the changed behavior to be used in a later patch to verify the fix. Signed-off-by: Ben Peart --- t/t2018-checkout-branch.sh | 9 + 1 file changed, 9 insertions(+) diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index 2131fb2a56..6da2d4e68f 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -198,4 +198,13 @@ test_expect_success 'checkout -B to the current branch works' ' test_dirty_mergeable ' +test_expect_failure 'checkout -b after clone --no-checkout does a checkout of HEAD' ' + git init src && + test_commit -C src a && + rev="$(git -C src rev-parse HEAD)" && + git clone --no-checkout src dest && + git -C dest checkout "$rev" -b branch && + test_path_is_file dest/a.t +' + test_done -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v3 0/2] Fix regression in checkout -b
From: Ben Peart Minor update to comment from V2. Also wrapped commit messages to be <80 chars wide. Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/fef76edbdc Checkout: git fetch https://github.com/benpeart/git initial-checkout-v3 && git checkout fef76edbdc ### Interdiff (v2..v3): diff --git a/builtin/checkout.c b/builtin/checkout.c index 9c6e94319e..9f8f3466f6 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -593,8 +593,9 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, */ /* - * Do the merge if this is the initial checkout - * + * Do the merge if this is the initial checkout. We cannot use + * is_cache_unborn() here because the index hasn't been loaded yet + * so cache_nr and timestamp.sec are always zero. */ if (!file_exists(get_index_file())) return 0; ### Patches Ben Peart (2): checkout: add test demonstrating regression with checkout -b on initial commit checkout: fix regression in checkout -b on intitial checkout builtin/checkout.c | 8 t/t2018-checkout-branch.sh | 9 + 2 files changed, 17 insertions(+) base-commit: 77556354bb7ac50450e3b28999e3576969869068 -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v3 2/2] checkout: fix regression in checkout -b on intitial checkout
From: Ben Peart When doing a 'checkout -b' do a full checkout including updating the working tree when doing the initial checkout. As the new test involves an filesystem access, do it later in the sequence to give chance to other cheaper tests to leave early. This fixes the regression in behavior caused by fa655d8411 (checkout: optimize "git checkout -b ", 2018-08-16). Signed-off-by: Ben Peart --- builtin/checkout.c | 8 t/t2018-checkout-branch.sh | 2 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/builtin/checkout.c b/builtin/checkout.c index 6fadf412e8..9f8f3466f6 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -592,6 +592,14 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, * Remaining variables are not checkout options but used to track state */ +/* + * Do the merge if this is the initial checkout. We cannot use + * is_cache_unborn() here because the index hasn't been loaded yet + * so cache_nr and timestamp.sec are always zero. + */ + if (!file_exists(get_index_file())) + return 0; + return 1; } diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index 6da2d4e68f..c5014ad9a6 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -198,7 +198,7 @@ test_expect_success 'checkout -B to the current branch works' ' test_dirty_mergeable ' -test_expect_failure 'checkout -b after clone --no-checkout does a checkout of HEAD' ' +test_expect_success 'checkout -b after clone --no-checkout does a checkout of HEAD' ' git init src && test_commit -C src a && rev="$(git -C src rev-parse HEAD)" && -- 2.19.1.gvfs.1.16.g9d1374d
Re: [PATCH v2 0/2] Fix regression in checkout -b
On 1/22/2019 1:54 PM, Junio C Hamano wrote: Ben Peart writes: diff --git a/builtin/checkout.c b/builtin/checkout.c index af6b5c8336..9c6e94319e 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -517,12 +517,6 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, if (core_apply_sparse_checkout && !checkout_optimize_new_branch) return 0; - /* -* We must do the merge if this is the initial checkout -*/ - if (is_cache_unborn()) - return 0; - /* * We must do the merge if we are actually moving to a new commit. */ @@ -598,6 +592,13 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, * Remaining variables are not checkout options but used to track state */ + /* + * Do the merge if this is the initial checkout + * + */ + if (!file_exists(get_index_file())) + return 0; + return 1; } This is curious. The location the new special case is added is different, and the way the new special case is detected is also different, between v1 and v2. Are both of them significant? IOW, if we moved the check down but kept using is_cache_unborn(), would it break? Or if we did not move the check but switched to check the index file on the filesystem instead of calling is_cache_unborn(), would it break? I had to change the check to not use is_cache_unborn() because at this point, the index has not been loaded so cache_nr and timestamp.sec are always zero (thus defeating the entire optimization). Since part of the optimization was to avoid loading the index when it isn't necessary, the only replacement I could think of was to check for the existence of the index file as if it is missing entirely, it is clearly unborn. This solved the behavior change for the --no-checkout sequence reported. The only reason I moved it lower in the function was a micro perf optimization. Since file_exists() does file I/O, I thought I'd do all the in memory/flag checks first in case they drop out early and we can avoid the unnecessary file I/O. As long as it is tested before the 'return 1;' call, it is logically correct. There are three existing callers of is_{cache,index}_unborn(), all of which want to use it to decide if we are in this funny "unborn" state. If this fixes the issue we saw in v1 of these two patches, does that mean these three existing callers also are buggy in the same way and we are better off rewriting is_index_unborn() to see if the index file is on the disk? It is just the fact that I needed to check for an unborn index _before_ it was loaded that makes me unable to use is_{cache,index}_unborn() here. The other callers should still be fine. I could add a comment in the code to clarify this if you think it will cause confusion later. I am *not* suggesting to make such a drastic change to the existing system. I am wondering why they are working fine but only this new code has to avoid the existing is_index_unborn() logic and go directly to the filesystem. Especially as this new exception added to "skip-merge-working-tree" is to allow the special case code in merge-working-tree that depends on is_cache_unborn() to trigger. Thanks for working on this.
[PATCH v2 2/2] checkout: fix regression in checkout -b on intitial checkout
From: Ben Peart When doing a 'checkout -b' do a full checkout including updating the working tree when doing the initial checkout. This fixes the regression in behavior caused by fa655d8411 (checkout: optimize "git checkout -b ", 2018-08-16) Signed-off-by: Ben Peart --- builtin/checkout.c | 7 +++ t/t2018-checkout-branch.sh | 2 +- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/builtin/checkout.c b/builtin/checkout.c index 6fadf412e8..9c6e94319e 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -592,6 +592,13 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, * Remaining variables are not checkout options but used to track state */ +/* + * Do the merge if this is the initial checkout + * + */ + if (!file_exists(get_index_file())) + return 0; + return 1; } diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index 6da2d4e68f..c5014ad9a6 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -198,7 +198,7 @@ test_expect_success 'checkout -B to the current branch works' ' test_dirty_mergeable ' -test_expect_failure 'checkout -b after clone --no-checkout does a checkout of HEAD' ' +test_expect_success 'checkout -b after clone --no-checkout does a checkout of HEAD' ' git init src && test_commit -C src a && rev="$(git -C src rev-parse HEAD)" && -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v2 0/2] Fix regression in checkout -b
From: Ben Peart The optimized `checkout -b` doesn�t typically create/update the index and working directory. Add a new test to detect the case when the call to `checkout -b` is the first call after doing a `clone --no-checkout` and no index exists. In this specific case, well now make the call to merge_working_tree() which will create the index and update the working directory. Also simplify and update the test cases based on the feedback provided by Szeder and Junio. A shout out to Johannes who diagnosed, fixed and patched a bug that was preventing me from running the git test suite in the master branch. Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/55dd8602f5 Checkout: git fetch https://github.com/benpeart/git initial-checkout-v2 && git checkout 55dd8602f5 ### Interdiff (v1..v2): diff --git a/builtin/checkout.c b/builtin/checkout.c index af6b5c8336..9c6e94319e 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -517,12 +517,6 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, if (core_apply_sparse_checkout && !checkout_optimize_new_branch) return 0; - /* -* We must do the merge if this is the initial checkout -*/ - if (is_cache_unborn()) - return 0; - /* * We must do the merge if we are actually moving to a new commit. */ @@ -598,6 +592,13 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, * Remaining variables are not checkout options but used to track state */ +/* + * Do the merge if this is the initial checkout + * + */ + if (!file_exists(get_index_file())) + return 0; + return 1; } diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index c438889b0c..c5014ad9a6 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -200,13 +200,11 @@ test_expect_success 'checkout -B to the current branch works' ' test_expect_success 'checkout -b after clone --no-checkout does a checkout of HEAD' ' git init src && - echo hi > src/a && - git -C src add . && - git -C src commit -m "initial commit" && + test_commit -C src a && rev="$(git -C src rev-parse HEAD)" && git clone --no-checkout src dest && git -C dest checkout "$rev" -b branch && - test -f dest/a + test_path_is_file dest/a.t ' test_done ### Patches Ben Peart (2): checkout: add test to demonstrate regression with checkout -b on initial commit checkout: fix regression in checkout -b on intitial checkout builtin/checkout.c | 7 +++ t/t2018-checkout-branch.sh | 9 + 2 files changed, 16 insertions(+) base-commit: 77556354bb7ac50450e3b28999e3576969869068 -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v2 1/2] checkout: add test to demonstrate regression with checkout -b on initial commit
From: Ben Peart Commit fa655d8411 (checkout: optimize "git checkout -b ", 2018-08-16) introduced an unintentional change in behavior for 'checkout -b' after doing 'clone --no-checkout'. Add a test to demonstrate the changed behavior to be used in a later patch to verify the fix. Signed-off-by: Ben Peart --- t/t2018-checkout-branch.sh | 9 + 1 file changed, 9 insertions(+) diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index 2131fb2a56..6da2d4e68f 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -198,4 +198,13 @@ test_expect_success 'checkout -B to the current branch works' ' test_dirty_mergeable ' +test_expect_failure 'checkout -b after clone --no-checkout does a checkout of HEAD' ' + git init src && + test_commit -C src a && + rev="$(git -C src rev-parse HEAD)" && + git clone --no-checkout src dest && + git -C dest checkout "$rev" -b branch && + test_path_is_file dest/a.t +' + test_done -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v1 2/2] checkout: fix regression in checkout -b on intitial checkout
From: Ben Peart When doing a 'checkout -b' do a full checkout including updating the working tree when doing the initial checkout. This fixes the regression in behavior caused by fa655d8411 checkout: optimize "git checkout -b " Signed-off-by: Ben Peart --- builtin/checkout.c | 6 ++ t/t2018-checkout-branch.sh | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/builtin/checkout.c b/builtin/checkout.c index 6fadf412e8..af6b5c8336 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -517,6 +517,12 @@ static int skip_merge_working_tree(const struct checkout_opts *opts, if (core_apply_sparse_checkout && !checkout_optimize_new_branch) return 0; + /* +* We must do the merge if this is the initial checkout +*/ + if (is_cache_unborn()) + return 0; + /* * We must do the merge if we are actually moving to a new commit. */ diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index 35999b3adb..c438889b0c 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -206,7 +206,7 @@ test_expect_success 'checkout -b after clone --no-checkout does a checkout of HE rev="$(git -C src rev-parse HEAD)" && git clone --no-checkout src dest && git -C dest checkout "$rev" -b branch && - test_must_fail test -f dest/a + test -f dest/a ' test_done -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v1 0/2] Fix regression in checkout -b
From: Ben Peart Anthony Sottile determined that commit fa655d8411 "checkout: optimize "git checkout -b " introduced an unintentional change in behavior for 'checkout -b' after doing a 'clone --no-checkout'. Create a test to demonstrate the regression then fix the bug and update the test to demonstrate the fix. Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/3930f7fa32 Checkout: git fetch https://github.com/benpeart/git initial-checkout-v1 && git checkout 3930f7fa32 Ben Peart (2): checkout: add test to demonstrate regression with checkout -b on initial commit checkout: fix regression in checkout -b on intitial checkout builtin/checkout.c | 6 ++ t/t2018-checkout-branch.sh | 11 +++ 2 files changed, 17 insertions(+) base-commit: 77556354bb7ac50450e3b28999e3576969869068 -- 2.19.1.gvfs.1.16.g9d1374d
[PATCH v1 1/2] checkout: add test to demonstrate regression with checkout -b on initial commit
From: Ben Peart Commit fa655d8411 checkout: optimize "git checkout -b " introduced an unintentional change in behavior for 'checkout -b' after doing a 'clone --no-checkout'. Add a test to demonstrate the changed behavior to be used in a later patch to verify the fix. Signed-off-by: Ben Peart --- t/t2018-checkout-branch.sh | 11 +++ 1 file changed, 11 insertions(+) diff --git a/t/t2018-checkout-branch.sh b/t/t2018-checkout-branch.sh index 2131fb2a56..35999b3adb 100755 --- a/t/t2018-checkout-branch.sh +++ b/t/t2018-checkout-branch.sh @@ -198,4 +198,15 @@ test_expect_success 'checkout -B to the current branch works' ' test_dirty_mergeable ' +test_expect_success 'checkout -b after clone --no-checkout does a checkout of HEAD' ' + git init src && + echo hi > src/a && + git -C src add . && + git -C src commit -m "initial commit" && + rev="$(git -C src rev-parse HEAD)" && + git clone --no-checkout src dest && + git -C dest checkout "$rev" -b branch && + test_must_fail test -f dest/a +' + test_done -- 2.19.1.gvfs.1.16.g9d1374d
Re: Regression `git checkout $rev -b branch` while in a `--no-checkout` clone does not check out files
On 1/3/2019 5:05 PM, Anthony Sottile wrote: On Thu, Jan 3, 2019 at 1:51 PM Junio C Hamano wrote: Anthony Sottile writes: On Thu, Jan 3, 2019 at 12:26 PM Junio C Hamano wrote: A "fix" to Ben's optimization for this particular case should be fairly straight-forward. I think we have a special case in the checkout codepath for an initial checkout and disable "carry forward the fact that the user wanted all the paths removed", so it would be the matter of adding yet another condition (is_cache_unborn(), which is used to set topts.initial_checkout) to the large collection of conditions in skip_merge_working_tree(). I think it might be simpler than that even -- the optimization treats the following as equivalent when the current checked out revision is deadbeef (even if the index / worktree differ), when before they were not: - git checkout -b newbranch - git checkout deadbeef -b newbranch If a revision is specified on the commandline it should be checked out. If it were to be a "fix", the exact same command line as people used to be able to use, i.e. "git checkout -b newbranch", should be made to do what it used to do. Forcing users to use a different command to workaround the bug is not a usable "fix". If we want a working workaround, you can tell your users to use git reset --hard HEAD && git checkout -b newbranch and that would already work without any code change ;-). Just noticed this thread. I agree that the behavior of `git clone --no-checkout` is a little odd in that it shows everything as deleted but the goal of the `checkout -b` optimization was to not change behavior (unless the user opt-ed in to the changed behavior via checkout.optimizeNewBranch). I'll work on a patch to detect this case and ensure the default behavior doesn't change. oh wow, I didn't realize `git checkout -b newbranch` also used to reset the `--no-checkout` state, yeah you're right the optimization is way more problematic than I had considered. I'm working around by not using `--no-checkout` personally Anthony
[PATCH v2] teach git to support a virtual (partially populated) work directory
From: Ben Peart To make git perform well on the very largest repos, we must make git operations O(modified) instead of O(size of repo). This takes advantage of the fact that the number of files a developer has modified (especially in very large repos) is typically a tiny fraction of the overall repo size. We accomplished this by utilizing the existing internal logic for the skip worktree bit and excludes to tell git to ignore all files and folders other than those that have been modified. This logic is driven by an external process that monitors writes to the repo and communicates the list of files and folders with changes to git via the virtual work directory hook in this patch. The external process maintains a list of files and folders that have been modified. When git runs, it requests the list of files and folders that have been modified via the virtual work directory hook. Git then sets/clears the skip-worktree bit on the cache entries and builds a hashmap of the modified files/folders that is used by the excludes logic to avoid scanning the entire repo looking for changes and untracked files. With this system, we have been able to make local git command performance on extremely large repos (millions of files, 1/2 million folders) entirely manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second commit, etc). On index load, clear/set the skip worktree bits based on the virtual work directory data. Use virtual work directory data to update skip-worktree bit in unpack-trees. Use virtual work directory data to exclude files and folders not explicitly requested. Signed-off-by: Ben Peart --- Notes: Base Ref: v2.20.0 Web-Diff: https://github.com/benpeart/git/commit/acc00a41af Checkout: git fetch https://github.com/benpeart/git virtual-workdir-v2 && git checkout acc00a41af ### Patches Documentation/config/core.txt | 9 + Documentation/githooks.txt| 23 ++ Makefile | 1 + cache.h | 1 + config.c | 32 ++- config.h | 1 + dir.c | 26 ++- environment.c | 1 + read-cache.c | 2 + t/t1092-virtualworkdir.sh | 390 ++ unpack-trees.c| 23 +- virtualworkdir.c | 314 +++ virtualworkdir.h | 25 +++ 13 files changed, 840 insertions(+), 8 deletions(-) create mode 100755 t/t1092-virtualworkdir.sh create mode 100644 virtualworkdir.c create mode 100644 virtualworkdir.h diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index d0e6635fe0..49b7699a4e 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -68,6 +68,15 @@ core.fsmonitor:: avoiding unnecessary processing of files that have not changed. See the "fsmonitor-watchman" section of linkgit:githooks[5]. +core.virtualWorkDir:: + Please regard this as an experimental feature. + If set to true, utilize the virtual-work-dir hook to identify all + files and directories that are present in the working directory. + Git will only track and update files listed in the virtual work + directory. Using the virtual work directory will supersede the + sparse-checkout settings which will be ignored. + See the "virtual-work-dir" section of linkgit:githooks[6]. + core.trustctime:: If false, the ctime differences between the index and the working tree are ignored; useful when the inode change time diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..9888d504b4 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -485,6 +485,29 @@ The exit status determines whether git will use the data from the hook to limit its search. On error, it will fall back to verifying all files and folders. +virtual-work-dir + + +Please regard this as an experimental feature. + +The "Virtual Work Directory" hook allows populating the working directory +sparsely. The virtual work directory data is typically automatically +generated by an external process. Git will limit what files it checks for +changes as well as which directories are checked for untracked files based +on the path names given. Git will also only update those files listed in the +virtual work directory. + +The hook is invoked when the configuration option core.virtualWorkDir is +set to true. The hook takes one argument, a version (currently 1). + +The hook should output to stdout the list of all files in the working +directory that git should track. The paths are relative to the root +of the working directory and are separated by a single NUL. Full paths +('dir1/a.txt') as well as directories are supported (ie 'dir1/'). + +The exit status determines whether git w
Re: [PATCH v1] teach git to support a virtual (partially populated) work directory
On 11/28/2018 8:31 AM, SZEDER Gábor wrote: On Tue, Nov 27, 2018 at 02:50:57PM -0500, Ben Peart wrote: diff --git a/t/t1092-virtualworkdir.sh b/t/t1092-virtualworkdir.sh new file mode 100755 index 00..0cdfe9b362 --- /dev/null +++ b/t/t1092-virtualworkdir.sh @@ -0,0 +1,393 @@ +#!/bin/sh + +test_description='virtual work directory tests' + +. ./test-lib.sh + +# We need total control of the virtual work directory hook +sane_unset GIT_TEST_VIRTUALWORKDIR + +clean_repo () { + rm .git/index && + git -c core.virtualworkdir=false reset --hard HEAD && + git -c core.virtualworkdir=false clean -fd && + touch untracked.txt && We would usually run '>untracked.txt' instead, sparing the external process. A further nit is that a function called 'clean_repo' creates new untracked files... Thanks, all good suggestions I've incorporated for the next iteration. + touch dir1/untracked.txt && + touch dir2/untracked.txt +} + +test_expect_success 'setup' ' + mkdir -p .git/hooks/ && + cat > .gitignore <<-\EOF && CodingGuidelines suggest no space between redirection operator and filename. + .gitignore + expect* + actual* + EOF + touch file1.txt && + touch file2.txt && + mkdir -p dir1 && + touch dir1/file1.txt && + touch dir1/file2.txt && + mkdir -p dir2 && + touch dir2/file1.txt && + touch dir2/file2.txt && + git add . && + git commit -m "initial" && + git config --local core.virtualworkdir true +' +test_expect_success 'verify files not listed are ignored by git clean -f -x' ' + clean_repo && I find it odd to clean the repo right after setting it up; but then again, 'clean_repo' not only cleans, but also creates new files. Perhaps rename it to 'reset_repo'? Dunno. + write_script .git/hooks/virtual-work-dir <<-\EOF && + printf "untracked.txt\0" + printf "dir1/\0" + EOF + mkdir -p dir3 && + touch dir3/untracked.txt && + git clean -f -x && + test -f file1.txt && Please use the 'test_path_is_file', ... + test -f file2.txt && + test ! -f untracked.txt && ... 'test_path_is_missing', and ... + test -d dir1 && ... 'test_path_is_dir' helpers, respectively, because they print informative error messages on failure. + test -f dir1/file1.txt && + test -f dir1/file2.txt && + test ! -f dir1/untracked.txt && + test -f dir2/file1.txt && + test -f dir2/file2.txt && + test -f dir2/untracked.txt && + test -d dir3 && + test -f dir3/untracked.txt +'
Re: [PATCH v1] mem_pool: add GIT_TRACE_MEMPOOL support
On 11/28/2018 4:37 AM, Johannes Schindelin wrote: Hi Ben, On Tue, 27 Nov 2018, Ben Peart wrote: From: Ben Peart Add tracing around initializing and discarding mempools. In discard report on the amount of memory unused in the current block to help tune setting the initial_size. Signed-off-by: Ben Peart --- Looks good. My only question: should we also trace calls to _alloc(), _calloc() and _combine()? I was trying to tune the initial size in my use of the mem_pool and so found this tracing useful to see how much memory was actually being used. I'm inclined to only add tracing as it is needed rather that proactively because we think it _might_ be needed. I suspect _alloc() and _calloc() would get very noisy and not add much value. Ciao, Johannes Notes: Base Ref: * git-trace-mempool Web-Diff: https://github.com/benpeart/git/commit/9ac84bbca2 Checkout: git fetch https://github.com/benpeart/git git-trace-mempool-v1 && git checkout 9ac84bbca2 mem-pool.c | 5 + 1 file changed, 5 insertions(+) diff --git a/mem-pool.c b/mem-pool.c index a2841a4a9a..065389aaec 100644 --- a/mem-pool.c +++ b/mem-pool.c @@ -5,6 +5,7 @@ #include "cache.h" #include "mem-pool.h" +static struct trace_key trace_mem_pool = TRACE_KEY_INIT(MEMPOOL); #define BLOCK_GROWTH_SIZE 1024*1024 - sizeof(struct mp_block); /* @@ -48,12 +49,16 @@ void mem_pool_init(struct mem_pool **mem_pool, size_t initial_size) mem_pool_alloc_block(pool, initial_size, NULL); *mem_pool = pool; + trace_printf_key(&trace_mem_pool, "mem_pool (%p): init (%"PRIuMAX") initial size\n", + pool, (uintmax_t)initial_size); } void mem_pool_discard(struct mem_pool *mem_pool, int invalidate_memory) { struct mp_block *block, *block_to_free; + trace_printf_key(&trace_mem_pool, "mem_pool (%p): discard (%"PRIuMAX") unused\n", + mem_pool, (uintmax_t)(mem_pool->mp_block->end - mem_pool->mp_block->next_free)); block = mem_pool->mp_block; while (block) { base-commit: bb75be6cb916297f271c846f2f9caa3daaaec718 -- 2.18.0.windows.1
[PATCH v1] mem_pool: add GIT_TRACE_MEMPOOL support
From: Ben Peart Add tracing around initializing and discarding mempools. In discard report on the amount of memory unused in the current block to help tune setting the initial_size. Signed-off-by: Ben Peart --- Notes: Base Ref: * git-trace-mempool Web-Diff: https://github.com/benpeart/git/commit/9ac84bbca2 Checkout: git fetch https://github.com/benpeart/git git-trace-mempool-v1 && git checkout 9ac84bbca2 mem-pool.c | 5 + 1 file changed, 5 insertions(+) diff --git a/mem-pool.c b/mem-pool.c index a2841a4a9a..065389aaec 100644 --- a/mem-pool.c +++ b/mem-pool.c @@ -5,6 +5,7 @@ #include "cache.h" #include "mem-pool.h" +static struct trace_key trace_mem_pool = TRACE_KEY_INIT(MEMPOOL); #define BLOCK_GROWTH_SIZE 1024*1024 - sizeof(struct mp_block); /* @@ -48,12 +49,16 @@ void mem_pool_init(struct mem_pool **mem_pool, size_t initial_size) mem_pool_alloc_block(pool, initial_size, NULL); *mem_pool = pool; + trace_printf_key(&trace_mem_pool, "mem_pool (%p): init (%"PRIuMAX") initial size\n", + pool, (uintmax_t)initial_size); } void mem_pool_discard(struct mem_pool *mem_pool, int invalidate_memory) { struct mp_block *block, *block_to_free; + trace_printf_key(&trace_mem_pool, "mem_pool (%p): discard (%"PRIuMAX") unused\n", + mem_pool, (uintmax_t)(mem_pool->mp_block->end - mem_pool->mp_block->next_free)); block = mem_pool->mp_block; while (block) { base-commit: bb75be6cb916297f271c846f2f9caa3daaaec718 -- 2.18.0.windows.1
[PATCH v1] teach git to support a virtual (partially populated) work directory
From: Ben Peart To make git perform well on the very largest repos, we must make git operations O(modified) instead of O(size of repo). This takes advantage of the fact that the number of files a developer has modified (especially in very large repos) is typically a tiny fraction of the overall repo size. We accomplished this by utilizing the existing internal logic for the skip worktree bit and excludes to tell git to ignore all files and folders other than those that have been modified. This logic is driven by an external process that monitors writes to the repo and communicates the list of files and folders with changes to git via the virtual work directory hook in this patch. The external process maintains a list of files and folders that have been modified. When git runs, it requests the list of files and folders that have been modified via the virtual work directory hook. Git then sets/clears the skip-worktree bit on the cache entries and builds a hashmap of the modified files/folders that is used by the excludes logic to avoid scanning the entire repo looking for changes and untracked files. With this system, we have been able to make local git command performance on extremely large repos (millions of files, 1/2 million folders) entirely manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second commit, etc). On index load, clear/set the skip worktree bits based on the virtual work directory data. Use virtual work directory data to update skip-worktree bit in unpack-trees. Use virtual work directory data to exclude files and folders not explicitly requested. Signed-off-by: Ben Peart --- I believe I've incorporated all the feedback from the RFC. Renaming the feature, updating the setting to be a boolean with a hard coded hook name, labeling the feature "experimental," and only calling get_dtype() if the feature is turned on. If there are other suggestions on how to ensure this is a useful and general purpose feature please let me know. Notes: Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/65c3ca2e5f Checkout: git fetch https://github.com/benpeart/git virtual-workdir-v1 && git checkout 65c3ca2e5f Documentation/config/core.txt | 9 + Documentation/githooks.txt| 23 ++ Makefile | 1 + cache.h | 1 + config.c | 32 ++- config.h | 1 + dir.c | 26 ++- environment.c | 1 + read-cache.c | 2 + t/t1092-virtualworkdir.sh | 393 ++ unpack-trees.c| 23 +- virtualworkdir.c | 314 +++ virtualworkdir.h | 25 +++ 13 files changed, 843 insertions(+), 8 deletions(-) create mode 100755 t/t1092-virtualworkdir.sh create mode 100644 virtualworkdir.c create mode 100644 virtualworkdir.h diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index d0e6635fe0..49b7699a4e 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -68,6 +68,15 @@ core.fsmonitor:: avoiding unnecessary processing of files that have not changed. See the "fsmonitor-watchman" section of linkgit:githooks[5]. +core.virtualWorkDir:: + Please regard this as an experimental feature. + If set to true, utilize the virtual-work-dir hook to identify all + files and directories that are present in the working directory. + Git will only track and update files listed in the virtual work + directory. Using the virtual work directory will supersede the + sparse-checkout settings which will be ignored. + See the "virtual-work-dir" section of linkgit:githooks[6]. + core.trustctime:: If false, the ctime differences between the index and the working tree are ignored; useful when the inode change time diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..9888d504b4 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -485,6 +485,29 @@ The exit status determines whether git will use the data from the hook to limit its search. On error, it will fall back to verifying all files and folders. +virtual-work-dir + + +Please regard this as an experimental feature. + +The "Virtual Work Directory" hook allows populating the working directory +sparsely. The virtual work directory data is typically automatically +generated by an external process. Git will limit what files it checks for +changes as well as which directories are checked for untracked files based +on the path names given. Git will also only update those files listed in the +virtual work directory. + +The hook is invoked when the configuration option core.virtualWorkDir is +set to true. The hook takes one argument, a version (currentl
Re: [PATCH 2/5] ieot: default to not writing IEOT section
On 11/26/2018 2:59 PM, Stefan Beller wrote: +static int record_ieot(void) +{ + int val; + Initialize stack val to zero to ensure proper default. I don't think that is needed here, as we only use `val` when we first write to it via git_config_get_bool. Did you spot this via code review and thought of defensive programming or is there a tool that has a false positive here? Code review and defensive programming. I had to review the code in git_config_get_bool() to see if it always initialized the val even if it didn't find the requested config variable (esp since we don't pass in a default value for this function like we do others). + if (!git_config_get_bool("index.recordoffsettable", &val)) + return val; + return 0; +}
Re: [PATCH 1/5] eoie: default to not writing EOIE section
On 11/20/2018 1:11 AM, Jonathan Nieder wrote: Since 3b1d9e04 (eoie: add End of Index Entry (EOIE) extension, 2018-10-10) Git defaults to writing the new EOIE section when writing out an index file. Usually that is a good thing because it improves threaded performance, but when a Git repository is shared with older versions of Git, it produces a confusing warning: $ git status ignoring EOIE extension HEAD detached at 371ed0defa nothing to commit, working tree clean Let's introduce the new index extension more gently. First we'll roll out the new version of Git that understands it, and then once sufficiently many users are using such a version, we can flip the default to writing it by default. Introduce a '[index] recordEndOfIndexEntries' configuration variable to allow interested users to benefit from this index extension early. Signed-off-by: Jonathan Nieder --- Rebased. No other change from v1. As Jonathan pointed out, it would be nice to have tests here. Ben, any advice for how I could write some in a followup change? E.g. does Derrick Stolee's tracing based testing trick apply here? I suppose a 'test-dump-eoie' could be written along the lines of test-dump-fsmonitor or test-dump-untracked-cache. Unlike those, there isn't much state to dump other than the existence of the extension and the offset. That could be used to test that the new settings are working properly.
Re: [PATCH 5/5] index: offer advice for unknown index extensions
On 11/20/2018 4:26 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Nov 20 2018, Jonathan Nieder wrote: Just commenting here on the end-state of this since it's easier than each patch at a time: First, do we still need to be doing %.4s instead of just %s? It would be easier for translators / to understand what's going on if it were just %s. I.e. "this is the extension name" v.s. "this is the first 4 bytes of whatever it is...". return error("index uses %.4s extension, which we do not understand", ext); Missing _(). Not the fault of this series, but something to fix while we're at it. Also not the fault of this series, the "is this upper case" test is unportable, but this is probably the tip of the iceberg for git not working on EBCDIC systems. This message should say something like "Index uses the mandatory %s extension" to clarify and distinguish it from the below. We don't understand the upper-case one either, but the important distinction is that one is mandatory, and the other can be dropped. The two messages should make this clear. Also, having advice() for that case is even more valuable since we have a hard error at this point. So something like: "This is likely due to the index having been written by a future version of Git. All-lowercase index extensions are mandatory, as opposed to optional all-uppercase ones which we'll drop with a warning if we see them". I agree that we should have different messages for mandatory and optional extensions. I don't think we should try and educate the end user on the implementation detail that git makes lower cases mandatory and upper case optional (ie drop the 'All-lowercase..." part). They will never see the lower vs upper case difference and can't do anything about it anyway. trace_printf("ignoring %.4s extension\n", ext); + if (advice_unknown_index_extension) { + warning(_("ignoring optional %.4s index extension"), ext); Should start with upper-case. Good that it says "optional". + advise(_("This is likely due to the file having been written by a newer\n" +"version of Git than is reading it. You can upgrade Git to\n" +"take advantage of performance improvements from the updated\n" +"file format.\n" Let's not promise performance improvements with this extension in a future version. We have no idea what the extension is, yeah right now it's going to be true for the extension that prompted this patch series, but may not be in the future. So just something like this for the last sentence: You can try upgrading Git to use this new index format. Agree - not all are guaranteed to be perf related. +"\n" +"Run \"%s\"\n" +"to suppress this message."), + "git config advice.unknownIndexExtension false"); Somewhat of an aside, but if I grep: git grep -C10 'git config advice\..*false' -- '*.[ch]' There's a few existing examples of this, but the majority of advice() messages don't say in the message how you can turn these off. Do we think this a message users would especially like to turn off? I have the opposite impression, it's a one-off in most cases, although not in the case where an editor has an embedded git. I think it would make sense to add this sort of thing to the advice() API, i.e.: advice_with_config_hint(_(""), "unknownIndexExtension"); Which would then know how to consistently print this advice about how to turn off the warning. I like this. I personally never knew you could turn of the "spent xxx seconds finding untracked files" advice until I worked on this patch series. This would help make that feature more discoverable.
Re: [PATCH 4/5] index: make index.threads=true enable ieot and eoie
On 11/20/2018 1:14 AM, Jonathan Nieder wrote: If a user explicitly sets [index] threads = true to read the index using multiple threads, ensure that index writes include the offset table by default to make that possible. This ensures that the user's intent of turning on threading is respected. In other words, permit the following configurations: - index.threads and index.recordOffsetTable unspecified: do not write the offset table yet (to avoid alarming the user with "ignoring IEOT extension" messages when an older version of Git accesses the repository) but do make use of multiple threads to read the index if the supporting offset table is present. This can also be requested explicitly by setting index.threads=true, 0, or >1 and index.recordOffsetTable=false. - index.threads=false or 1: do not write the offset table, and do not make use of the offset table. One can set index.recordOffsetTable=false as well, to be more explicit. - index.threads=true, 0, or >1 and index.recordOffsetTable unspecified: write the offset table and make use of threads at read time. This can also be requested by setting index.threads=true, 0, >1, or unspecified and index.recordOffsetTable=true. Fortunately the complication is temporary: once most Git installations have upgraded to a version with support for the IEOT and EOIE extensions, we can flip the defaults for index.recordEndOfIndexEntries and index.recordOffsetTable to true and eliminate the settings. This looks good. I think this provides good default behavior while enabling fine grained control to those who want/need it. I'm looking forward to the day when we can turn it back on by default so that people can take advantage of the speed improvements.
Re: [PATCH 2/5] ieot: default to not writing IEOT section
On 11/20/2018 1:12 AM, Jonathan Nieder wrote: As with EOIE, popular versions of Git do not support the new IEOT extension yet. When accessing a Git repository written by a more modern version of Git, they correctly ignore the unrecognized section, but in the process they loudly warn ignoring IEOT extension resulting in confusion for users. Introduce the index extension more gently by not writing it yet in this first version with support for it. Soon, once sufficiently many users are running a modern version of Git, we can flip the default so users benefit from this index extension by default. Introduce a '[index] recordOffsetTable' configuration variable to control whether the new index extension is written. Signed-off-by: Jonathan Nieder --- As with patch 1/5, no change from v1 other than rebasing. Documentation/config/index.txt | 7 +++ read-cache.c | 11 ++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 8e138aba7a..de44183235 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -5,6 +5,13 @@ index.recordEndOfIndexEntries:: reading the index using Git versions before 2.20. Defaults to 'false'. +index.recordOffsetTable:: + Specifies whether the index file should include an "Index Entry + Offset Table" section. This reduces index load time on + multiprocessor machines but produces a message "ignoring IEOT + extension" when reading the index using Git versions before 2.20. + Defaults to 'false'. + index.threads:: Specifies the number of threads to spawn when loading the index. This is meant to reduce index load time on multiprocessor machines. diff --git a/read-cache.c b/read-cache.c index 1e9c772603..f3d5638d9e 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2698,6 +2698,15 @@ static int record_eoie(void) return 0; } +static int record_ieot(void) +{ + int val; + Initialize stack val to zero to ensure proper default. + if (!git_config_get_bool("index.recordoffsettable", &val)) + return val; + return 0; +} + /* * On success, `tempfile` is closed. If it is the temporary file * of a `struct lock_file`, we will therefore effectively perform @@ -2761,7 +2770,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, else nr_threads = 1; - if (nr_threads != 1) { + if (nr_threads != 1 && record_ieot()) { int ieot_blocks, cpus; /*
Re: [PATCH 1/5] eoie: default to not writing EOIE section
On 11/20/2018 1:11 AM, Jonathan Nieder wrote: Since 3b1d9e04 (eoie: add End of Index Entry (EOIE) extension, 2018-10-10) Git defaults to writing the new EOIE section when writing out an index file. Usually that is a good thing because it improves threaded performance, but when a Git repository is shared with older versions of Git, it produces a confusing warning: $ git status ignoring EOIE extension HEAD detached at 371ed0defa nothing to commit, working tree clean Let's introduce the new index extension more gently. First we'll roll out the new version of Git that understands it, and then once sufficiently many users are using such a version, we can flip the default to writing it by default. Introduce a '[index] recordEndOfIndexEntries' configuration variable to allow interested users to benefit from this index extension early. Signed-off-by: Jonathan Nieder --- Rebased. No other change from v1. As Jonathan pointed out, it would be nice to have tests here. Ben, any advice for how I could write some in a followup change? E.g. does Derrick Stolee's tracing based testing trick apply here? Documentation/config/index.txt | 7 +++ read-cache.c | 11 ++- t/t1700-split-index.sh | 11 +++ 3 files changed, 24 insertions(+), 5 deletions(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 4b94b6bedc..8e138aba7a 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -1,3 +1,10 @@ +index.recordEndOfIndexEntries:: + Specifies whether the index file should include an "End Of Index + Entry" section. This reduces index load time on multiprocessor + machines but produces a message "ignoring EOIE extension" when + reading the index using Git versions before 2.20. Defaults to + 'false'. + index.threads:: Specifies the number of threads to spawn when loading the index. This is meant to reduce index load time on multiprocessor machines. diff --git a/read-cache.c b/read-cache.c index 4ca81286c0..1e9c772603 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2689,6 +2689,15 @@ void update_index_if_able(struct index_state *istate, struct lock_file *lockfile rollback_lock_file(lockfile); } +static int record_eoie(void) +{ + int val; I believe you are going to want to initialize val to 0 here as it is on the stack so is not guaranteed to be zero. + + if (!git_config_get_bool("index.recordendofindexentries", &val)) + return val; + return 0; +} + /* * On success, `tempfile` is closed. If it is the temporary file * of a `struct lock_file`, we will therefore effectively perform @@ -2936,7 +2945,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, * read. Write it out regardless of the strip_extensions parameter as we need it * when loading the shared index. */ - if (offset) { + if (offset && record_eoie()) { struct strbuf sb = STRBUF_INIT; write_eoie_extension(&sb, &eoie_c, offset); diff --git a/t/t1700-split-index.sh b/t/t1700-split-index.sh index 2ac47aa0e4..0cbac64e28 100755 --- a/t/t1700-split-index.sh +++ b/t/t1700-split-index.sh @@ -25,14 +25,17 @@ test_expect_success 'enable split index' ' git update-index --split-index && test-tool dump-split-index .git/index >actual && indexversion=$(test-tool index-version <.git/index) && + + # NEEDSWORK: Stop hard-coding checksums. if test "$indexversion" = "4" then - own=3527df833c6c100d3d1d921a9a782d62a8be4b58 - base=746f7ab2ed44fb839efdfbffcf399d0b113fb4cb + own=432ef4b63f32193984f339431fd50ca796493569 + base=508851a7f0dfa8691e9f69c7f055865389012491 else - own=5e9b60117ece18da410ddecc8b8d43766a0e4204 - base=4370042739b31cd17a5c5cd6043a77c9a00df113 + own=8299b0bcd1ac364e5f1d7768efb62fa2da79a339 + base=39d890139ee5356c7ef572216cebcd27aa41f9df fi && + cat >expect <<-EOF && own $own base $base
Re: Git Test Coverage Report (v2.20.0-rc0)
On 11/19/2018 10:40 AM, Derrick Stolee wrote: The test coverage reports started mid-way through this release cycle, so I thought it would be good to do a full review of the new uncovered code since the last release. I eliminated most of the uncovered code due to the following cases: 1. Code was only moved or refactored. 2. Code was related to unusual error conditions (e.g. open_pack_index() fails) The comments below are intended only to point out potential directions to improve test coverage. Some of it is for me to do! Thanks, -Stolee On 11/18/2018 9:54 PM, Derrick Stolee wrote: There are a lot of lines introduced by the IEOT extension in these commits: > Ben Peart 3255089ad: ieot: add Index Entry Offset Table (IEOT) extension > Ben Peart 3b1d9e045: eoie: add End of Index Entry (EOIE) extension > Ben Peart 77ff1127a: read-cache: load cache entries on worker threads > Ben Peart abb4bb838: read-cache: load cache extensions on a worker thread > Ben Peart c780b9cfe: config: add new index.threads config setting > Ben Peart d1664e73a: add: speed up cmd_add() by utilizing read_cache_preload() > Ben Peart fa655d841: checkout: optimize "git checkout -b " These should be hit if you run the test suite with GIT_TEST_INDEX_THREADS=2. Without that, the indexes for the various tests are too small to trigger multi-threaded index reads/writes. From t/README: GIT_TEST_INDEX_THREADS= enables exercising the multi-threaded loading of the index for the whole test suite by bypassing the default number of cache entries and thread minimums. Setting this to 1 will make the index loading single threaded.
Re: [PATCH 3/3] index: do not warn about unrecognized extensions
On 11/13/2018 10:24 PM, Junio C Hamano wrote: Jonathan Nieder writes: We cannot change the past, but for index extensions of the future, there is a straightforward improvement: silence that message except when tracing. This way, the message is still available when debugging, but in everyday use it does not show up so (once most Git users have this patch) we can turn on new optional extensions right away without alarming people. That argument ignores the "let the users know they are using a stale version when they did use (either by accident or deliberately) a more recent one" value, though. Even if we consider that this is only for debugging, I am not sure if tracing is the right place to add. As long as the "optional extensions can be ignored without affecting the correctness" rule holds, there is nothing gained by letting these messages shown for debugging purposes Having recently written a couple of patches that utilize an optional extension - I actually found the warning to be a helpful debugging tool and would like to see them enabled via tracing. It would also be helpful to see the opposite - I'm looking for an optional extension but it is missing. The most common scenario was when I'd be testing my changes in different repos and forget that I needed to force an updated index to be written that contained the extension I was trying to test. The "silently ignore the optional extension" behavior is good for end users but as a developer, I'd like to be able to have it yell at me via tracing. :-) IMHO - if an end user has to turn on tracing, I view that as a failure on our part. No end user should have to understand the inner workings of git to be able to use it effectively. and if there is such a bug (e.g. we introduced an optional extension but the code that wrote an index with an optional extension wrote the non-optional part in such a way that it cannot be correctly handled without the extension that is supposed to be optional) we'd probably want to let users notice without having to explicitly go into a debugging session. If Googling for "ignoring FNCY ext" gives "discard your index with 'reset HEAD', because an index file with FNCY ext cannot be read without understanding it", that may prevent damages from happening in the first place. On the other hand, hiding it behind tracing would mean the user first need to exprience an unknown breakage first and then has to enable tracing among other 47 different things to diagnose and drill down to the root cause.
Re: [PATCH 2/3] ieot: default to not writing IEOT section
On 11/13/2018 4:08 PM, Jonathan Nieder wrote: Hi again, Ben Peart wrote: On 11/13/2018 1:18 PM, Jonathan Nieder wrote: Ben Peart wrote: Why introduce a new setting to disable writing the IEOT extension instead of just using the existing index.threads setting? If index.threads=1 then the IEOT extension isn't written which (I believe) will accomplish the same goal. Do you mean defaulting to index.threads=1? I don't think that would be a good default, but if you have a different change in mind then I'd be happy to hear it. Or do you mean that if the user has explicitly specified index.threads=true, then that should imply index.recordOffsetTable=true so users only have to set one setting to turn it on? I can imagine that working well. Reading the index with multiple threads requires the EOIE and IEOT extensions to exist in the index. If either extension doesn't exist, then the code falls back to the single threaded path. That means you can't have both 1) no warning for old versions of git and 2) multi-threaded reading for new versions of git. If you set index.threads=1, that will prevent the IEOT extension from being written and there will be no "ignoring IEOT extension" warning in older versions of git. With this patch 'as is' you would have to set both index.threads=true and index.recordOffsetTable=true to get multi-threaded index reads. If either is set to false, it will silently drop back to single threaded reads. Sorry, I'm still not understanding what you're proposing. What would be - the default behavior - the mechanism for changing that behavior under your proposal? I consider index.threads=1 to be a bad default. I would understand if you are saying that that should be the default, and I tried to propose a different way to achieve what you're looking for in the quoted reply above (but I'm not understanding whether you like that proposal or not). Today, both the write logic (ie should we write out the IEOT extension) and the read logic (should I use the IEOT, if available, and do multi-threaded reading) are controlled by the single "index.threads" setting. I would like to keep the settings as simple as possible to prevent user confusion. If we have two different settings (index.threads and index.recordoffsettable) then the only combination that will result in the user actually getting multi-threaded reads is when they are both set to true. Any other combination will silently fail. I think it would be confusing if you set index.threads=true and got no error message but didn't get multi-threaded reads either (or vice versa). If you want to prevent any of the scary "ignoring IEOT extension" from ever happening then your only option is to turn off the IEOT writing by default. The downside is that people have to discover and turn it on if they want the improvements. This can be achieved by changing the default for index.threads from "true" to "false." diff --git a/config.c b/config.c index 2ee29f6f86..86f5c14294 100644 --- a/config.c +++ b/config.c @@ -2291,7 +2291,7 @@ int git_config_get_fsmonitor(void) int git_config_get_index_threads(void) { - int is_bool, val = 0; + int is_bool, val = 1; val = git_env_ulong("GIT_TEST_INDEX_THREADS", 0); if (val) If you want to provide a way for a concerned user to disable the message after the first time they have seen it, then they can be instructed to run 'git config --global index.threads false' There is no way to get multi-threaded reads and NOT get the scary message with older versions of git. Multi-threaded reads require the IEOT extension to be written into the index and the existence of the IEOT extension in the index will always generate the scary warning. Jonathan
Re: [PATCH 2/3] ieot: default to not writing IEOT section
On 11/13/2018 1:18 PM, Jonathan Nieder wrote: Hi, Ben Peart wrote: On 11/12/2018 7:39 PM, Jonathan Nieder wrote: As with EOIE, popular versions of Git do not support the new IEOT extension yet. When accessing a Git repository written by a more modern version of Git, they correctly ignore the unrecognized section, but in the process they loudly warn ignoring IEOT extension resulting in confusion for users. Introduce the index extension more gently by not writing it yet in this first version with support for it. [...] Introduce a '[index] recordOffsetTable' configuration variable to control whether the new index extension is written. Why introduce a new setting to disable writing the IEOT extension instead of just using the existing index.threads setting? If index.threads=1 then the IEOT extension isn't written which (I believe) will accomplish the same goal. Do you mean defaulting to index.threads=1? I don't think that would be a good default, but if you have a different change in mind then I'd be happy to hear it. Or do you mean that if the user has explicitly specified index.threads=true, then that should imply index.recordOffsetTable=true so users only have to set one setting to turn it on? I can imagine that working well. Reading the index with multiple threads requires the EOIE and IEOT extensions to exist in the index. If either extension doesn't exist, then the code falls back to the single threaded path. That means you can't have both 1) no warning for old versions of git and 2) multi-threaded reading for new versions of git. If you set index.threads=1, that will prevent the IEOT extension from being written and there will be no "ignoring IEOT extension" warning in older versions of git. With this patch 'as is' you would have to set both index.threads=true and index.recordOffsetTable=true to get multi-threaded index reads. If either is set to false, it will silently drop back to single threaded reads. Thanks, Jonathan
Re: [PATCH 3/3] index: do not warn about unrecognized extensions
On 11/12/2018 7:40 PM, Jonathan Nieder wrote: Documentation/technical/index-format explains: 4-byte extension signature. If the first byte is 'A'..'Z' the extension is optional and can be ignored. This allows gracefully introducing a new index extension without having to rely on all readers having support for it. Mandatory extensions start with a lowercase letter and optional ones start with a capital. Thus the versions of Git acting on a shared local repository do not have to upgrade in lockstep. We almost obey that convention, but there is a problem: when encountering an unrecognized optional extension, we write ignoring FNCY extension to stderr, which alarms users. This means that in practice we have had to introduce index extensions in two steps: first add read support, and then a while later, start writing by default. This delays when users can benefit from improvements to the index format. We cannot change the past, but for index extensions of the future, there is a straightforward improvement: silence that message except when tracing. This way, the message is still available when debugging, but in everyday use it does not show up so (once most Git users have this patch) we can turn on new optional extensions right away without alarming people. The best patch of the bunch. Glad to see it. I'm fine with doing this via advise.unknownIndexExtension as well. Who knows, someone may actually want to see this and not have tracing turned on. I don't know who but it is possible :-) Signed-off-by: Jonathan Nieder --- Thanks for reading. Thoughts? Sincerely, Jonathan read-cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/read-cache.c b/read-cache.c index 290bd54708..65530a68c2 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1720,7 +1720,7 @@ static int read_index_extension(struct index_state *istate, if (*ext < 'A' || 'Z' < *ext) return error("index uses %.4s extension, which we do not understand", ext); - fprintf(stderr, "ignoring %.4s extension\n", ext); + trace_printf("ignoring %.4s extension\n", ext); break; } return 0;
Re: [PATCH 2/3] ieot: default to not writing IEOT section
On 11/12/2018 7:39 PM, Jonathan Nieder wrote: As with EOIE, popular versions of Git do not support the new IEOT extension yet. When accessing a Git repository written by a more modern version of Git, they correctly ignore the unrecognized section, but in the process they loudly warn ignoring IEOT extension resulting in confusion for users. Introduce the index extension more gently by not writing it yet in this first version with support for it. Soon, once sufficiently many users are running a modern version of Git, we can flip the default so users benefit from this index extension by default. Introduce a '[index] recordOffsetTable' configuration variable to control whether the new index extension is written. Why introduce a new setting to disable writing the IEOT extension instead of just using the existing index.threads setting? If index.threads=1 then the IEOT extension isn't written which (I believe) will accomplish the same goal. Signed-off-by: Jonathan Nieder --- Documentation/config.txt | 7 +++ read-cache.c | 11 ++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index d702379db4..cc66fb7de3 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2195,6 +2195,13 @@ index.recordEndOfIndexEntries:: reading the index using Git versions before 2.20. Defaults to 'false'. +index.recordOffsetTable:: + Specifies whether the index file should include an "Index Entry + Offset Table" section. This reduces index load time on + multiprocessor machines but produces a message "ignoring IEOT + extension" when reading the index using Git versions before 2.20. + Defaults to 'false'. + index.threads:: Specifies the number of threads to spawn when loading the index. This is meant to reduce index load time on multiprocessor machines. diff --git a/read-cache.c b/read-cache.c index 4bfe93c4c2..290bd54708 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2707,6 +2707,15 @@ static int record_eoie(void) return 0; } +static int record_ieot(void) +{ + int val; + + if (!git_config_get_bool("index.recordoffsettable", &val)) + return val; + return 0; +} + /* * On success, `tempfile` is closed. If it is the temporary file * of a `struct lock_file`, we will therefore effectively perform @@ -2767,7 +2776,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, #ifndef NO_PTHREADS nr_threads = git_config_get_index_threads(); - if (nr_threads != 1) { + if (nr_threads != 1 && record_ieot()) { int ieot_blocks, cpus; /*
Re: [PATCH 1/3] eoie: default to not writing EOIE section
On 11/12/2018 8:05 PM, Junio C Hamano wrote: Jonathan Nieder writes: Since 3b1d9e04 (eoie: add End of Index Entry (EOIE) extension, 2018-10-10) Git defaults to writing the new EOIE section when writing out an index file. Usually that is a good thing because it improves threaded performance, but when a Git repository is shared with older versions of Git, it produces a confusing warning: $ git status ignoring EOIE extension HEAD detached at 371ed0defa nothing to commit, working tree clean Let's introduce the new index extension more gently. First we'll roll out the new version of Git that understands it, and then once sufficiently many users are using such a version, we can flip the default to writing it by default. Introduce a '[index] recordEndOfIndexEntries' configuration variable to allow interested users to benefit from this index extension early. Thanks. I am in principle OK with this approach. In fact, I suspect that the default may want to be dynamically determined, and we give this knob to let the users further force their preference. When no extension that benefits from multi-threading is written, the default can stay "no" in future versions of Git, for example. While I can understand the user confusion the warning about ignoring an extension could cause I guess I'm a little surprised that people would see it that often. To see the warning means they are running a new version of git in the same repo as they are running an old version of git. I just haven't ever experienced that (I only ever have one copy of git installed) so am surprised it comes up often enough to warrant this change. That said, if it _is_ that much of an issue, this patch makes sense and provides a way to more gracefully transition into this feature. Even if we had some logic to dynamically determine whether to write it or not, we'd still want to avoid confusing users when it did get written out. diff --git a/Documentation/config.txt b/Documentation/config.txt index 41a9ff2b6a..d702379db4 100644 The timing is a bit unfortunate for any topic to touch this file, and contrib/cocci would not help us in this case X-<. diff --git a/read-cache.c b/read-cache.c index f3a848d61c..4bfe93c4c2 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2698,6 +2698,15 @@ void update_index_if_able(struct index_state *istate, struct lock_file *lockfile rollback_lock_file(lockfile); } +static int record_eoie(void) +{ + int val; + + if (!git_config_get_bool("index.recordendofindexentries", &val)) + return val; + return 0; +} Unconditionally defaulting to no in this round is perfectly fine. Let's make a mental note that this is the place to decide dynamic default in the future when we want to. It would probably have to ask around various "extension writing" helpers if they want to have a say in the outcome (e.g. if there are very many cache entries in the istate, the entry offset table may want to be written and otherwise not). @@ -2945,7 +2954,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, * read. Write it out regardless of the strip_extensions parameter as we need it * when loading the shared index. */ - if (offset) { + if (offset && record_eoie()) { struct strbuf sb = STRBUF_INIT; write_eoie_extension(&sb, &eoie_c, offset); diff --git a/t/t1700-split-index.sh b/t/t1700-split-index.sh index 2ac47aa0e4..0cbac64e28 100755 --- a/t/t1700-split-index.sh +++ b/t/t1700-split-index.sh @@ -25,14 +25,17 @@ test_expect_success 'enable split index' ' git update-index --split-index && test-tool dump-split-index .git/index >actual && indexversion=$(test-tool index-version <.git/index) && + + # NEEDSWORK: Stop hard-coding checksums. Also let's stop hard-coding the assumption that the new knob is off by default. Ideally, you'd want to test both cases, right? Perhaps you'd call "git update-index --split-index" we see in the precontext twice, with "-c VAR=false" and "-c VAR=true", to prepare "actual.without-eoie" and "actual.with-eoie", or something like that? Thanks. if test "$indexversion" = "4" then - own=3527df833c6c100d3d1d921a9a782d62a8be4b58 - base=746f7ab2ed44fb839efdfbffcf399d0b113fb4cb + own=432ef4b63f32193984f339431fd50ca796493569 + base=508851a7f0dfa8691e9f69c7f055865389012491 else - own=5e9b60117ece18da410ddecc8b8d43766a0e4204 - base=4370042739b31cd17a5c5cd6043a77c9a00df113 + own=8299b0bcd1ac364e5f1d7768efb62fa2da79a339 + base=39d890139ee5356c7ef572216cebcd27aa41f9df fi && + cat >expect <<-EOF && own $own base $base
Re: [RFC v1] Add virtual file system settings and hook proc
On 11/4/2018 4:01 PM, brian m. carlson wrote: On Sun, Nov 04, 2018 at 07:34:01AM +0100, Duy Nguyen wrote: On Wed, Oct 31, 2018 at 9:53 PM Ben Peart wrote: It's more than a dynamic sparse-checkout because the same list is also used to exclude any file/folder not listed. That means any file not listed won't ever be updated by git (like in 'checkout' for example) so 'stale' files could be left in the working directory. It also means git won't find new/untracked files unless they are specifically added to the list. OK. I'm not at all interested in carrying maintenance burden for some software behind closed doors. I could see values in having a more flexible sparse checkout but this now seems like very tightly designed for GVFS. So unless there's another use case (preferably open source) for this, I don't think this should be added in git.git. I should point out that VFS for Git is an open-source project and will likely have larger use than just at Microsoft. There are both Windows and Mac clients and there are plans for a Linux client as well. Ideally, it would work with an unmodified upstream Git, which is (I assume) why Ben is sending this series. Personally, I don't love the current name used in this series. I don't see this patch as introducing a virtual file system in the Unix sense of that word, and I think calling it that in Git core will be confusing to Unix users. I would prefer to see it as a hook (maybe called "sparse-checkout" or "sparse-exclude"; better names are okay), and simply turn it on based on whether or not there's an appropriate hook file there and whether core.sparseCheckout is on (or possibly with hook.sparseExclude or something). With a design more like that, I don't see a problem with it in principle. I'm really bad at naming so am happy to choose something else that will be more descriptive to the community at large. The name came from the fact that we started with the (equally awful) 'VFS for Git' and this was the big enabling feature in git so for better or worse it got saddled with the same 'VFS' name. In other feedback it was suggested to not add a core.vfs setting that was the path to the hook and I like that. I can change it to core.sparseExclude (unless someone has something better) and hard code the hook name for now. I do like the idea of having config based hooks so when I get some time I will put together a patch series to implement that.
Re: [RFC v1] Add virtual file system settings and hook proc
On 11/5/2018 10:22 AM, Duy Nguyen wrote: On Sun, Nov 4, 2018 at 10:01 PM brian m. carlson wrote: On Sun, Nov 04, 2018 at 07:34:01AM +0100, Duy Nguyen wrote: On Wed, Oct 31, 2018 at 9:53 PM Ben Peart wrote: It's more than a dynamic sparse-checkout because the same list is also used to exclude any file/folder not listed. That means any file not listed won't ever be updated by git (like in 'checkout' for example) so 'stale' files could be left in the working directory. It also means git won't find new/untracked files unless they are specifically added to the list. OK. I'm not at all interested in carrying maintenance burden for some software behind closed doors. I could see values in having a more flexible sparse checkout but this now seems like very tightly designed for GVFS. So unless there's another use case (preferably open source) for this, I don't think this should be added in git.git. I should point out that VFS for Git is an open-source project and will likely have larger use than just at Microsoft. There are both Windows and Mac clients and there are plans for a Linux client as well. Ideally, it would work with an unmodified upstream Git, which is (I assume) why Ben is sending this series. Ah I didn't know that. Thank you. I'll have to look at this GVFS some time then. If we're going to support GVFS though, I think there should be a big (RFC perhaps) series that includes everything to at least give an overview what the end game looks like. Then it could be split up into smaller series. We've always had the goal of not needing a fork at all and are continually working to bring the list of differences to zero but in the mean time, you can see the entire set of changes we've made here [1]. If you look, most of them are changes we are already in process of submitting to git (ie midx, tracing, etc) or patches we fast tracked from master to our branch (your unpack_trees() optimizations for example). Most of the others are small tweaks and features for performance or to smooth integration issues. This RFC contains the core changes that were required to enable VFS for Git. [1] https://github.com/git-for-windows/git/compare/master...Microsoft:gvfs-2.19.1
Re: [RFC v1] Add virtual file system settings and hook proc
On 11/5/2018 10:26 AM, Duy Nguyen wrote: On Mon, Nov 5, 2018 at 12:40 PM Ævar Arnfjörð Bjarmason wrote: On Sun, Nov 04 2018, Duy Nguyen wrote: On Wed, Oct 31, 2018 at 9:53 PM Ben Peart wrote: +core.virtualFilesystem:: + If set, the value of this variable is used as a command which + will identify all files and directories that are present in + the working directory. Git will only track and update files + listed in the virtual file system. Using the virtual file system + will supersede the sparse-checkout settings which will be ignored. + See the "virtual file system" section of linkgit:githooks[6]. It sounds like "virtual file system" is just one of the use cases for this feature, which is more about a dynamic source of sparse-checkout bits. Perhaps name the config key with something along sparse checkout instead of naming it after a use case. It's more than a dynamic sparse-checkout because the same list is also used to exclude any file/folder not listed. That means any file not listed won't ever be updated by git (like in 'checkout' for example) so 'stale' files could be left in the working directory. It also means git won't find new/untracked files unless they are specifically added to the list. OK. I'm not at all interested in carrying maintenance burden for some software behind closed doors. I could see values in having a more flexible sparse checkout but this now seems like very tightly designed for GVFS. So unless there's another use case (preferably open source) for this, I don't think this should be added in git.git. I haven't looked at the patch in any detail beyond skimming it, and you're more familiar with this area... But in principle I'm very interested in getting something like this in git.git, even if we were to assume GVFS was a 100% proprietary implementation. I have nothing against building a GVFS-like solution. If what's submitted can be the building blocks for that, great! But if it was just for GVFS (and it was not available to everybody) then no thank you. Not only is VFS for Git open source and is/will be supported on Windows, Mac and Linux, the interface being proposed is quite generic so should be usable for other implementations. To use it, you just need to provide a hook that will return a list of files git should pay attention to (using a subset of the existing sparse-checkout format). If you see anything that would make using it difficult for other solutions to use, let's fix it now!
Re: [RFC v1] Add virtual file system settings and hook proc
On 11/4/2018 7:02 PM, Junio C Hamano wrote: Ben Peart writes: + if (*dtype == DT_UNKNOWN) + *dtype = get_dtype(NULL, istate, pathname, pathlen); We try to defer paying cost to determine unknown *dtype as late as possible by having this call in last_exclude_matching_from_list(), and not here. If we are doing this, we probably should update the callpaths that call last_exclude_matching_from_list() to make the caller responsible for doing get_dtype() and drop the lazy finding of dtype from the callee. Alternatively, the new "is excluded from vfs" helper can learn to do the lazy get_dtype() just like the existing last_exclude_matching_from_list() does. I suspect the latter may be simpler. In is_excluded_from_virtualfilesystem() dtype can't be lazy because it is always needed (which is why I test and die if it isn't known). You make a call to that function even when virtual-file-system hook is not in use, i.e. instead of the caller saying if (is_vfs_in_use()) { *dtype = get_dtype(...); if (is_excluded_from_vfs(...) > 0) return 1; } your caller makes an unconditional call to is_excluded_from_vfs(). Isn't that the only reason why you break the laziness of determining dtype? You can keep the caller simple by making an unconditional call, but maintain the laziness by updating the callig convention to pass dtype (not *dtype) to the function, e.g.. if (is_excluded_from_vfs(pathname, pathlen, dtype) > 0) return 1; and then at the beginning of the helper if (is_vfs_in_use()) return -1; /* undetermined */ *dtype = get_dtype(...); ... whatever logic it has now ... no? Oops! You're right, I docall get_dtype() even if the vfs isn't in use. I'll add an additional test to avoid doing that. Thank you. I did look into moving the delay load logic into is_excluded_from_vfs() but get_dtype() is static to dir.c and I'd prefer to keep it that way if possible. Your comments are all feedback on the code - how it was implemented, style, etc. Any thoughts on whether this is something we could/should merge into master (after any necessary cleanup)? Would anyone else find this interesting/helpful? I am pretty much neutral. Not strongly opposed to it, but not all that interested until seeing its integration with the "userland" to see how the whole thing works ;-)
[PATCH v1] refresh_index: remove unnecessary calls to preload_index()
From: Ben Peart With refresh_index() learning to utilize preload_index() to speed up its operation there is no longer any benefit to having the caller preload the index first. Remove those unneeded calls by calling read_index() instead of the preload variant. There is no measurable performance impact of this patch - the 2nd call to preload_index() bails out quickly but there is no reason to call it twice. Signed-off-by: Ben Peart --- Notes: Base Ref: Web-Diff: https://github.com/benpeart/git/commit/384f7fed53 Checkout: git fetch https://github.com/benpeart/git no-index-preload-v1 && git checkout 384f7fed53 builtin/commit.c | 2 +- builtin/describe.c | 2 +- builtin/update-index.c | 2 +- sequencer.c| 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/builtin/commit.c b/builtin/commit.c index 074bd9a551..96d336ec3d 100644 --- a/builtin/commit.c +++ b/builtin/commit.c @@ -1363,7 +1363,7 @@ int cmd_status(int argc, const char **argv, const char *prefix) if (status_format != STATUS_FORMAT_PORCELAIN && status_format != STATUS_FORMAT_PORCELAIN_V2) progress_flag = REFRESH_PROGRESS; - read_index_preload(&the_index, &s.pathspec, progress_flag); + read_index(&the_index); refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED|progress_flag, &s.pathspec, NULL, NULL); diff --git a/builtin/describe.c b/builtin/describe.c index c48c34e866..cc118448ee 100644 --- a/builtin/describe.c +++ b/builtin/describe.c @@ -629,7 +629,7 @@ int cmd_describe(int argc, const char **argv, const char *prefix) struct argv_array args = ARGV_ARRAY_INIT; int fd, result; - read_cache_preload(NULL); + read_cache(); refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, NULL, NULL, NULL); fd = hold_locked_index(&index_lock, 0); diff --git a/builtin/update-index.c b/builtin/update-index.c index 07c10bcb7d..0e1dcf0438 100644 --- a/builtin/update-index.c +++ b/builtin/update-index.c @@ -782,7 +782,7 @@ struct refresh_params { static int refresh(struct refresh_params *o, unsigned int flag) { setup_work_tree(); - read_cache_preload(NULL); + read_cache(); *o->has_errors |= refresh_cache(o->flags | flag); return 0; } diff --git a/sequencer.c b/sequencer.c index 9e1ab3a2a7..ab2048ac3a 100644 --- a/sequencer.c +++ b/sequencer.c @@ -1919,7 +1919,7 @@ static int read_and_refresh_cache(struct replay_opts *opts) { struct lock_file index_lock = LOCK_INIT; int index_fd = hold_locked_index(&index_lock, 0); - if (read_index_preload(&the_index, NULL, 0) < 0) { + if (read_index(&the_index) < 0) { rollback_lock_file(&index_lock); return error(_("git %s: failed to read the index"), _(action_name(opts))); base-commit: 095c8dc8c2a9d61783dbae79a7f6e8d80092696f -- 2.18.0.windows.1
Re: [PATCH v1] add: speed up cmd_add() by utilizing read_cache_preload()
On 11/2/2018 11:23 AM, Junio C Hamano wrote: Ben Peart writes: From: Ben Peart During an "add", a call is made to run_diff_files() which calls check_remove() for each index-entry. The preload_index() code distributes some of the costs across multiple threads. Nice. I peeked around and noticed that we already do this in builtin_diff_index() before running run_diff_index() when !cached, and builtin_diff_files(), of course. Thanks for double checking! Because the files checked are restricted to pathspec, adding individual files makes no measurable impact but on a Windows repo with ~200K files, 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. ;-) diff --git a/builtin/add.c b/builtin/add.c index ad49806ebf..f65c172299 100644 --- a/builtin/add.c +++ b/builtin/add.c @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix) return 0; } - if (read_cache() < 0) - die(_("index file corrupt")); - - die_in_unpopulated_submodule(&the_index, prefix); - /* * Check the "pathspec '%s' did not match any files" block * below before enabling new magic. It is not explained why this is not a mere s/read_cache/&_preload/ in the log message. I can see it is because you wanted to make the pathspec available to preload to further cut down the preloaded paths, and I do not think it has any unintended (negatie) side effect to parse the pathspec before populating the in-core index, but that would have been a good thing to mention in the proposed log message. That is correct. parse_pathspec() was after read_cache() because it _used_ to use the index to determine whether a pathspec is in a submodule. That was fixed by Brandon in Aug 2017 when he cleaned up all submodule config code so it is safe to move read_cache_preload() after the call to parse_pathspec(). How about this for a revised commit message? During an "add", a call is made to run_diff_files() which calls check_remove() for each index-entry. The preload_index() code distributes some of the costs across multiple threads. Limit read_cache_preload() to pathspec, so that it doesn't process more entries than are needed and end up slowing things down instead of speeding them up. Because the files checked are restricted to pathspec, adding individual files makes no measurable impact but on a Windows repo with ~200K files, 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix) PATHSPEC_SYMLINK_LEADING_PATH, prefix, argv); + if (read_cache_preload(&pathspec) < 0) + die(_("index file corrupt")); + + die_in_unpopulated_submodule(&the_index, prefix); die_path_inside_submodule(&the_index, &pathspec); if (add_new_files) { base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2
[PATCH v1] add: speed up cmd_add() by utilizing read_cache_preload()
From: Ben Peart During an "add", a call is made to run_diff_files() which calls check_remove() for each index-entry. The preload_index() code distributes some of the costs across multiple threads. Because the files checked are restricted to pathspec, adding individual files makes no measurable impact but on a Windows repo with ~200K files, 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. Signed-off-by: Ben Peart --- Notes: Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/fc4830b545 Checkout: git fetch https://github.com/benpeart/git add-preload-index-v1 && git checkout fc4830b545 builtin/add.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/builtin/add.c b/builtin/add.c index ad49806ebf..f65c172299 100644 --- a/builtin/add.c +++ b/builtin/add.c @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix) return 0; } - if (read_cache() < 0) - die(_("index file corrupt")); - - die_in_unpopulated_submodule(&the_index, prefix); - /* * Check the "pathspec '%s' did not match any files" block * below before enabling new magic. @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix) PATHSPEC_SYMLINK_LEADING_PATH, prefix, argv); + if (read_cache_preload(&pathspec) < 0) + die(_("index file corrupt")); + + die_in_unpopulated_submodule(&the_index, prefix); die_path_inside_submodule(&the_index, &pathspec); if (add_new_files) { base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2 -- 2.18.0.windows.1
Re: [RFC v1] Add virtual file system settings and hook proc
On 10/31/2018 3:11 PM, Duy Nguyen wrote: not really a review, just a couple quick notes.. Perfect! As an RFC, I'm more looking for high level thoughts/notes than a style/syntax code review. On Tue, Oct 30, 2018 at 9:40 PM Ben Peart wrote: From: Ben Peart On index load, clear/set the skip worktree bits based on the virtual file system data. Use virtual file system data to update skip-worktree bit in unpack-trees. Use virtual file system data to exclude files and folders not explicitly requested. Signed-off-by: Ben Peart --- We have taken several steps to make git perform well on very large repos. Some of those steps include: improving underlying algorithms, utilizing multi-threading where possible, and simplifying the behavior of some commands. These changes typically benefit all git repos to varying degrees. While these optimizations all help, they are insufficient to provide adequate performance on the very large repos we often work with. To make git perform well on the very largest repos, we had to make more significant changes. The biggest performance win by far is the work we have done to make git operations O(modified) instead of O(size of repo). This takes advantage of the fact that the number of files a developer has modified is a tiny fraction of the overall repo size. We accomplished this by utilizing the existing internal logic for the skip worktree bit and excludes to tell git to ignore all files and folders other than those that have been modified. This logic is driven by an external process that monitors writes to the repo and communicates the list of files and folders with changes to git via the virtual file system hook in this patch. The external process maintains a list of files and folders that have been modified. When git runs, it requests the list of files and folders that have been modified via the virtual file system hook. Git then sets/clears the skip-worktree bit on the cache entries and builds a hashmap of the modified files/folders that is used by the excludes logic to avoid scanning the entire repo looking for changes and untracked files. With this system, we have been able to make local git command performance on extremely large repos (millions of files, 1/2 million folders) entirely manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second commit, etc). Our desire is to eliminate all custom patches in our fork of git. To that end, I'm submitting this as an RFC to see how much interest there is and how much willingness to take this type of change into git. Most of these paragraphs (perhaps except the last one) should be part of the commit message. You describe briefly what the patch does but it's even more important to say why you want to do it. +core.virtualFilesystem:: + If set, the value of this variable is used as a command which + will identify all files and directories that are present in + the working directory. Git will only track and update files + listed in the virtual file system. Using the virtual file system + will supersede the sparse-checkout settings which will be ignored. + See the "virtual file system" section of linkgit:githooks[6]. It sounds like "virtual file system" is just one of the use cases for this feature, which is more about a dynamic source of sparse-checkout bits. Perhaps name the config key with something along sparse checkout instead of naming it after a use case. It's more than a dynamic sparse-checkout because the same list is also used to exclude any file/folder not listed. That means any file not listed won't ever be updated by git (like in 'checkout' for example) so 'stale' files could be left in the working directory. It also means git won't find new/untracked files unless they are specifically added to the list. This is a hook. I notice we start to avoid adding real hooks and just add config keys instead. Eventually we should have config-based hooks, but if we're going to add more like this, I think these should be in a separate section, hook.virtualFileSystem or something. That is a great idea. I don't personally like specifying the hook as the 'flag' for whether a feature should be used. I'd rather have it be a bool (enable the feature? true/false) and 1) either have the hook name hard coded (like most existing hooks) or 2) as you suggest add a consistent way to have config-based hooks. Config based hooks could also help provide a consistent way to configure them using GIT_TEST_* environment variables for testing. I don't think the superseding makes sense. There's no reason this could not be used in combination with $GIT_DIR/info/sparse-checkout. If you don't want both, disable the other. One last note. Since this is related to filesystem. Shouldn't it be part of fsmonitor (the protocol, not the impleme
Re: [RFC v1] Add virtual file system settings and hook proc
On 10/30/2018 7:07 PM, Junio C Hamano wrote: Ben Peart writes: diff --git a/config.c b/config.c index 4051e38823..96e05ee0f1 100644 --- a/config.c +++ b/config.c ... @@ -2307,6 +2311,37 @@ int git_config_get_index_threads(void) return 0; /* auto */ } +int git_config_get_virtualfilesystem(void) +{ + if (git_config_get_pathname("core.virtualfilesystem", &core_virtualfilesystem)) + core_virtualfilesystem = getenv("GIT_TEST_VIRTUALFILESYSTEM"); + + if (core_virtualfilesystem && !*core_virtualfilesystem) + core_virtualfilesystem = NULL; + + if (core_virtualfilesystem) { + /* +* Some git commands spawn helpers and redirect the index to a different +* location. These include "difftool -d" and the sequencer +* (i.e. `git rebase -i`, `git cherry-pick` and `git revert`) and others. +* In those instances we don't want to update their temporary index with +* our virtualization data. +*/ + char *default_index_file = xstrfmt("%s/%s", the_repository->gitdir, "index"); + int should_run_hook = !strcmp(default_index_file, the_repository->index_file); + + free(default_index_file); + if (should_run_hook) { + /* virtual file system relies on the sparse checkout logic so force it on */ + core_apply_sparse_checkout = 1; + return 1; + } + core_virtualfilesystem = NULL; It would be a small leak if this came from config_get_pathname(), but if it came from $GIT_TEST_VFS env, we cannot free it X-<. A helper function called *_get_X() that does not return X as its return value or updating the location pointed by its *dst parameter, and instead only stores its finding in a global variable feels somewhat odd. It smells more like "find out", "probe", "check", etc. I agree. Frankly, I think it should just return whether it should be used or not (bool) and the hook name should be fixed. I got push back when I did that for fsmonitor so I made this the same in an effort to head off that same feedback. diff --git a/dir.c b/dir.c index 47c2fca8dc..3097b0e446 100644 --- a/dir.c +++ b/dir.c @@ -21,6 +21,7 @@ #include "ewah/ewok.h" #include "fsmonitor.h" #include "submodule-config.h" +#include "virtualfilesystem.h" /* * Tells read_directory_recursive how a file or directory should be treated. @@ -1109,6 +1110,18 @@ int is_excluded_from_list(const char *pathname, struct exclude_list *el, struct index_state *istate) { struct exclude *exclude; + + /* +* The virtual file system data is used to prevent git from traversing +* any part of the tree that is not in the virtual file system. Return +* 1 to exclude the entry if it is not found in the virtual file system, +* else fall through to the regular excludes logic as it may further exclude. +*/ This comment will sit better immediately in front of the call to "is excluded from vfs?" helper function. + if (*dtype == DT_UNKNOWN) + *dtype = get_dtype(NULL, istate, pathname, pathlen); We try to defer paying cost to determine unknown *dtype as late as possible by having this call in last_exclude_matching_from_list(), and not here. If we are doing this, we probably should update the callpaths that call last_exclude_matching_from_list() to make the caller responsible for doing get_dtype() and drop the lazy finding of dtype from the callee. Alternatively, the new "is excluded from vfs" helper can learn to do the lazy get_dtype() just like the existing last_exclude_matching_from_list() does. I suspect the latter may be simpler. In is_excluded_from_virtualfilesystem() dtype can't be lazy because it is always needed (which is why I test and die if it isn't known). I considered doing the test/call to get_dtype() within is_excluded_from_virtualfilesystem() but didn't like making it dependent on istate just so I could move the get_dtype() call in one level. It is functionally identical so I can easily move it in if that is preferred. + if (is_excluded_from_virtualfilesystem(pathname, pathlen, *dtype) > 0) + return 1; + exclude = last_exclude_matching_from_list(pathname, pathlen, basename, dtype, el, istate); if (exclude) @@ -1324,8 +1337,20 @@ struct exclude *last_exclude_matching(struct dir_struct *dir, int is_excluded(struct dir_struct *dir, struct index_state *istate, const char *pathname, int *dtype_p) { - struct exclude *exclude =
[RFC v1] Add virtual file system settings and hook proc
From: Ben Peart On index load, clear/set the skip worktree bits based on the virtual file system data. Use virtual file system data to update skip-worktree bit in unpack-trees. Use virtual file system data to exclude files and folders not explicitly requested. Signed-off-by: Ben Peart --- We have taken several steps to make git perform well on very large repos. Some of those steps include: improving underlying algorithms, utilizing multi-threading where possible, and simplifying the behavior of some commands. These changes typically benefit all git repos to varying degrees. While these optimizations all help, they are insufficient to provide adequate performance on the very large repos we often work with. To make git perform well on the very largest repos, we had to make more significant changes. The biggest performance win by far is the work we have done to make git operations O(modified) instead of O(size of repo). This takes advantage of the fact that the number of files a developer has modified is a tiny fraction of the overall repo size. We accomplished this by utilizing the existing internal logic for the skip worktree bit and excludes to tell git to ignore all files and folders other than those that have been modified. This logic is driven by an external process that monitors writes to the repo and communicates the list of files and folders with changes to git via the virtual file system hook in this patch. The external process maintains a list of files and folders that have been modified. When git runs, it requests the list of files and folders that have been modified via the virtual file system hook. Git then sets/clears the skip-worktree bit on the cache entries and builds a hashmap of the modified files/folders that is used by the excludes logic to avoid scanning the entire repo looking for changes and untracked files. With this system, we have been able to make local git command performance on extremely large repos (millions of files, 1/2 million folders) entirely manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second commit, etc). Our desire is to eliminate all custom patches in our fork of git. To that end, I'm submitting this as an RFC to see how much interest there is and how much willingness to take this type of change into git. Notes: Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/c06b290d2f Checkout: git fetch https://github.com/benpeart/git virtual-filesystem-v1 && git checkout c06b290d2f Documentation/config.txt | 8 + Documentation/githooks.txt | 20 ++ Makefile | 1 + cache.h | 1 + config.c | 37 +++- config.h | 1 + dir.c| 34 +++- environment.c| 1 + read-cache.c | 2 + t/README | 3 + t/t1092-virtualfilesystem.sh | 354 +++ t/t1092/virtualfilesystem| 23 +++ unpack-trees.c | 23 ++- virtualfilesystem.c | 308 ++ virtualfilesystem.h | 25 +++ 15 files changed, 833 insertions(+), 8 deletions(-) create mode 100755 t/t1092-virtualfilesystem.sh create mode 100755 t/t1092/virtualfilesystem create mode 100644 virtualfilesystem.c create mode 100644 virtualfilesystem.h diff --git a/Documentation/config.txt b/Documentation/config.txt index 09e95e9e98..dd4b834375 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -441,6 +441,14 @@ core.fsmonitor:: avoiding unnecessary processing of files that have not changed. See the "fsmonitor-watchman" section of linkgit:githooks[5]. +core.virtualFilesystem:: + If set, the value of this variable is used as a command which + will identify all files and directories that are present in + the working directory. Git will only track and update files + listed in the virtual file system. Using the virtual file system + will supersede the sparse-checkout settings which will be ignored. + See the "virtual file system" section of linkgit:githooks[6]. + core.trustctime:: If false, the ctime differences between the index and the working tree are ignored; useful when the inode change time diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt index 959044347e..87f9ad2a77 100644 --- a/Documentation/githooks.txt +++ b/Documentation/githooks.txt @@ -492,6 +492,26 @@ This hook is invoked by `git-p4 submit`. It takes no parameters and nothing from standard input. Exiting with non-zero status from this script prevent `git-p4 submit` from launching. Run `git-p4 submit --help` for details. +virtualFilesystem +~~ + +"Virtual File System" allows populating the working directory sparsely. +The projection data is typically automatically generated by an exter
[PATCH v1] speed up refresh_index() by utilizing preload_index()
From: Ben Peart Speed up refresh_index() by utilizing preload_index() to do most of the work spread across multiple threads. This works because most cache entries will get marked CE_UPTODATE so that refresh_cache_ent() can bail out early when called from within refresh_index(). On a Windows repo with ~200K files, this drops refresh times from 6.64 seconds to 2.87 seconds for a savings of 57%. Signed-off-by: Ben Peart --- Notes: Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/feee1054c2 Checkout: git fetch https://github.com/benpeart/git refresh-index-multithread-preload-v1 && git checkout feee1054c2 cache.h | 3 +++ preload-index.c | 8 read-cache.c| 6 ++ 3 files changed, 13 insertions(+), 4 deletions(-) diff --git a/cache.h b/cache.h index f7fabdde8f..883099db08 100644 --- a/cache.h +++ b/cache.h @@ -659,6 +659,9 @@ extern int daemonize(void); /* Initialize and use the cache information */ struct lock_file; extern int read_index(struct index_state *); +extern void preload_index(struct index_state *index, + const struct pathspec *pathspec, + unsigned int refresh_flags); extern int read_index_preload(struct index_state *, const struct pathspec *pathspec, unsigned int refresh_flags); diff --git a/preload-index.c b/preload-index.c index 9e7152ab14..222792ccbc 100644 --- a/preload-index.c +++ b/preload-index.c @@ -9,7 +9,7 @@ #include "progress.h" #ifdef NO_PTHREADS -static void preload_index(struct index_state *index, +void preload_index(struct index_state *index, const struct pathspec *pathspec, unsigned int refresh_flags) { @@ -100,9 +100,9 @@ static void *preload_thread(void *_data) return NULL; } -static void preload_index(struct index_state *index, - const struct pathspec *pathspec, - unsigned int refresh_flags) +void preload_index(struct index_state *index, + const struct pathspec *pathspec, + unsigned int refresh_flags) { int threads, i, work, offset; struct thread_data data[MAX_PARALLEL]; diff --git a/read-cache.c b/read-cache.c index d57958233e..53733d651d 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1496,6 +1496,12 @@ int refresh_index(struct index_state *istate, unsigned int flags, typechange_fmt = (in_porcelain ? "T\t%s\n" : "%s needs update\n"); added_fmt = (in_porcelain ? "A\t%s\n" : "%s needs update\n"); unmerged_fmt = (in_porcelain ? "U\t%s\n" : "%s: needs merge\n"); + /* +* Use the multi-threaded preload_index() to refresh most of the +* cache entries quickly then in the single threaded loop below, +* we only have to do the special cases that are left. +*/ + preload_index(istate, pathspec, 0); for (i = 0; i < istate->cache_nr; i++) { struct cache_entry *ce, *new_entry; int cache_errno = 0; base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433 -- 2.9.2.gvfs.2.27918.g0990287eef
Re: [PATCH v2 06/10] preload-index.c: remove #ifdef NO_PTHREADS
On 10/29/2018 1:26 PM, Duy Nguyen wrote: On Mon, Oct 29, 2018 at 6:21 PM Ben Peart wrote: @@ -114,6 +104,9 @@ static void preload_index(struct index_state *index, threads = index->cache_nr / THREAD_COST; if ((index->cache_nr > 1) && (threads < 2) && git_env_bool("GIT_TEST_PRELOAD_INDEX", 0)) threads = 2; + cpus = online_cpus(); + if (threads > cpus) + threads = cpus; if (threads < 2) return; trace_performance_enter(); Capping the number of threads to online_cpus() does not always make sense. In this case (or at least the original use case where we stat() over nfs) we want to issue as many requests to nfs server as possible to reduce latency and it has nothing to do with how many cores we have. Using more threads than cores might make sense since threads are blocked by nfs client, but we still want to send more to the server. That makes sense. Some test will be necessary then. I guess HAVE_THREADS is functionally the same as online_cpus() == 1. For some reason, I prefer the latter - probably because I'm typically already calling it and so it doesn't feel like a 'special' value/test I have to remember. I just know I need to make sure the logic works correctly with any number of cps greater than zero. :-)
Re: [PATCH 09/10] read-cache.c: remove #ifdef NO_PTHREADS
On 10/29/2018 1:21 PM, Duy Nguyen wrote: On Mon, Oct 29, 2018 at 6:05 PM Ben Peart wrote: @@ -2756,8 +2745,11 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0) return -1; -#ifndef NO_PTHREADS - nr_threads = git_config_get_index_threads(); + if (HAVE_THREADS) This shouldn't be needed either. My assumption was that if someone explicitly asked for multiple threads we're better off failing than silently ignoring their request. The default/automatic setting will detect the number of cpus and behave correctly. No. I can have ~/.gitconfig shared between different machines. One with multithread support and one without. I should not have to fall back to conditional includes in order to use the same config file on the machine without multithread. If you want to share the ~/.gitconfig across machines that have different numbers of CPUs, it makes more sense to me to leave the setting at the "auto" setting rather than specifically overriding it to a number that won't work on at least one of your machines. Then it all "just works" without the need to conditional includes.
Re: [PATCH v2 08/10] read-cache.c: remove #ifdef NO_PTHREADS
On 10/29/2018 10:30 AM, Jeff King wrote: On Sat, Oct 27, 2018 at 07:30:06PM +0200, Nguyễn Thái Ngọc Duy wrote: -#ifndef NO_PTHREADS - nr_threads = git_config_get_index_threads(); + if (HAVE_THREADS) { + nr_threads = git_config_get_index_threads(); - /* TODO: does creating more threads than cores help? */ - if (!nr_threads) { - nr_threads = istate->cache_nr / THREAD_COST; - cpus = online_cpus(); - if (nr_threads > cpus) - nr_threads = cpus; + /* TODO: does creating more threads than cores help? */ + if (!nr_threads) { + nr_threads = istate->cache_nr / THREAD_COST; + cpus = online_cpus(); + if (nr_threads > cpus) + nr_threads = cpus; + } + } else { + nr_threads = 1; } I'd have thought we could just rely on online_cpus() returning 1 here to avoid having to ask "do we even have thread support?". But I guess that TODO comment implies that we might one day two 2 threads on a single CPU. -Peff And here is the patch I created when testing out the idea of removing NO_PTHREADS. It doesn't require any 'HAVE_THREADS' tests. diff --git a/read-cache.c b/read-cache.c index 1df5c16dbc..1f088799fc 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1920,19 +1920,15 @@ struct index_entry_offset_table struct index_entry_offset entries[FLEX_ARRAY]; }; -#ifndef NO_PTHREADS static struct index_entry_offset_table *read_ieot_extension(const char *mmap, size_t mmap_size, size_t offset ); static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot); -#endif static size_t read_eoie_extension(const char *mmap, size_t mmap_size); static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset); struct load_index_extensions { -#ifndef NO_PTHREADS pthread_t pthread; -#endif struct index_state *istate; const char *mmap; size_t mmap_size; @@ -2010,8 +2006,6 @@ static unsigned long load_all_cache_entries(struct index_state *istate, return consumed; } -#ifndef NO_PTHREADS - /* * Mostly randomly chosen maximum thread counts: we * cap the parallelism to online_cpus() threads, and we want @@ -2122,7 +2116,6 @@ static unsigned long load_cache_entries_threaded(struct index_state *istate, con return consumed; } -#endif /* remember to discard_cache() before reading a different cache! */ int do_read_index(struct index_state *istate, const char *path, int must_exist) @@ -2135,10 +2128,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) size_t mmap_size; struct load_index_extensions p; size_t extension_offset = 0; -#ifndef NO_PTHREADS int nr_threads, cpus; struct index_entry_offset_table *ieot = NULL; -#endif if (istate->initialized) return istate->cache_nr; @@ -2181,16 +2172,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist ) src_offset = sizeof(*hdr); -#ifndef NO_PTHREADS nr_threads = git_config_get_index_threads(); - - /* TODO: does creating more threads than cores help? */ - if (!nr_threads) { + if (!nr_threads) nr_threads = istate->cache_nr / THREAD_COST; - cpus = online_cpus(); - if (nr_threads > cpus) - nr_threads = cpus; - } + cpus = online_cpus(); + if (nr_threads > cpus) + nr_threads = cpus; if (nr_threads > 1) { extension_offset = read_eoie_extension(mmap, mmap_size); @@ -2219,22 +2206,16 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist ) } else { src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset); } -#else - src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset); -#endif istate->timestamp.sec = st.st_mtime; istate->timestamp.nsec = ST_MTIME_NSEC(st); /* if we created a thread, join it otherwise load the extensions on the primary thread */ -#ifndef NO_PTHREADS if (extension_offset) { int ret = pthread_join(p.pthread, NULL); if (ret) die(_("unable to join load_index_extensions thread: %s"), strerror(ret)); - } -#endif - if (!extension_offset) { + } else { p.src_offset = src_offset; load_index_extensions(&p); } @@ -2756,7 +2737,6 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0) return -1; -#ifndef NO_PTHREADS nr_threads = git_config_get_index_t
Re: [PATCH v2 06/10] preload-index.c: remove #ifdef NO_PTHREADS
On 10/27/2018 1:30 PM, Nguyễn Thái Ngọc Duy wrote: Signed-off-by: Nguyễn Thái Ngọc Duy --- preload-index.c | 15 ++- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/preload-index.c b/preload-index.c index 9e7152ab14..0e24886aca 100644 --- a/preload-index.c +++ b/preload-index.c @@ -7,17 +7,7 @@ #include "fsmonitor.h" #include "config.h" #include "progress.h" - -#ifdef NO_PTHREADS -static void preload_index(struct index_state *index, - const struct pathspec *pathspec, - unsigned int refresh_flags) -{ - ; /* nothing */ -} -#else - -#include +#include "thread-utils.h" /* * Mostly randomly chosen maximum thread counts: we @@ -108,7 +98,7 @@ static void preload_index(struct index_state *index, struct thread_data data[MAX_PARALLEL]; struct progress_data pd; - if (!core_preload_index) + if (!HAVE_THREADS || !core_preload_index) return; threads = index->cache_nr / THREAD_COST; @@ -151,7 +141,6 @@ static void preload_index(struct index_state *index, trace_performance_leave("preload index"); } -#endif int read_index_preload(struct index_state *index, const struct pathspec *pathspec, In the theory that code speaks louder than comments, here is the patch I used when testing out the idea of getting rid of NO_PTHREADS: git diff head~1 preload-index.c diff --git a/preload-index.c b/preload-index.c index 9e7152ab14..266bc9d8d7 100644 --- a/preload-index.c +++ b/preload-index.c @@ -7,17 +7,7 @@ #include "fsmonitor.h" #include "config.h" #include "progress.h" - -#ifdef NO_PTHREADS -static void preload_index(struct index_state *index, - const struct pathspec *pathspec, - unsigned int refresh_flags) -{ - ; /* nothing */ -} -#else - -#include +#include "thread-utils.h" /* * Mostly randomly chosen maximum thread counts: we @@ -104,7 +94,7 @@ static void preload_index(struct index_state *index, const struct pathspec *pathspec, unsigned int refresh_flags) { - int threads, i, work, offset; + int cpus, threads, i, work, offset; struct thread_data data[MAX_PARALLEL]; struct progress_data pd; @@ -114,6 +104,9 @@ static void preload_index(struct index_state *index, threads = index->cache_nr / THREAD_COST; if ((index->cache_nr > 1) && (threads < 2) && git_env_bool("GIT_TEST_PRELOAD_INDEX", 0)) threads = 2; + cpus = online_cpus(); + if (threads > cpus) + threads = cpus; if (threads < 2) return; trace_performance_enter(); @@ -151,7 +144,6 @@ static void preload_index(struct index_state *index, trace_performance_leave("preload index"); } -#endif int read_index_preload(struct index_state *index, const struct pathspec *pathspec,
Re: [PATCH v2 08/10] read-cache.c: remove #ifdef NO_PTHREADS
On 10/29/2018 10:30 AM, Jeff King wrote: On Sat, Oct 27, 2018 at 07:30:06PM +0200, Nguyễn Thái Ngọc Duy wrote: -#ifndef NO_PTHREADS - nr_threads = git_config_get_index_threads(); + if (HAVE_THREADS) { + nr_threads = git_config_get_index_threads(); - /* TODO: does creating more threads than cores help? */ - if (!nr_threads) { - nr_threads = istate->cache_nr / THREAD_COST; - cpus = online_cpus(); - if (nr_threads > cpus) - nr_threads = cpus; + /* TODO: does creating more threads than cores help? */ + if (!nr_threads) { + nr_threads = istate->cache_nr / THREAD_COST; + cpus = online_cpus(); + if (nr_threads > cpus) + nr_threads = cpus; + } + } else { + nr_threads = 1; } I'd have thought we could just rely on online_cpus() returning 1 here to avoid having to ask "do we even have thread support?". But I guess that TODO comment implies that we might one day two 2 threads on a single CPU. -Peff See my earlier response - the HAVE_THREADS tests are not needed. We can remove the "TODO" comment - I tested it and it wasn't faster to have more threads than CPUs.
Re: [PATCH 09/10] read-cache.c: remove #ifdef NO_PTHREADS
On 10/27/2018 3:10 AM, Nguyễn Thái Ngọc Duy wrote: Signed-off-by: Nguyễn Thái Ngọc Duy --- read-cache.c | 49 ++--- 1 file changed, 18 insertions(+), 31 deletions(-) diff --git a/read-cache.c b/read-cache.c index d57958233e..ba870bc3fd 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1920,19 +1920,15 @@ struct index_entry_offset_table struct index_entry_offset entries[FLEX_ARRAY]; }; -#ifndef NO_PTHREADS static struct index_entry_offset_table *read_ieot_extension(const char *mmap, size_t mmap_size, size_t offset); static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot); -#endif static size_t read_eoie_extension(const char *mmap, size_t mmap_size); static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset); struct load_index_extensions { -#ifndef NO_PTHREADS pthread_t pthread; -#endif struct index_state *istate; const char *mmap; size_t mmap_size; @@ -2010,8 +2006,6 @@ static unsigned long load_all_cache_entries(struct index_state *istate, return consumed; } -#ifndef NO_PTHREADS - /* * Mostly randomly chosen maximum thread counts: we * cap the parallelism to online_cpus() threads, and we want @@ -2122,7 +2116,6 @@ static unsigned long load_cache_entries_threaded(struct index_state *istate, con return consumed; } -#endif /* remember to discard_cache() before reading a different cache! */ int do_read_index(struct index_state *istate, const char *path, int must_exist) @@ -2135,10 +2128,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) size_t mmap_size; struct load_index_extensions p; size_t extension_offset = 0; -#ifndef NO_PTHREADS int nr_threads, cpus; struct index_entry_offset_table *ieot = NULL; -#endif if (istate->initialized) return istate->cache_nr; @@ -2181,15 +2172,18 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) src_offset = sizeof(*hdr); -#ifndef NO_PTHREADS - nr_threads = git_config_get_index_threads(); + if (HAVE_THREADS) { In this case, nr_threads is already capped to online_cpus() below so this HAVE_THREADS test isn't needed. + nr_threads = git_config_get_index_threads(); - /* TODO: does creating more threads than cores help? */ - if (!nr_threads) { - nr_threads = istate->cache_nr / THREAD_COST; - cpus = online_cpus(); - if (nr_threads > cpus) - nr_threads = cpus; + /* TODO: does creating more threads than cores help? */ + if (!nr_threads) { + nr_threads = istate->cache_nr / THREAD_COST; + cpus = online_cpus(); + if (nr_threads > cpus) + nr_threads = cpus; + } + } else { + nr_threads = 1; } if (nr_threads > 1) { @@ -2219,21 +2213,16 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) } else { src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset); } -#else - src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset); -#endif istate->timestamp.sec = st.st_mtime; istate->timestamp.nsec = ST_MTIME_NSEC(st); /* if we created a thread, join it otherwise load the extensions on the primary thread */ -#ifndef NO_PTHREADS - if (extension_offset) { + if (HAVE_THREADS && extension_offset) { Here extension_offset won't be set if there wasn't a thread created so the HAVE_THREADS test is not needed. int ret = pthread_join(p.pthread, NULL); if (ret) die(_("unable to join load_index_extensions thread: %s"), strerror(ret)); } -#endif if (!extension_offset) { p.src_offset = src_offset; load_index_extensions(&p); @@ -2756,8 +2745,11 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0) return -1; -#ifndef NO_PTHREADS - nr_threads = git_config_get_index_threads(); + if (HAVE_THREADS) This shouldn't be needed either. My assumption was that if someone explicitly asked for multiple threads we're better off failing than silently ignoring their request. The default/automatic setting will detect the number of cpus and behave correctly. + nr_threads = git_config_get_index_threads(); + else + nr_threads = 1; + if (nr_threads != 1) { int ieot_blocks, cpus; @@ -2787,7 +2779,6 @@ static int do_write_index(struct index_state *istate, struct temp
Re: [PATCH 07/10] preload-index.c: remove #ifdef NO_PTHREADS
On 10/27/2018 3:10 AM, Nguyễn Thái Ngọc Duy wrote: Signed-off-by: Nguyễn Thái Ngọc Duy --- preload-index.c | 15 ++- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/preload-index.c b/preload-index.c index 9e7152ab14..0e24886aca 100644 --- a/preload-index.c +++ b/preload-index.c @@ -7,17 +7,7 @@ #include "fsmonitor.h" #include "config.h" #include "progress.h" - -#ifdef NO_PTHREADS -static void preload_index(struct index_state *index, - const struct pathspec *pathspec, - unsigned int refresh_flags) -{ - ; /* nothing */ -} -#else - -#include +#include "thread-utils.h" /* * Mostly randomly chosen maximum thread counts: we @@ -108,7 +98,7 @@ static void preload_index(struct index_state *index, struct thread_data data[MAX_PARALLEL]; struct progress_data pd; - if (!core_preload_index) + if (!HAVE_THREADS || !core_preload_index) I also "hoped in general that the non-threaded code paths could mostly just collapse into the same as the "threads == 1" cases, and we wouldn't have to "are threads even supported" in a lot of places." In this case, if we cap the threads to online_cpus() the later test for 'if (threads < 2)' would do that. I haven't measured this specific case but in the other cases I have measured, having more threads than cpus did not result in a performance win. return; threads = index->cache_nr / THREAD_COST; @@ -151,7 +141,6 @@ static void preload_index(struct index_state *index, trace_performance_leave("preload index"); } -#endif int read_index_preload(struct index_state *index, const struct pathspec *pathspec,
Re: [PATCH/RFC] thread-utils: better wrapper to avoid #ifdef NO_PTHREADS
On 10/23/2018 4:28 PM, Jeff King wrote: On Thu, Oct 18, 2018 at 08:05:22PM +0200, Nguyễn Thái Ngọc Duy wrote: On Thu, Oct 18, 2018 at 7:09 PM Jeff King wrote: In this particular case though I think we should be able to avoid so much #if if we make a wrapper for pthread api that would return an error or something when pthread is not available. But similar situation may happen elsewhere too. Yeah, I think that is generally the preferred method anyway, just because of readability and simplicity. I've wanted to do this for a while, so let's test the water and see if it's well received. This patch is a proof of concept that adds just enough macros so that I can build index-pack.c on a single thread mode with zero #ifdef related to NO_PTHREADS. Besides readability and simplicity, it reduces the chances of breaking conditional builds (e.g. you rename a variable name but forgot that the variable is in #if block that is not used by your compiler/platform). Yes, I love this. We're already halfway there with things like read_lock() in index-pack and elsewhere, which are conditionally no-ops. The resulting code is much easier to read, I think. I am also very much in favor of this. I updated a couple of places threading is being used that I've been working in (preload-index and read-cache) and both are much simplified using your proof of concept patch. Performance-wise I don't think there is any loss for single thread mode. I rely on compilers recognizing HAVE_THREADS being a constant and remove dead code or at least optimize in favor of non-dead code. Memory-wise, yes we use some more memory in single thread mode. But we don't have zillions of mutexes or thread id, so a bit extra memory does not worry me so much. Yeah, I don't think carrying around a handful of ints is going to be a big deal. Just to be complete, there _is_ an additional cost. Today, code paths that are only executed when there are pthreads available are excluded from the binary (via #ifdef). With this change, those code paths would now be included causing some code bloat to NO_PTHREAD threaded images. One example of this is in read-cache.c where the ieot read/write functions aren't included for NO_PTHREAD but now would be. I also think we may want to make a fundamental shift in our view of thread support. In the early days, it was "well, this is a thing that modern systems can take advantage of for certain commands". But these days I suspect it is more like "there are a handful of legacy systems that do not even support threads". I don't think we should break the build on those legacy systems, but it's probably OK to stop thinking of it as "non-threaded platforms are the default and must pay zero cost" and more as "threaded platforms are the default, and non-threaded ones are OK to pay a small cost as long as they still work". I agree though I'm still curious if there are still no-threaded platforms taking new versions of git. Perhaps we should do the depreciation warning you suggested elsewhere and see how much push back we get. It's unlikely we'd get lucky and be able to stop supporting them completely but it's worth asking! @@ -74,4 +79,29 @@ int init_recursive_mutex(pthread_mutex_t *m) pthread_mutexattr_destroy(&a); } return ret; +#else + return ENOSYS; +#endif +} I suspect some of these ENOSYS could just become a silent success. ("yep, I initialized your dummy mutex"). But it probably doesn't matter much either way, as we would not generally even bother checking this return. +#ifdef NO_PTHREADS +int dummy_pthread_create(pthread_t *pthread, const void *attr, +void *(*fn)(void *), void *data) +{ + return ENOSYS; } Whereas for this one, ENOSYS makes a lot of sense (we should avoid the threaded code-path anyway when we see that online_cpus()==1, and this would let us know when we mess that up). This highlights something anyone writing multi-threaded code will need to pay attention to that wasn't an issue before. If you attempt to create more threads than online_cpus(), the pthread_create() call will fail and needs to be handled gracefully. One example of this is in preload-index.c where (up to) 20 threads are created irrespective of what online_cpus() returns and if pthread_create() fails, it just dies. The logic would need to be updated for this to work correctly. I still think this is a much simpler issue to deal with than what we have today with having to write/debug multiple code paths but I did want to point it out for completeness. +int dummy_pthread_init(void *data) +{ + /* +* Do nothing. +* +* The main purpose of this function is to break compiler's +* flow analysis or it may realize that functions like +* pthread_mutex_init() is no-op, which means the (static) +* variable is not used/initialized at all and trigger +* -Wunused-vari
Re: [PATCH v4 2/3] reset: add new reset.quiet config setting
On 10/25/2018 5:26 AM, Junio C Hamano wrote: Junio C Hamano writes: To be honest, I find the second sentence in your rewrite even more confusing. It reads as if `reset.quiet` configuration variable can be used to restore the "show what is yet to be added" behaviour, due to the parenthetical mention of the default behaviour without any configuration. The command reports what is yet to be added to the index after `reset` by default. It can be made to only report errors with the `--quiet` option, or setting `reset.quiet` configuration variable to `true` (the latter can be overriden with `--no-quiet`). That may not be much better, though X-<. In any case, the comments are getting closer to the bikeshedding territory, that can be easily addressed incrementally. I am getting the impression that everbody agrees that the change is desirable, sufficiently documented and properly implemented. Shall we mark it for "Will merge to 'next'" in the what's cooking report and leave further refinements to incremental updates as needed? While not great, I think it is good enough. I don't think either of the last couple of rewrite attempts were clearly better than what is in the latest patch. I'd agree we should merge to 'next' and if someone comes up with something great, we can update it then.
Re: [PATCH v1] load_cache_entries_threaded: remove unused src_offset parameter
On 10/22/2018 7:05 PM, Junio C Hamano wrote: Jeff King writes: If nobody uses it, should we drop the return value, too? Like: Yup. I'm good with that. At one point I also had the additional #ifndef NO_PTHREADS lines but it was starting to get messy with the threaded vs non-threaded code paths so I removed them. I'm fine with which ever people find more readable. It does make me wonder if there are still platforms taking new build of git that don't support threads. Do we still need to write/test/debug/read through the single threaded code paths? Is the diff Peff sent enough or do you want me to send another iteration on the patch? Thanks, Ben diff --git a/read-cache.c b/read-cache.c index 78c9516eb7..4b44a2eae5 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2052,12 +2052,11 @@ static void *load_cache_entries_thread(void *_data) return NULL; } -static unsigned long load_cache_entries_threaded(struct index_state *istate, const char *mmap, size_t mmap_size, -int nr_threads, struct index_entry_offset_table *ieot) +static void load_cache_entries_threaded(struct index_state *istate, const char *mmap, size_t mmap_size, + int nr_threads, struct index_entry_offset_table *ieot) { int i, offset, ieot_blocks, ieot_start, err; struct load_cache_entries_thread_data *data; - unsigned long consumed = 0; /* a little sanity checking */ if (istate->name_hash_initialized) @@ -2115,12 +2114,9 @@ static unsigned long load_cache_entries_threaded(struct index_state *istate, con if (err) die(_("unable to join load_cache_entries thread: %s"), strerror(err)); mem_pool_combine(istate->ce_mem_pool, p->ce_mem_pool); - consumed += p->consumed; } free(data); - - return consumed; } #endif -Peff
[PATCH v4 0/3] speed up git reset
From: Ben Peart Updated the wording in the documentation and commit messages to (hopefully) make it clearer. Added the warning about 'reset --quiet' to the advice system so that it can be turned off. Base Ref: Web-Diff: https://github.com/benpeart/git/commit/8a2fef45d4 Checkout: git fetch https://github.com/benpeart/git reset-refresh-index-v4 && git checkout 8a2fef45d4 ### Patches Ben Peart (3): reset: don't compute unstaged changes after reset when --quiet reset: add new reset.quiet config setting reset: warn when refresh_index() takes more than 2 seconds Documentation/config.txt| 7 +++ Documentation/git-reset.txt | 5 - advice.c| 2 ++ advice.h| 1 + builtin/reset.c | 15 ++- 5 files changed, 28 insertions(+), 2 deletions(-) base-commit: ca63497355222acefcca02b9cbb540a4768f3286 -- 2.18.0.windows.1
[PATCH v4 1/3] reset: don't compute unstaged changes after reset when --quiet
From: Ben Peart When git reset is run with the --quiet flag, don't bother finding any additional unstaged changes as they won't be output anyway. This speeds up the git reset command by avoiding having to lstat() every file looking for changes that aren't going to be reported anyway. The savings can be significant. In a repo on Windows with 200K files "git reset" drops from 7.16 seconds to 0.32 seconds for a savings of 96%. Signed-off-by: Ben Peart --- builtin/reset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/reset.c b/builtin/reset.c index 11cd0dcb8c..04f0d9b4f5 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -375,7 +375,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (get_git_work_tree()) + if (!quiet && get_git_work_tree()) refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); } else { -- 2.18.0.windows.1
[PATCH v4 3/3] reset: warn when refresh_index() takes more than 2 seconds
From: Ben Peart refresh_index() is done after a reset command as an optimization. Because it can be an expensive call, warn the user if it takes more than 2 seconds and tell them how to avoid it using the --quiet command line option or reset.quiet config setting. Signed-off-by: Ben Peart --- Documentation/config.txt | 4 advice.c | 2 ++ advice.h | 1 + builtin/reset.c | 14 +- 4 files changed, 20 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index a2d1b8b116..415db31def 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -333,6 +333,10 @@ advice.*:: commitBeforeMerge:: Advice shown when linkgit:git-merge[1] refuses to merge to avoid overwriting local changes. + resetQuiet:: + Advice to consider using the `--quiet` option to linkgit:git-reset[1] + when the command takes more than 2 seconds to enumerate unstaged + changes after reset. resolveConflict:: Advice shown by various commands when conflicts prevent the operation from being performed. diff --git a/advice.c b/advice.c index 3561cd64e9..5f35656409 100644 --- a/advice.c +++ b/advice.c @@ -12,6 +12,7 @@ int advice_push_needs_force = 1; int advice_status_hints = 1; int advice_status_u_option = 1; int advice_commit_before_merge = 1; +int advice_reset_quiet_warning = 1; int advice_resolve_conflict = 1; int advice_implicit_identity = 1; int advice_detached_head = 1; @@ -65,6 +66,7 @@ static struct { { "statusHints", &advice_status_hints }, { "statusUoption", &advice_status_u_option }, { "commitBeforeMerge", &advice_commit_before_merge }, + { "resetQuiet", &advice_reset_quiet_warning }, { "resolveConflict", &advice_resolve_conflict }, { "implicitIdentity", &advice_implicit_identity }, { "detachedHead", &advice_detached_head }, diff --git a/advice.h b/advice.h index ab24df0fd0..696bf0e7d2 100644 --- a/advice.h +++ b/advice.h @@ -12,6 +12,7 @@ extern int advice_push_needs_force; extern int advice_status_hints; extern int advice_status_u_option; extern int advice_commit_before_merge; +extern int advice_reset_quiet_warning; extern int advice_resolve_conflict; extern int advice_implicit_identity; extern int advice_detached_head; diff --git a/builtin/reset.c b/builtin/reset.c index 3b43aee544..b31a0eae8a 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -25,6 +25,8 @@ #include "submodule.h" #include "submodule-config.h" +#define REFRESH_INDEX_DELAY_WARNING_IN_MS (2 * 1000) + static const char * const git_reset_usage[] = { N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q] []"), N_("git reset [-q] [] [--] ..."), @@ -376,9 +378,19 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (!quiet && get_git_work_tree()) + if (!quiet && get_git_work_tree()) { + uint64_t t_begin, t_delta_in_ms; + + t_begin = getnanotime(); refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); + t_delta_in_ms = (getnanotime() - t_begin) / 100; + if (advice_reset_quiet_warning && t_delta_in_ms > REFRESH_INDEX_DELAY_WARNING_IN_MS) { + printf(_("\nIt took %.2f seconds to enumerate unstaged changes after reset. You can\n" + "use '--quiet' to avoid this. Set the config setting reset.quiet to true\n" + "to make this the default.\n"), t_delta_in_ms / 1000.0); + } + } } else { int err = reset_index(&oid, reset_type, quiet); if (reset_type == KEEP && !err) -- 2.18.0.windows.1
[PATCH v4 2/3] reset: add new reset.quiet config setting
From: Ben Peart Add a reset.quiet config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. Signed-off-by: Ben Peart --- Documentation/config.txt| 3 +++ Documentation/git-reset.txt | 5 - builtin/reset.c | 1 + 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..a2d1b8b116 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2728,6 +2728,9 @@ rerere.enabled:: `$GIT_DIR`, e.g. if "rerere" was previously used in the repository. +reset.quiet:: + When set to true, 'git reset' will default to the '--quiet' option. + include::sendemail-config.txt[] sequence.editor:: diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt index 1d697d9962..2dac95c71a 100644 --- a/Documentation/git-reset.txt +++ b/Documentation/git-reset.txt @@ -95,7 +95,10 @@ OPTIONS -q:: --quiet:: - Be quiet, only report errors. +--no-quiet:: + Be quiet, only report errors. The default behavior is set by the + `reset.quiet` config option. `--quiet` and `--no-quiet` will + override the default behavior. EXAMPLES diff --git a/builtin/reset.c b/builtin/reset.c index 04f0d9b4f5..3b43aee544 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -306,6 +306,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) }; git_config(git_reset_config, NULL); + git_config_get_bool("reset.quiet", &quiet); argc = parse_options(argc, argv, prefix, options, git_reset_usage, PARSE_OPT_KEEP_DASHDASH); -- 2.18.0.windows.1
Re: [PATCH v3 2/3] reset: add new reset.quiet config setting
On 10/22/2018 10:45 AM, Duy Nguyen wrote: On Mon, Oct 22, 2018 at 3:38 PM Ben Peart wrote: From: Ben Peart Add a reset.quiet config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. Signed-off-by: Ben Peart --- Documentation/config.txt| 3 +++ Documentation/git-reset.txt | 4 +++- builtin/reset.c | 1 + 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..a2d1b8b116 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2728,6 +2728,9 @@ rerere.enabled:: `$GIT_DIR`, e.g. if "rerere" was previously used in the repository. +reset.quiet:: + When set to true, 'git reset' will default to the '--quiet' option. + With 'nd/config-split' topic moving pretty much all config keys out of config.txt, you probably want to do the same for this series: add this in a new file called Documentation/reset-config.txt then include the file here like the sendemail one below. Seems a bit overkill to pull a line of documentation into a separate file and replace it with a line of 'import' logic. Perhaps if/when there is more documentation to pull out that would make more sense. include::sendemail-config.txt[] sequence.editor::
Re: [PATCH v1 2/2] reset: add new reset.quietDefault config setting
On 10/23/2018 5:13 AM, Ævar Arnfjörð Bjarmason wrote: On Wed, Oct 17 2018, Jeff King wrote: On Wed, Oct 17, 2018 at 02:19:59PM -0400, Eric Sunshine wrote: On Wed, Oct 17, 2018 at 12:40 PM Ben Peart wrote: Add a reset.quietDefault config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. As with the previous patch, my knee-jerk reaction is that this really feels wrong being tied to --quiet. It's particularly unintuitive. What I _could_ see, and what would feel more natural is if you add a new option (say, --optimize) which is more general, incorporating whatever optimizations become available in the future, not just this one special-case. A side-effect of --optimize is that it implies --quiet, and that is something which can and should be documented. Heh, I just wrote something very similar elsewhere in the thread. I'm still not sure if it's a dumb idea, but at least we can be dumb together. Same here. I'm in general if favor of having the ability to configure porcelain command-line options, but in this case it seems like it would be more logical to head for something like: core.uiMessaging=[default,exhaustive,lossyButFaster,quiet] Where default would be our current "exhaustive", and this --quiet case would be covered by lossyButFaster, but also things like the "--no-ahead-behind" flag for git-status. This sounds like an easy way to choose a set of default values that we think make sense to get bundled together. That could be a way for users to quickly choose a set of good defaults but I still think you would want find grained control over the individual settings. Coming up with the set of values to bundle together, figuring out the hierarchy of precedence for this new global config->individual config->individual command line, updating the code to make it all work is outside the scope of this particular patch series. Just on this implementation: The usual idiom for flags as config is command.flag=xyz, not command.flagDefault=xyz, so this should be reset.quiet. Thanks, I agree and fixed that in later iterations.
Re: [PATCH v3 2/3] reset: add new reset.quiet config setting
On 10/22/2018 4:06 PM, Jeff King wrote: On Mon, Oct 22, 2018 at 08:13:32PM +0100, Ramsay Jones wrote: -q:: --quiet:: - Be quiet, only report errors. +--no-quiet:: + Be quiet, only report errors. The default behavior respects the + `reset.quiet` config option, or `--no-quiet` if that is not set. Sorry, I can't quite parse this; -q,--quiet and --no-quiet on the command line (should) trump whatever rest.quiet is set to in the configuration. Is that not the case? That is the case, and what was meant by "the default behavior" (i.e., the behavior when none of these is used). Maybe there's a more clear way of saying that. -Peff Is this more clear? -q:: --quiet:: --no-quiet:: Be quiet, only report errors. The default behavior is set by the `reset.quiet` config option. `--quiet` and `--no-quiet` will overwrite the default behavior.
Re: [PATCH v3 3/3] reset: warn when refresh_index() takes more than 2 seconds
On 10/22/2018 8:23 PM, Junio C Hamano wrote: Ben Peart writes: From: Ben Peart refresh_index() is done after a reset command as an optimization. Because it can be an expensive call, warn the user if it takes more than 2 seconds and tell them how to avoid it using the --quiet command line option or reset.quiet config setting. I am moderately negative on this step. It will irritate users who know about and still choose not to use the "--quiet" option, because they want to gain performance in later real work and/or they want to know what paths are now dirty. A working tree that needs long time to refresh will take long time to instead do "cached stat info says it may be modified so let's run 'diff' for real---we may discover that there wasn't any change after all" when a "git diff" is run after a "reset --quiet" that does not refresh; i.e. there would be valid reasons to run "reset" without "--quiet". It feels a bit irresponsible to throw an ad without informing pros-and-cons and to pretend that we are advising on BCP. In general, we do *not* advertise new features randomly like this. Thanks. The previous two steps looks quite sensible. The challenge I'm trying to address is the discoverability of this significant performance win. In earlier review feedback, all mention of this option speeding up reset was removed. I added this patch to enable users to find out it even exists as an option. While I modeled this on the untracked files/--uno and ahead/behind logic, I missed adding this to the 'advice' logic so that it can be turned off and avoid irritating users. I'll send an updated patch that corrects that.
RE: [PATCH v3 1/3] reset: don't compute unstaged changes after reset when --quiet
> -Original Message- > From: Johannes Schindelin > Sent: Monday, October 22, 2018 4:45 PM > To: Ben Peart > Cc: git@vger.kernel.org; gits...@pobox.com; Ben Peart > ; p...@peff.net; sunsh...@sunshineco.com > Subject: Re: [PATCH v3 1/3] reset: don't compute unstaged changes after > reset when --quiet > > Hi Ben, > > On Mon, 22 Oct 2018, Ben Peart wrote: > > > From: Ben Peart > > > > When git reset is run with the --quiet flag, don't bother finding any > > additional unstaged changes as they won't be output anyway. This speeds > up > > the git reset command by avoiding having to lstat() every file looking for > > changes that aren't going to be reported anyway. > > > > The savings can be significant. In a repo with 200K files "git reset" > > drops from 7.16 seconds to 0.32 seconds for a savings of 96%. > > That's very nice! > > Those numbers, just out of curiosity, are they on Windows? Or on Linux? > It's safe to assume all my numbers are on Windows. :-) > Ciao, > Dscho
[PATCH v1] load_cache_entries_threaded: remove unused src_offset parameter
From: Ben Peart Remove the src_offset parameter which is unused as a result of switching to the IEOT table of offsets. Also stop incrementing src_offset in the multi-threaded codepath as it is no longer used and could cause confusion. Signed-off-by: Ben Peart --- Notes: Base Ref: Web-Diff: https://github.com/benpeart/git/commit/7360721408 Checkout: git fetch https://github.com/benpeart/git read-index-multithread-no-src-offset-v1 && git checkout 7360721408 read-cache.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/read-cache.c b/read-cache.c index f9fa6a7979..6db6f0f220 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2037,7 +2037,7 @@ static void *load_cache_entries_thread(void *_data) } static unsigned long load_cache_entries_threaded(struct index_state *istate, const char *mmap, size_t mmap_size, - unsigned long src_offset, int nr_threads, struct index_entry_offset_table *ieot) + int nr_threads, struct index_entry_offset_table *ieot) { int i, offset, ieot_blocks, ieot_start, err; struct load_cache_entries_thread_data *data; @@ -2198,7 +2198,7 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) ieot = read_ieot_extension(mmap, mmap_size, extension_offset); if (ieot) { - src_offset += load_cache_entries_threaded(istate, mmap, mmap_size, src_offset, nr_threads, ieot); + load_cache_entries_threaded(istate, mmap, mmap_size, nr_threads, ieot); free(ieot); } else { src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset); base-commit: f58b85df6937e3f3d9ef26bb52a513c8ada17ffc -- 2.18.0.windows.1
Re: [PATCH v8 7/7] read-cache: load cache entries on worker threads
On 10/21/2018 10:14 PM, Junio C Hamano wrote: Jeff King writes: On Wed, Oct 10, 2018 at 11:59:38AM -0400, Ben Peart wrote: +static unsigned long load_cache_entries_threaded(struct index_state *istate, const char *mmap, size_t mmap_size, + unsigned long src_offset, int nr_threads, struct index_entry_offset_table *ieot) The src_offset parameter isn't used in this function. In early versions of the series, it was used to feed the p->start_offset field of each load_cache_entries_thread_data. But after the switch to ieot, we don't, and instead feed p->ieot_start. But we always begin that at 0. Is that right (and we can drop the parameter), or should this logic: + offset = ieot_start = 0; + ieot_blocks = DIV_ROUND_UP(ieot->nr, nr_threads); + for (i = 0; i < nr_threads; i++) { [...] be starting at src_offset instead of 0? I think "offset" has nothing to do with the offset into the mmapped region of memory. It is an integer index into a (virtual) array that is a concatenation of ieot->entries[].entries[], and it is correct to count from zero. The value taken from that array using the index is used to compute the offset into the mmapped region. Unlike load_all_cache_entries() called from the other side of the same if() statement in the same caller, this does not depend on the fact that the first index entry in the mmapped region appears immediately after the index-file header. It goes from the offsets into the file that are recorded in the entry offset table that is an index extension, so the sizeof(*hdr) that initializes src_offset is not used by the codepath. The number of bytes consumed, i.e. its return value from the function, is not really used, either, as the caller does not use src_offset for anything other than updating it with "+=" and passing it to this function (which does not use it) when it calls this function (i.e. when ieot extension exists--and by definition when that extension exists extension_offset is not 0, so we do not make the final load_index_extensions() call in the caller that uses src_offset). Thanks for discovering/analyzing this. You're right, I missed removing this when we switched from a single offset to an array of offsets via the IEOT. I'll send a patch to fix both issues shortly.
[PATCH v3 2/3] reset: add new reset.quiet config setting
From: Ben Peart Add a reset.quiet config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. Signed-off-by: Ben Peart --- Documentation/config.txt| 3 +++ Documentation/git-reset.txt | 4 +++- builtin/reset.c | 1 + 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..a2d1b8b116 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2728,6 +2728,9 @@ rerere.enabled:: `$GIT_DIR`, e.g. if "rerere" was previously used in the repository. +reset.quiet:: + When set to true, 'git reset' will default to the '--quiet' option. + include::sendemail-config.txt[] sequence.editor:: diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt index 1d697d9962..51a427a34a 100644 --- a/Documentation/git-reset.txt +++ b/Documentation/git-reset.txt @@ -95,7 +95,9 @@ OPTIONS -q:: --quiet:: - Be quiet, only report errors. +--no-quiet:: + Be quiet, only report errors. The default behavior respects the + `reset.quiet` config option, or `--no-quiet` if that is not set. EXAMPLES diff --git a/builtin/reset.c b/builtin/reset.c index 04f0d9b4f5..3b43aee544 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -306,6 +306,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) }; git_config(git_reset_config, NULL); + git_config_get_bool("reset.quiet", &quiet); argc = parse_options(argc, argv, prefix, options, git_reset_usage, PARSE_OPT_KEEP_DASHDASH); -- 2.18.0.windows.1
[PATCH v3 1/3] reset: don't compute unstaged changes after reset when --quiet
From: Ben Peart When git reset is run with the --quiet flag, don't bother finding any additional unstaged changes as they won't be output anyway. This speeds up the git reset command by avoiding having to lstat() every file looking for changes that aren't going to be reported anyway. The savings can be significant. In a repo with 200K files "git reset" drops from 7.16 seconds to 0.32 seconds for a savings of 96%. Signed-off-by: Ben Peart --- builtin/reset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/reset.c b/builtin/reset.c index 11cd0dcb8c..04f0d9b4f5 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -375,7 +375,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (get_git_work_tree()) + if (!quiet && get_git_work_tree()) refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); } else { -- 2.18.0.windows.1
[PATCH v3 3/3] reset: warn when refresh_index() takes more than 2 seconds
From: Ben Peart refresh_index() is done after a reset command as an optimization. Because it can be an expensive call, warn the user if it takes more than 2 seconds and tell them how to avoid it using the --quiet command line option or reset.quiet config setting. Signed-off-by: Ben Peart --- builtin/reset.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/builtin/reset.c b/builtin/reset.c index 3b43aee544..d95a27d52e 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -25,6 +25,8 @@ #include "submodule.h" #include "submodule-config.h" +#define REFRESH_INDEX_DELAY_WARNING_IN_MS (2 * 1000) + static const char * const git_reset_usage[] = { N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q] []"), N_("git reset [-q] [] [--] ..."), @@ -376,9 +378,19 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (!quiet && get_git_work_tree()) + if (!quiet && get_git_work_tree()) { + uint64_t t_begin, t_delta_in_ms; + + t_begin = getnanotime(); refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); + t_delta_in_ms = (getnanotime() - t_begin) / 100; + if (t_delta_in_ms > REFRESH_INDEX_DELAY_WARNING_IN_MS) { + printf(_("\nIt took %.2f seconds to enumerate unstaged changes after reset. You can\n" + "use '--quiet' to avoid this. Set the config setting reset.quiet to true\n" + "to make this the default."), t_delta_in_ms / 1000.0); + } + } } else { int err = reset_index(&oid, reset_type, quiet); if (reset_type == KEEP && !err) -- 2.18.0.windows.1
[PATCH v3 0/3] speed up git reset
From: Ben Peart Reworded the documentation for git-reset per review feedback. Base Ref: Web-Diff: https://github.com/benpeart/git/commit/1228898917 Checkout: git fetch https://github.com/benpeart/git reset-refresh-index-v3 && git checkout 1228898917 ### Interdiff (v2..v3): diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt index 1d697d9962..51a427a34a 100644 --- a/Documentation/git-reset.txt +++ b/Documentation/git-reset.txt @@ -95,7 +95,9 @@ OPTIONS -q:: --quiet:: - Be quiet, only report errors. +--no-quiet:: + Be quiet, only report errors. The default behavior respects the + `reset.quiet` config option, or `--no-quiet` if that is not set. EXAMPLES ### Patches Ben Peart (3): reset: don't compute unstaged changes after reset when --quiet reset: add new reset.quiet config setting reset: warn when refresh_index() takes more than 2 seconds Documentation/config.txt| 3 +++ Documentation/git-reset.txt | 4 +++- builtin/reset.c | 15 ++- 3 files changed, 20 insertions(+), 2 deletions(-) base-commit: ca63497355222acefcca02b9cbb540a4768f3286 -- 2.18.0.windows.1
Re: [PATCH v2 2/3] reset: add new reset.quiet config setting
On 10/19/2018 1:11 PM, Jeff King wrote: On Fri, Oct 19, 2018 at 01:10:34PM -0400, Eric Sunshine wrote: On Fri, Oct 19, 2018 at 12:46 PM Jeff King wrote: On Fri, Oct 19, 2018 at 12:36:44PM -0400, Eric Sunshine wrote: How does the user reverse this for a particular git-reset invocation? There is no --no-quiet or --verbose option. Perhaps you want to use OPT__VERBOSITY() instead of OPT__QUIET() in builtin/reset.c and document that --verbose overrides --quiet and reset.quiet (or something like that). I think OPT__QUIET() provides --no-quiet, since it's really an OPT_COUNTUP() under the hood. Saying "--no-quiet" should reset it back to 0. Okay. In any case, --no-quiet probably ought to be mentioned alongside the "reset.quiet" option (and perhaps in git-reset.txt to as a way to reverse "reset.quiet"). Yes, I'd agree with that. -Peff Makes sense. I'll update the docs to say: -q:: --quiet:: --no-quiet:: Be quiet, only report errors. + With --no-quiet report errors and unstaged changes after reset.
Re: [PATCH v2 2/3] reset: add new reset.quiet config setting
On 10/19/2018 12:46 PM, Jeff King wrote: On Fri, Oct 19, 2018 at 12:36:44PM -0400, Eric Sunshine wrote: On Fri, Oct 19, 2018 at 12:12 PM Ben Peart wrote: Add a reset.quiet config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. Signed-off-by: Ben Peart --- diff --git a/Documentation/config.txt b/Documentation/config.txt @@ -2728,6 +2728,9 @@ rerere.enabled:: +reset.quiet:: + When set to true, 'git reset' will default to the '--quiet' option. How does the user reverse this for a particular git-reset invocation? There is no --no-quiet or --verbose option. Perhaps you want to use OPT__VERBOSITY() instead of OPT__QUIET() in builtin/reset.c and document that --verbose overrides --quiet and reset.quiet (or something like that). I think OPT__QUIET() provides --no-quiet, since it's really an OPT_COUNTUP() under the hood. Saying "--no-quiet" should reset it back to 0. Thanks Peff. That is correct as confirmed by: C:\Repos\VSO\src>git reset --no-quiet Unstaged changes after reset: M init.ps1 It took 6.74 seconds to enumerate unstaged changes after reset. You can use '--quiet' to avoid this. Set the config setting reset.quiet to true to make this the default. -Peff
[PATCH v2 2/3] reset: add new reset.quiet config setting
From: Ben Peart Add a reset.quiet config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. Signed-off-by: Ben Peart --- Documentation/config.txt | 3 +++ builtin/reset.c | 1 + 2 files changed, 4 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..a2d1b8b116 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2728,6 +2728,9 @@ rerere.enabled:: `$GIT_DIR`, e.g. if "rerere" was previously used in the repository. +reset.quiet:: + When set to true, 'git reset' will default to the '--quiet' option. + include::sendemail-config.txt[] sequence.editor:: diff --git a/builtin/reset.c b/builtin/reset.c index 04f0d9b4f5..3b43aee544 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -306,6 +306,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) }; git_config(git_reset_config, NULL); + git_config_get_bool("reset.quiet", &quiet); argc = parse_options(argc, argv, prefix, options, git_reset_usage, PARSE_OPT_KEEP_DASHDASH); -- 2.18.0.windows.1
[PATCH v2 1/3] reset: don't compute unstaged changes after reset when --quiet
From: Ben Peart When git reset is run with the --quiet flag, don't bother finding any additional unstaged changes as they won't be output anyway. This speeds up the git reset command by avoiding having to lstat() every file looking for changes that aren't going to be reported anyway. The savings can be significant. In a repo with 200K files "git reset" drops from 7.16 seconds to 0.32 seconds for a savings of 96%. Signed-off-by: Ben Peart --- builtin/reset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/reset.c b/builtin/reset.c index 11cd0dcb8c..04f0d9b4f5 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -375,7 +375,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (get_git_work_tree()) + if (!quiet && get_git_work_tree()) refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); } else { -- 2.18.0.windows.1
[PATCH v2 3/3] reset: warn when refresh_index() takes more than 2 seconds
From: Ben Peart refresh_index() is done after a reset command as an optimization. Because it can be an expensive call, warn the user if it takes more than 2 seconds and tell them how to avoid it using the --quiet command line option or reset.quiet config setting. Signed-off-by: Ben Peart --- builtin/reset.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/builtin/reset.c b/builtin/reset.c index 3b43aee544..d95a27d52e 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -25,6 +25,8 @@ #include "submodule.h" #include "submodule-config.h" +#define REFRESH_INDEX_DELAY_WARNING_IN_MS (2 * 1000) + static const char * const git_reset_usage[] = { N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q] []"), N_("git reset [-q] [] [--] ..."), @@ -376,9 +378,19 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (!quiet && get_git_work_tree()) + if (!quiet && get_git_work_tree()) { + uint64_t t_begin, t_delta_in_ms; + + t_begin = getnanotime(); refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); + t_delta_in_ms = (getnanotime() - t_begin) / 100; + if (t_delta_in_ms > REFRESH_INDEX_DELAY_WARNING_IN_MS) { + printf(_("\nIt took %.2f seconds to enumerate unstaged changes after reset. You can\n" + "use '--quiet' to avoid this. Set the config setting reset.quiet to true\n" + "to make this the default."), t_delta_in_ms / 1000.0); + } + } } else { int err = reset_index(&oid, reset_type, quiet); if (reset_type == KEEP && !err) -- 2.18.0.windows.1
[PATCH v2 0/3] speed up git reset
From: Ben Peart This itteration avoids the refresh_index() call completely if 'quiet'. The advantage of this is that "git refresh" without any pathspec is also significantly sped up. Also added a notification if finding unstaged changes after reset takes longer than 2 seconds to make users aware of the option to speed it up if they don't need the unstaged changes after reset to be output. It also renames the new config setting reset.quietDefault to reset.quiet. Base Ref: Web-Diff: https://github.com/benpeart/git/commit/50d3415ef1 Checkout: git fetch https://github.com/benpeart/git reset-refresh-index-v2 && git checkout 50d3415ef1 ### Interdiff (v1..v2): diff --git a/Documentation/config.txt b/Documentation/config.txt index a5cf4c019b..a2d1b8b116 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2728,11 +2728,8 @@ rerere.enabled:: `$GIT_DIR`, e.g. if "rerere" was previously used in the repository. -reset.quietDefault:: - Sets the default value of the "quiet" option for the reset command. - Choosing "quiet" can optimize the performance of the reset command - by avoiding the scan of all files in the repo looking for additional - unstaged changes. Defaults to false. +reset.quiet:: + When set to true, 'git reset' will default to the '--quiet' option. include::sendemail-config.txt[] diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt index 8610309b55..1d697d9962 100644 --- a/Documentation/git-reset.txt +++ b/Documentation/git-reset.txt @@ -95,9 +95,7 @@ OPTIONS -q:: --quiet:: - Be quiet, only report errors. Can optimize the performance of reset - by avoiding scaning all files in the repo looking for additional - unstaged changes. + Be quiet, only report errors. EXAMPLES diff --git a/builtin/reset.c b/builtin/reset.c index 7d151d48a0..d95a27d52e 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -25,6 +25,8 @@ #include "submodule.h" #include "submodule-config.h" +#define REFRESH_INDEX_DELAY_WARNING_IN_MS (2 * 1000) + static const char * const git_reset_usage[] = { N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q] []"), N_("git reset [-q] [] [--] ..."), @@ -306,7 +308,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) }; git_config(git_reset_config, NULL); - git_config_get_bool("reset.quietDefault", &quiet); + git_config_get_bool("reset.quiet", &quiet); argc = parse_options(argc, argv, prefix, options, git_reset_usage, PARSE_OPT_KEEP_DASHDASH); @@ -376,9 +378,19 @@ int cmd_reset(int argc, const char **argv, const char *prefix) int flags = quiet ? REFRESH_QUIET : REFRESH_IN_PORCELAIN; if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; - if (get_git_work_tree()) - refresh_index(&the_index, flags, quiet ? &pathspec : NULL, NULL, + if (!quiet && get_git_work_tree()) { + uint64_t t_begin, t_delta_in_ms; + + t_begin = getnanotime(); + refresh_index(&the_index, flags, NULL, NULL, _("Unstaged changes after reset:")); + t_delta_in_ms = (getnanotime() - t_begin) / 100; + if (t_delta_in_ms > REFRESH_INDEX_DELAY_WARNING_IN_MS) { + printf(_("\nIt took %.2f seconds to enumerate unstaged changes after reset. You can\n" + "use '--quiet' to avoid this. Set the config setting reset.quiet to true\n" + "to make this the default."), t_delta_in_ms / 1000.0); + } + } } else { int err = reset_index(&oid, reset_type, quiet); if (reset_type == KEEP && !err) ### Patches Ben Peart (3): reset: don't compute unstaged changes after reset when --quiet reset: add new reset.quiet config setting reset: warn when refresh_index() takes more than 2 seconds Documentation/config.txt | 3 +++ builtin/reset.c | 15 ++- 2 files changed, 17 insertions(+), 1 deletion(-) base-commit: ca63497355222acefcca02b9cbb540a4768f3286 -- 2.18.0.windows.1
Re: [PATCH v1 1/2] reset: don't compute unstaged changes after reset when --quiet
On 10/18/2018 2:26 PM, Duy Nguyen wrote: On Thu, Oct 18, 2018 at 8:18 PM Ben Peart wrote: I actually started my effort to speed up reset by attempting to multi-thread refresh_index(). You can see a work in progress at: https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs The patch doesn't always work as it is still not thread safe. When it works, it's great but I ran into to many difficulties trying to debug the remaining threading issues (even adding print statements would change the timing and the repro would disappear). It will take a lot of code review to discover and fix the remaining non-thread safe code paths. In addition, the optimized code path that takes advantage of fsmonitor, uses multiple threads, fscache, etc _already exists_ in preload_index(). Trying to recreate all those optimizations in refresh_index() is (as I discovered) a daunting task. Why not make refresh_index() run preload_index() first (or the parallel lstat part to be precise), and only do the heavy content-based refresh in single thread mode? Head smack! Why didn't I think of that? That is a terrific suggestion. Calling preload_index() right before the big for loop in refresh_index() is a trivial and effective way to do the bulk of the updating with the optimized code. After doing that, most of the cache entries can bail out quickly down in refresh_cache_ent() when it tests ce_uptodate(ce). Here are the numbers using that optimization (hot cache, averaged across 3 runs): 0.32 git add asdf 1.67 git reset asdf 1.68 git status 3.67 Total vs without it: 0.32 git add asdf 2.48 git reset asdf 1.50 git status 4.30 Total For a savings in the reset command of 32% and 15% overall. Clearly doing the refresh_index() faster is not as much savings as not doing it at all. Given how simple this patch is, I think it makes sense to do both so that we have optimized each path to is fullest.
Re: [PATCH v1 1/2] reset: don't compute unstaged changes after reset when --quiet
On 10/18/2018 2:36 AM, Jeff King wrote: On Thu, Oct 18, 2018 at 12:40:48PM +0900, Junio C Hamano wrote: Jeff King writes: Whereas for the new config variable, you'd probably set it not because you want it quiet all the time, but because you want to get some time savings. So there it does make sense to me to explain. Other than that, this seems like an obvious and easy win. It does feel a little hacky (you're really losing something in the output, and ideally we'd just be able to give that answer quickly), but this may be OK as a hack in the interim. After "git reset --quiet -- this/area/" with this change, any operation you'd do next that needs to learn if working tree files are different from what is recorded in the index outside that area will have to spend more cycles, because the refresh done by "reset" is now limited to the area. So if your final goal is "make 'reset' as fast as possible", this is an obvious and easy win. For other goals, i.e. "make the overall experience of using Git feel faster", it is not so obvious to me, though. The final goal is to make git faster (especially on larger repos) and this proposal accomplishes that. Let's look at why that is. By scoping down (or eliminating) what refresh_index() has to lstat() at the end of the reset command, clearly the reset command is faster. Yes, the index isn't as "fresh" because not everything was updated but that doesn't typically impact the performance of subsequent commands. On the next command, git still has to lstat() every file because it isn't sure what changes could have happened in the file system. As a result, the overall impact is that we have had to lstat() every file one fewer times between the two commands. A net win overall. In addition, the preload_index() code that does the lstat() command is highly optimized across multiple threads (and on Windows takes advantage of the fscache). This means that it can lstat() every file _much_ faster than the single threaded loop in refresh_index(). This also makes the overall performance of the pair of git commands faster as we got rid of the slow lstat() loop and kept the fast one. Here are some numbers to demonstrate that. These are hot cache numbers as they are easier to generate. Cold cache numbers make the net perf win significantly better as the cost for the reset jumps from 2.43 seconds to 7.16 seconds. 0.32 git add asdf 0.31 git -c reset.quiet=true reset asdf 1.34 git status 1.97 Total 0.32 git add asdf 2.43 git -c reset.quiet=false reset asdf 1.32 git status 4.07 Total Note the status command after the reset doesn't really change as it still must lstat() every file (the 0.02 difference is well within the variability of run to run differences). FWIW, none of these numbers are using fsmonitor. There was additional discussion about whether this should be tied to the "--quiet" option and how it should be documented. One option would be to change the default behavior of reset so that it doesn't do the refresh_index() call at all. This speeds up reset by default so there are no user discoverability issues but changes the default behavior which is an issue. Another option that was suggested was to add a separate flag that could be passed to reset so that the "quiet" and "fast" options don't get conflated. I don't care for that option because the two options (at this point and for the foreseeable future) would be identical in behavior from the end users perspective. It was also suggested that there be a single "fast and quiet" option for all of git instead of separate options for each command. I worry about that because now we're forcing users to choose between the "fast and quiet" version of git and the "slow and noisy" version. How do we help them decide which they want? That seems difficult to explain so that they can make a rational choice and also hard to discover. I also have to wonder who would say "give me the slow and noisy version please." :) I'd prefer we systematically move towards a model where the default values that are chosen for various settings throughout the code are all configurable via settings. All defaults by necessity make certain assumptions about user preference, data shape, machine performance, etc and if those assumptions don't match the user's environment then the hard coded defaults aren't appropriate. We do our best but its going to be hit or miss. A consistent way to be able to change those defaults would be very useful in those circumstances. To be clear, I'm not proposing we do a wholesale update of our preferences model at this point in time - that seems like a significant undertaking and I don't want to tie this specific optimization to a potential change in how default settings work. To move this forward, here is what I propose: 1) If the '--quiet' flag is passed, we silently take advantage of the fact we can avoid having to do an "extra" lstat()
[PATCH v1 2/2] reset: add new reset.quietDefault config setting
From: Ben Peart Add a reset.quietDefault config setting that sets the default value of the --quiet flag when running the reset command. This enables users to change the default behavior to take advantage of the performance advantages of avoiding the scan for unstaged changes after reset. Defaults to false. Signed-off-by: Ben Peart --- Documentation/config.txt | 6 ++ builtin/reset.c | 1 + 2 files changed, 7 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..a5cf4c019b 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2728,6 +2728,12 @@ rerere.enabled:: `$GIT_DIR`, e.g. if "rerere" was previously used in the repository. +reset.quietDefault:: + Sets the default value of the "quiet" option for the reset command. + Choosing "quiet" can optimize the performance of the reset command + by avoiding the scan of all files in the repo looking for additional + unstaged changes. Defaults to false. + include::sendemail-config.txt[] sequence.editor:: diff --git a/builtin/reset.c b/builtin/reset.c index 0822798616..7d151d48a0 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -306,6 +306,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) }; git_config(git_reset_config, NULL); + git_config_get_bool("reset.quietDefault", &quiet); argc = parse_options(argc, argv, prefix, options, git_reset_usage, PARSE_OPT_KEEP_DASHDASH); -- 2.18.0.windows.1
[PATCH v1 1/2] reset: don't compute unstaged changes after reset when --quiet
From: Ben Peart When git reset is run with the --quiet flag, don't bother finding any additional unstaged changes as they won't be output anyway. This speeds up the git reset command by avoiding having to lstat() every file looking for changes that aren't going to be reported anyway. The savings can be significant. In a repo with 200K files "git reset" drops from 7.16 seconds to 0.32 seconds for a savings of 96%. Signed-off-by: Ben Peart --- Documentation/git-reset.txt | 4 +++- builtin/reset.c | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt index 1d697d9962..8610309b55 100644 --- a/Documentation/git-reset.txt +++ b/Documentation/git-reset.txt @@ -95,7 +95,9 @@ OPTIONS -q:: --quiet:: - Be quiet, only report errors. + Be quiet, only report errors. Can optimize the performance of reset + by avoiding scaning all files in the repo looking for additional + unstaged changes. EXAMPLES diff --git a/builtin/reset.c b/builtin/reset.c index 11cd0dcb8c..0822798616 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -376,7 +376,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; if (get_git_work_tree()) - refresh_index(&the_index, flags, NULL, NULL, + refresh_index(&the_index, flags, quiet ? &pathspec : NULL, NULL, _("Unstaged changes after reset:")); } else { int err = reset_index(&oid, reset_type, quiet); -- 2.18.0.windows.1
[PATCH v1 0/2] speed up git reset
From: Ben Peart The reset (mixed) command unstages the specified file(s) and then shows you the remaining unstaged changes. This can make the command slow on larger repos because at the end it calls refresh_index() which has a single thread that loops through all the entries calling lstat() for every file. If the user passes the --quiet switch, reset doesn�t display the remaining unstaged changes but it still does all the work to find them, it just doesn�t print them out so passing "--quiet" doesn�t help performance. This patch series will: 1) change the behavior of "git reset --quiet" so that it no longer computes the remaining unstaged changes. 2) add a new config setting so that "--quiet" can be configured as the default so that the default performance of "git reset" is improved. The performance benefit of this can be significant. In a repo with 200K files "git reset foo" performance drops from 7.16 seconds to 0.32 seconds for a savings of 96%. Even with the small git repo, reset times drop from 0.191 seconds to 0.043 seconds for a savings of 77%. Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/2295a310d0 Checkout: git fetch https://github.com/benpeart/git reset-refresh-index-v1 && git checkout 2295a310d0 Ben Peart (2): reset: don't compute unstaged changes after reset when --quiet reset: add new reset.quietDefault config setting Documentation/config.txt| 6 ++ Documentation/git-reset.txt | 4 +++- builtin/reset.c | 3 ++- 3 files changed, 11 insertions(+), 2 deletions(-) base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e -- 2.18.0.windows.1
Re: [PATCH v8 0/7] speed up index load through parallelization
fixup! IEOT error messages Enable localizing new error messages and improve the error message for invalid IEOT extension sizes. Signed-off-by: Ben Peart --- read-cache.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/read-cache.c b/read-cache.c index 7acc2c86f4..f9fa6a7979 100644 --- a/read-cache.c +++ b/read-cache.c @@ -3480,7 +3480,7 @@ static struct index_entry_offset_table *read_ieot_extension(const char *mmap, si /* validate the version is IEOT_VERSION */ ext_version = get_be32(index); if (ext_version != IEOT_VERSION) { - error("invalid IEOT version %d", ext_version); + error(_("invalid IEOT version %d"), ext_version); return NULL; } index += sizeof(uint32_t); @@ -3488,7 +3488,7 @@ static struct index_entry_offset_table *read_ieot_extension(const char *mmap, si /* extension size - version bytes / bytes per entry */ nr = (extsize - sizeof(uint32_t)) / (sizeof(uint32_t) + sizeof(uint32_t)); if (!nr) { - error("invalid number of IEOT entries %d", nr); + error(_("invalid IEOT extension size %d"), extsize); return NULL; } ieot = xmalloc(sizeof(struct index_entry_offset_table) -- 2.18.0.windows.1 On 10/14/2018 8:28 AM, Duy Nguyen wrote: On Wed, Oct 10, 2018 at 5:59 PM Ben Peart wrote: @@ -3460,14 +3479,18 @@ static struct index_entry_offset_table *read_ieot_extension(const char *mmap, si /* validate the version is IEOT_VERSION */ ext_version = get_be32(index); - if (ext_version != IEOT_VERSION) + if (ext_version != IEOT_VERSION) { + error("invalid IEOT version %d", ext_version); Please wrap this string in _() so that it can be translated. return NULL; + } index += sizeof(uint32_t); /* extension size - version bytes / bytes per entry */ nr = (extsize - sizeof(uint32_t)) / (sizeof(uint32_t) + sizeof(uint32_t)); - if (!nr) + if (!nr) { + error("invalid number of IEOT entries %d", nr); Ditto. And reporting extsize may be more useful than nr, which we know is zero, but we don't know why it's calculated zero unless we know extsize.
[PATCH v8 6/7] ieot: add Index Entry Offset Table (IEOT) extension
From: Ben Peart This patch enables addressing the CPU cost of loading the index by adding additional data to the index that will allow us to efficiently multi- thread the loading and conversion of cache entries. It accomplishes this by adding an (optional) index extension that is a table of offsets to blocks of cache entries in the index file. To make this work for V4 indexes, when writing the cache entries, it periodically "resets" the prefix-compression by encoding the current entry as if the path name for the previous entry is completely different and saves the offset of that entry in the IEOT. Basically, with V4 indexes, it generates offsets into blocks of prefix-compressed entries. Signed-off-by: Ben Peart --- Documentation/technical/index-format.txt | 18 +++ read-cache.c | 196 ++- 2 files changed, 211 insertions(+), 3 deletions(-) diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt index 6bc2d90f7f..7c4d67aa6a 100644 --- a/Documentation/technical/index-format.txt +++ b/Documentation/technical/index-format.txt @@ -337,3 +337,21 @@ The remaining data of each directory block is grouped by type: SHA-1("TREE" + + "REUC" + ) + +== Index Entry Offset Table + + The Index Entry Offset Table (IEOT) is used to help address the CPU + cost of loading the index by enabling multi-threading the process of + converting cache entries from the on-disk format to the in-memory format. + The signature for this extension is { 'I', 'E', 'O', 'T' }. + + The extension consists of: + + - 32-bit version (currently 1) + + - A number of index offset entries each consisting of: + +- 32-bit offset from the begining of the file to the first cache entry + in this block of entries. + +- 32-bit count of cache entries in this block diff --git a/read-cache.c b/read-cache.c index 2214b3153d..3ace29d58f 100644 --- a/read-cache.c +++ b/read-cache.c @@ -45,6 +45,7 @@ #define CACHE_EXT_UNTRACKED 0x554E5452 /* "UNTR" */ #define CACHE_EXT_FSMONITOR 0x46534D4E /* "FSMN" */ #define CACHE_EXT_ENDOFINDEXENTRIES 0x454F4945 /* "EOIE" */ +#define CACHE_EXT_INDEXENTRYOFFSETTABLE 0x49454F54 /* "IEOT" */ /* changes that can be kept in $GIT_DIR/index (basically all extensions) */ #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \ @@ -1696,6 +1697,7 @@ static int read_index_extension(struct index_state *istate, read_fsmonitor_extension(istate, data, sz); break; case CACHE_EXT_ENDOFINDEXENTRIES: + case CACHE_EXT_INDEXENTRYOFFSETTABLE: /* already handled in do_read_index() */ break; default: @@ -1888,6 +1890,23 @@ static size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) return ondisk_size + entries * per_entry; } +struct index_entry_offset +{ + /* starting byte offset into index file, count of index entries in this block */ + int offset, nr; +}; + +struct index_entry_offset_table +{ + int nr; + struct index_entry_offset entries[FLEX_ARRAY]; +}; + +#ifndef NO_PTHREADS +static struct index_entry_offset_table *read_ieot_extension(const char *mmap, size_t mmap_size, size_t offset); +static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot); +#endif + static size_t read_eoie_extension(const char *mmap, size_t mmap_size); static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset); @@ -1929,6 +1948,15 @@ static void *load_index_extensions(void *_data) return NULL; } +/* + * Mostly randomly chosen maximum thread counts: we + * cap the parallelism to online_cpus() threads, and we want + * to have at least 1 cache entries per thread for it to + * be worth starting a thread. + */ + +#define THREAD_COST(1) + /* remember to discard_cache() before reading a different cache! */ int do_read_index(struct index_state *istate, const char *path, int must_exist) { @@ -2521,6 +2549,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, struct strbuf previous_name_buf = STRBUF_INIT, *previous_name; int drop_cache_tree = istate->drop_cache_tree; off_t offset; + int ieot_blocks = 1; + struct index_entry_offset_table *ieot = NULL; + int nr, nr_threads; for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) @@ -2554,10 +2585,44 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0) return -1; +#ifndef NO_PTHREADS + nr_threads = git_config_get_index_threads(); + if (nr_threads != 1) {
[PATCH v8 5/7] read-cache: load cache extensions on a worker thread
From: Ben Peart This patch helps address the CPU cost of loading the index by loading the cache extensions on a worker thread in parallel with loading the cache entries. In some cases, loading the extensions takes longer than loading the cache entries so this patch utilizes the new EOIE to start the thread to load the extensions before loading all the cache entries in parallel. This is possible because the current extensions don't access the cache entries in the index_state structure so are OK that they don't all exist yet. The CACHE_EXT_TREE, CACHE_EXT_RESOLVE_UNDO, and CACHE_EXT_UNTRACKED extensions don't even get a pointer to the index so don't have access to the cache entries. CACHE_EXT_LINK only uses the index_state to initialize the split index. CACHE_EXT_FSMONITOR only uses the index_state to save the fsmonitor last update and dirty flags. I used p0002-read-cache.sh to generate some performance data: Test w/100,000 files reduced the time by 0.53% Test w/1,000,000 files reduced the time by 27.78% Signed-off-by: Ben Peart --- read-cache.c | 95 +++- 1 file changed, 79 insertions(+), 16 deletions(-) diff --git a/read-cache.c b/read-cache.c index 4781515252..2214b3153d 100644 --- a/read-cache.c +++ b/read-cache.c @@ -23,6 +23,7 @@ #include "split-index.h" #include "utf8.h" #include "fsmonitor.h" +#include "thread-utils.h" /* Mask for the name length in ce_flags in the on-disk index */ @@ -1890,6 +1891,44 @@ static size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) static size_t read_eoie_extension(const char *mmap, size_t mmap_size); static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset); +struct load_index_extensions +{ +#ifndef NO_PTHREADS + pthread_t pthread; +#endif + struct index_state *istate; + const char *mmap; + size_t mmap_size; + unsigned long src_offset; +}; + +static void *load_index_extensions(void *_data) +{ + struct load_index_extensions *p = _data; + unsigned long src_offset = p->src_offset; + + while (src_offset <= p->mmap_size - the_hash_algo->rawsz - 8) { + /* After an array of active_nr index entries, +* there can be arbitrary number of extended +* sections, each of which is prefixed with +* extension name (4-byte) and section length +* in 4-byte network byte order. +*/ + uint32_t extsize = get_be32(p->mmap + src_offset + 4); + if (read_index_extension(p->istate, +p->mmap + src_offset, +p->mmap + src_offset + 8, +extsize) < 0) { + munmap((void *)p->mmap, p->mmap_size); + die(_("index file corrupt")); + } + src_offset += 8; + src_offset += extsize; + } + + return NULL; +} + /* remember to discard_cache() before reading a different cache! */ int do_read_index(struct index_state *istate, const char *path, int must_exist) { @@ -1900,6 +1939,11 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) const char *mmap; size_t mmap_size; const struct cache_entry *previous_ce = NULL; + struct load_index_extensions p; + size_t extension_offset = 0; +#ifndef NO_PTHREADS + int nr_threads; +#endif if (istate->initialized) return istate->cache_nr; @@ -1936,6 +1980,30 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) istate->cache = xcalloc(istate->cache_alloc, sizeof(*istate->cache)); istate->initialized = 1; + p.istate = istate; + p.mmap = mmap; + p.mmap_size = mmap_size; + +#ifndef NO_PTHREADS + nr_threads = git_config_get_index_threads(); + if (!nr_threads) + nr_threads = online_cpus(); + + if (nr_threads > 1) { + extension_offset = read_eoie_extension(mmap, mmap_size); + if (extension_offset) { + int err; + + p.src_offset = extension_offset; + err = pthread_create(&p.pthread, NULL, load_index_extensions, &p); + if (err) + die(_("unable to create load_index_extensions thread: %s"), strerror(err)); + + nr_threads--; + } + } +#endif + if (istate->version == 4) { mem_pool_init(&istate->ce_mem_pool, estimate_cache_size_from_compressed(istate->cache_nr)); @@ -1960,22 +2028,17 @@ i
[PATCH v8 7/7] read-cache: load cache entries on worker threads
From: Ben Peart This patch helps address the CPU cost of loading the index by utilizing the Index Entry Offset Table (IEOT) to divide loading and conversion of the cache entries across multiple threads in parallel. I used p0002-read-cache.sh to generate some performance data: Test w/100,000 files reduced the time by 32.24% Test w/1,000,000 files reduced the time by -4.77% Note that on the 1,000,000 files case, multi-threading the cache entry parsing does not yield a performance win. This is because the cost to parse the index extensions in this repo, far outweigh the cost of loading the cache entries. The high cost of parsing the index extensions is driven by the cache tree and the untracked cache extensions. As this is currently the longest pole, any reduction in this time will reduce the overall index load times so is worth further investigation in another patch series. Signed-off-by: Ben Peart --- read-cache.c | 230 ++- 1 file changed, 193 insertions(+), 37 deletions(-) diff --git a/read-cache.c b/read-cache.c index 3ace29d58f..7acc2c86f4 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1720,7 +1720,8 @@ int read_index(struct index_state *istate) return read_index_from(istate, get_index_file(), get_git_dir()); } -static struct cache_entry *create_from_disk(struct index_state *istate, +static struct cache_entry *create_from_disk(struct mem_pool *ce_mem_pool, + unsigned int version, struct ondisk_cache_entry *ondisk, unsigned long *ent_size, const struct cache_entry *previous_ce) @@ -1737,7 +1738,7 @@ static struct cache_entry *create_from_disk(struct index_state *istate, * number of bytes to be stripped from the end of the previous name, * and the bytes to append to the result, to come up with its name. */ - int expand_name_field = istate->version == 4; + int expand_name_field = version == 4; /* On-disk flags are just 16 bits */ flags = get_be16(&ondisk->flags); @@ -1761,16 +1762,17 @@ static struct cache_entry *create_from_disk(struct index_state *istate, const unsigned char *cp = (const unsigned char *)name; size_t strip_len, previous_len; - previous_len = previous_ce ? previous_ce->ce_namelen : 0; + /* If we're at the begining of a block, ignore the previous name */ strip_len = decode_varint(&cp); - if (previous_len < strip_len) { - if (previous_ce) + if (previous_ce) { + previous_len = previous_ce->ce_namelen; + if (previous_len < strip_len) die(_("malformed name field in the index, near path '%s'"), - previous_ce->name); - else - die(_("malformed name field in the index in the first path")); + previous_ce->name); + copy_len = previous_len - strip_len; + } else { + copy_len = 0; } - copy_len = previous_len - strip_len; name = (const char *)cp; } @@ -1780,7 +1782,7 @@ static struct cache_entry *create_from_disk(struct index_state *istate, len += copy_len; } - ce = mem_pool__ce_alloc(istate->ce_mem_pool, len); + ce = mem_pool__ce_alloc(ce_mem_pool, len); ce->ce_stat_data.sd_ctime.sec = get_be32(&ondisk->ctime.sec); ce->ce_stat_data.sd_mtime.sec = get_be32(&ondisk->mtime.sec); @@ -1948,6 +1950,52 @@ static void *load_index_extensions(void *_data) return NULL; } +/* + * A helper function that will load the specified range of cache entries + * from the memory mapped file and add them to the given index. + */ +static unsigned long load_cache_entry_block(struct index_state *istate, + struct mem_pool *ce_mem_pool, int offset, int nr, const char *mmap, + unsigned long start_offset, const struct cache_entry *previous_ce) +{ + int i; + unsigned long src_offset = start_offset; + + for (i = offset; i < offset + nr; i++) { + struct ondisk_cache_entry *disk_ce; + struct cache_entry *ce; + unsigned long consumed; + + disk_ce = (struct ondisk_cache_entry *)(mmap + src_offset); + ce = create_from_disk(ce_mem_pool, istate->version, disk_ce, &consumed, previous_ce); + set_index_entry(istate, i, ce); + + src_offset += consumed; +
[PATCH v8 4/7] config: add new index.threads config setting
From: Ben Peart Add support for a new index.threads config setting which will be used to control the threading code in do_read_index(). A value of 0 will tell the index code to automatically determine the correct number of threads to use. A value of 1 will make the code single threaded. A value greater than 1 will set the maximum number of threads to use. For testing purposes, this setting can be overwritten by setting the GIT_TEST_INDEX_THREADS= environment variable to a value greater than 0. Signed-off-by: Ben Peart --- Documentation/config.txt | 7 +++ config.c | 18 ++ config.h | 1 + t/README | 5 + t/t1700-split-index.sh | 5 + 5 files changed, 36 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index ad0f4510c3..8fd973b76b 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -2413,6 +2413,13 @@ imap:: The configuration variables in the 'imap' section are described in linkgit:git-imap-send[1]. +index.threads:: + Specifies the number of threads to spawn when loading the index. + This is meant to reduce index load time on multiprocessor machines. + Specifying 0 or 'true' will cause Git to auto-detect the number of + CPU's and set the number of threads accordingly. Specifying 1 or + 'false' will disable multithreading. Defaults to 'true'. + index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. diff --git a/config.c b/config.c index 3461993f0a..2ee29f6f86 100644 --- a/config.c +++ b/config.c @@ -2289,6 +2289,24 @@ int git_config_get_fsmonitor(void) return 0; } +int git_config_get_index_threads(void) +{ + int is_bool, val = 0; + + val = git_env_ulong("GIT_TEST_INDEX_THREADS", 0); + if (val) + return val; + + if (!git_config_get_bool_or_int("index.threads", &is_bool, &val)) { + if (is_bool) + return val ? 0 : 1; + else + return val; + } + + return 0; /* auto */ +} + NORETURN void git_die_config_linenr(const char *key, const char *filename, int linenr) { diff --git a/config.h b/config.h index ab46e0165d..a06027e69b 100644 --- a/config.h +++ b/config.h @@ -250,6 +250,7 @@ extern int git_config_get_untracked_cache(void); extern int git_config_get_split_index(void); extern int git_config_get_max_percent_split_change(void); extern int git_config_get_fsmonitor(void); +extern int git_config_get_index_threads(void); /* This dies if the configured or default date is in the future */ extern int git_config_get_expiry(const char *key, const char **output); diff --git a/t/README b/t/README index 3ea6c85460..8f5c0620ea 100644 --- a/t/README +++ b/t/README @@ -327,6 +327,11 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to be written after every 'git commit' command, and overrides the 'core.commitGraph' setting to true. +GIT_TEST_INDEX_THREADS= enables exercising the multi-threaded loading +of the index for the whole test suite by bypassing the default number of +cache entries and thread minimums. Setting this to 1 will make the +index loading single threaded. + Naming Tests diff --git a/t/t1700-split-index.sh b/t/t1700-split-index.sh index 8e17f8e7a0..ef9349bd70 100755 --- a/t/t1700-split-index.sh +++ b/t/t1700-split-index.sh @@ -6,7 +6,12 @@ test_description='split index mode tests' # We need total control of index splitting here sane_unset GIT_TEST_SPLIT_INDEX + +# Testing a hard coded SHA against an index with an extension +# that can vary from run to run is problematic so we disable +# those extensions. sane_unset GIT_FSMONITOR_TEST +sane_unset GIT_TEST_INDEX_THREADS test_expect_success 'enable split index' ' git config splitIndex.maxPercentChange 100 && -- 2.18.0.windows.1
[PATCH v8 2/7] read-cache: clean up casting and byte decoding
From: Ben Peart This patch does a clean up pass to minimize the casting required to work with the memory mapped index (mmap). It also makes the decoding of network byte order more consistent by using get_be32() where possible. Signed-off-by: Ben Peart --- read-cache.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/read-cache.c b/read-cache.c index 583a4fb1f8..6ba99e2c96 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1650,7 +1650,7 @@ int verify_index_checksum; /* Allow fsck to force verification of the cache entry order. */ int verify_ce_order; -static int verify_hdr(struct cache_header *hdr, unsigned long size) +static int verify_hdr(const struct cache_header *hdr, unsigned long size) { git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; @@ -1674,7 +1674,7 @@ static int verify_hdr(struct cache_header *hdr, unsigned long size) } static int read_index_extension(struct index_state *istate, - const char *ext, void *data, unsigned long sz) + const char *ext, const char *data, unsigned long sz) { switch (CACHE_EXT(ext)) { case CACHE_EXT_TREE: @@ -1889,8 +1889,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) int fd, i; struct stat st; unsigned long src_offset; - struct cache_header *hdr; - void *mmap; + const struct cache_header *hdr; + const char *mmap; size_t mmap_size; const struct cache_entry *previous_ce = NULL; @@ -1918,7 +1918,7 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) die_errno("unable to map index file"); close(fd); - hdr = mmap; + hdr = (const struct cache_header *)mmap; if (verify_hdr(hdr, mmap_size) < 0) goto unmap; @@ -1943,7 +1943,7 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) struct cache_entry *ce; unsigned long consumed; - disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset); + disk_ce = (struct ondisk_cache_entry *)(mmap + src_offset); ce = create_from_disk(istate, disk_ce, &consumed, previous_ce); set_index_entry(istate, i, ce); @@ -1961,21 +1961,20 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) * in 4-byte network byte order. */ uint32_t extsize; - memcpy(&extsize, (char *)mmap + src_offset + 4, 4); - extsize = ntohl(extsize); + extsize = get_be32(mmap + src_offset + 4); if (read_index_extension(istate, -(const char *) mmap + src_offset, -(char *) mmap + src_offset + 8, +mmap + src_offset, +mmap + src_offset + 8, extsize) < 0) goto unmap; src_offset += 8; src_offset += extsize; } - munmap(mmap, mmap_size); + munmap((void *)mmap, mmap_size); return istate->cache_nr; unmap: - munmap(mmap, mmap_size); + munmap((void *)mmap, mmap_size); die("index file corrupt"); } -- 2.18.0.windows.1
[PATCH v8 1/7] read-cache.c: optimize reading index format v4
From: Nguyễn Thái Ngọc Duy Index format v4 requires some more computation to assemble a path based on a previous one. The current code is not very efficient because - it doubles memory copy, we assemble the final path in a temporary first before putting it back to a cache_entry - strbuf_remove() in expand_name_field() is not exactly a good fit for stripping a part at the end, _setlen() would do the same job and is much cheaper. - the open-coded loop to find the end of the string in expand_name_field() can't beat an optimized strlen() This patch avoids the temporary buffer and writes directly to the new cache_entry, which addresses the first two points. The last point could also be avoided if the total string length fits in the first 12 bits of ce_flags, if not we fall back to strlen(). Running "test-tool read-cache 100" on webkit.git (275k files), reading v2 only takes 4.226 seconds, while v4 takes 5.711 seconds, 35% more time. The patch reduces read time on v4 to 4.319 seconds. Signed-off-by: Nguyễn Thái Ngọc Duy --- read-cache.c | 128 --- 1 file changed, 60 insertions(+), 68 deletions(-) diff --git a/read-cache.c b/read-cache.c index 8d04d78a58..583a4fb1f8 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1713,63 +1713,24 @@ int read_index(struct index_state *istate) return read_index_from(istate, get_index_file(), get_git_dir()); } -static struct cache_entry *cache_entry_from_ondisk(struct mem_pool *mem_pool, - struct ondisk_cache_entry *ondisk, - unsigned int flags, - const char *name, - size_t len) -{ - struct cache_entry *ce = mem_pool__ce_alloc(mem_pool, len); - - ce->ce_stat_data.sd_ctime.sec = get_be32(&ondisk->ctime.sec); - ce->ce_stat_data.sd_mtime.sec = get_be32(&ondisk->mtime.sec); - ce->ce_stat_data.sd_ctime.nsec = get_be32(&ondisk->ctime.nsec); - ce->ce_stat_data.sd_mtime.nsec = get_be32(&ondisk->mtime.nsec); - ce->ce_stat_data.sd_dev = get_be32(&ondisk->dev); - ce->ce_stat_data.sd_ino = get_be32(&ondisk->ino); - ce->ce_mode = get_be32(&ondisk->mode); - ce->ce_stat_data.sd_uid = get_be32(&ondisk->uid); - ce->ce_stat_data.sd_gid = get_be32(&ondisk->gid); - ce->ce_stat_data.sd_size = get_be32(&ondisk->size); - ce->ce_flags = flags & ~CE_NAMEMASK; - ce->ce_namelen = len; - ce->index = 0; - hashcpy(ce->oid.hash, ondisk->sha1); - memcpy(ce->name, name, len); - ce->name[len] = '\0'; - return ce; -} - -/* - * Adjacent cache entries tend to share the leading paths, so it makes - * sense to only store the differences in later entries. In the v4 - * on-disk format of the index, each on-disk cache entry stores the - * number of bytes to be stripped from the end of the previous name, - * and the bytes to append to the result, to come up with its name. - */ -static unsigned long expand_name_field(struct strbuf *name, const char *cp_) -{ - const unsigned char *ep, *cp = (const unsigned char *)cp_; - size_t len = decode_varint(&cp); - - if (name->len < len) - die("malformed name field in the index"); - strbuf_remove(name, name->len - len, len); - for (ep = cp; *ep; ep++) - ; /* find the end */ - strbuf_add(name, cp, ep - cp); - return (const char *)ep + 1 - cp_; -} - -static struct cache_entry *create_from_disk(struct mem_pool *mem_pool, +static struct cache_entry *create_from_disk(struct index_state *istate, struct ondisk_cache_entry *ondisk, unsigned long *ent_size, - struct strbuf *previous_name) + const struct cache_entry *previous_ce) { struct cache_entry *ce; size_t len; const char *name; unsigned int flags; + size_t copy_len; + /* +* Adjacent cache entries tend to share the leading paths, so it makes +* sense to only store the differences in later entries. In the v4 +* on-disk format of the index, each on-disk cache entry stores the +* number of bytes to be stripped from the end of the previous name, +* and the bytes to append to the result, to come up with its name. +*/ + int expand_name_field = istate->version == 4; /* On-disk flags are just 16 bits */ flags = get_be16(&ondisk->flags); @@ -1789,21 +1750,54 @@ static struct cache_entry *create_from_disk(struct mem_pool *mem_pool, else name = ondisk->name; - if (!previous_name) { - /* v3 and earlier */ - if (len == CE_NAMEMASK) -
[PATCH v8 3/7] eoie: add End of Index Entry (EOIE) extension
From: Ben Peart The End of Index Entry (EOIE) is used to locate the end of the variable length index entries and the beginning of the extensions. Code can take advantage of this to quickly locate the index extensions without having to parse through all of the index entries. The EOIE extension is always written out to the index file including to the shared index when using the split index feature. Because it is always written out, the SHA checksums in t/t1700-split-index.sh were updated to reflect its inclusion. It is written as an optional extension to ensure compatibility with other git implementations that do not yet support it. It is always written out to ensure it is available as often as possible to speed up index operations. Because it must be able to be loaded before the variable length cache entries and other index extensions, this extension must be written last. The signature for this extension is { 'E', 'O', 'I', 'E' }. The extension consists of: - 32-bit offset to the end of the index entries - 160-bit SHA-1 over the extension types and their sizes (but not their contents). E.g. if we have "TREE" extension that is N-bytes long, "REUC" extension that is M-bytes long, followed by "EOIE", then the hash would be: SHA-1("TREE" + + "REUC" + ) Signed-off-by: Ben Peart --- Documentation/technical/index-format.txt | 23 read-cache.c | 158 +-- t/t1700-split-index.sh | 8 +- 3 files changed, 177 insertions(+), 12 deletions(-) diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt index db3572626b..6bc2d90f7f 100644 --- a/Documentation/technical/index-format.txt +++ b/Documentation/technical/index-format.txt @@ -314,3 +314,26 @@ The remaining data of each directory block is grouped by type: - An ewah bitmap, the n-th bit indicates whether the n-th index entry is not CE_FSMONITOR_VALID. + +== End of Index Entry + + The End of Index Entry (EOIE) is used to locate the end of the variable + length index entries and the begining of the extensions. Code can take + advantage of this to quickly locate the index extensions without having + to parse through all of the index entries. + + Because it must be able to be loaded before the variable length cache + entries and other index extensions, this extension must be written last. + The signature for this extension is { 'E', 'O', 'I', 'E' }. + + The extension consists of: + + - 32-bit offset to the end of the index entries + + - 160-bit SHA-1 over the extension types and their sizes (but not + their contents). E.g. if we have "TREE" extension that is N-bytes + long, "REUC" extension that is M-bytes long, followed by "EOIE", + then the hash would be: + + SHA-1("TREE" + + + "REUC" + ) diff --git a/read-cache.c b/read-cache.c index 6ba99e2c96..4781515252 100644 --- a/read-cache.c +++ b/read-cache.c @@ -43,6 +43,7 @@ #define CACHE_EXT_LINK 0x6c696e6b/* "link" */ #define CACHE_EXT_UNTRACKED 0x554E5452 /* "UNTR" */ #define CACHE_EXT_FSMONITOR 0x46534D4E /* "FSMN" */ +#define CACHE_EXT_ENDOFINDEXENTRIES 0x454F4945 /* "EOIE" */ /* changes that can be kept in $GIT_DIR/index (basically all extensions) */ #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \ @@ -1693,6 +1694,9 @@ static int read_index_extension(struct index_state *istate, case CACHE_EXT_FSMONITOR: read_fsmonitor_extension(istate, data, sz); break; + case CACHE_EXT_ENDOFINDEXENTRIES: + /* already handled in do_read_index() */ + break; default: if (*ext < 'A' || 'Z' < *ext) return error("index uses %.4s extension, which we do not understand", @@ -1883,6 +1887,9 @@ static size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) return ondisk_size + entries * per_entry; } +static size_t read_eoie_extension(const char *mmap, size_t mmap_size); +static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset); + /* remember to discard_cache() before reading a different cache! */ int do_read_index(struct index_state *istate, const char *path, int must_exist) { @@ -2190,11 +2197,15 @@ static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len) return 0; } -static int write_index_ext_header(git_hash_ctx *context, int fd, - unsigned int ext, unsigned int sz) +static int write_index_ext_header(git_hash_ctx *context, git_hash_ctx *eoie_context, + int fd, unsigned int
[PATCH v8 0/7] speed up index load through parallelization
From: Ben Peart Fixed issues identified in review the most impactful probably being plugging some leaks and improved error handling. Also added better error messages and some code cleanup to code I'd touched. The biggest change in the interdiff is the impact of renaming ieot_offset to ieot_start and ieot_work to ieot_blocks in hopes of making it easier to read and understand the code. Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/6caa0bac46 Checkout: git fetch https://github.com/benpeart/git read-index-multithread-v8 && git checkout 6caa0bac46 ### Interdiff (v7..v8): diff --git a/read-cache.c b/read-cache.c index 14402a0738..7acc2c86f4 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1901,7 +1901,7 @@ struct index_entry_offset struct index_entry_offset_table { int nr; - struct index_entry_offset entries[0]; + struct index_entry_offset entries[FLEX_ARRAY]; }; #ifndef NO_PTHREADS @@ -1935,9 +1935,7 @@ static void *load_index_extensions(void *_data) * extension name (4-byte) and section length * in 4-byte network byte order. */ - uint32_t extsize; - memcpy(&extsize, p->mmap + src_offset + 4, 4); - extsize = ntohl(extsize); + uint32_t extsize = get_be32(p->mmap + src_offset + 4); if (read_index_extension(p->istate, p->mmap + src_offset, p->mmap + src_offset + 8, @@ -2015,8 +2013,8 @@ struct load_cache_entries_thread_data int offset; const char *mmap; struct index_entry_offset_table *ieot; - int ieot_offset;/* starting index into the ieot array */ - int ieot_work; /* count of ieot entries to process */ + int ieot_start; /* starting index into the ieot array */ + int ieot_blocks;/* count of ieot entries to process */ unsigned long consumed; /* return # of bytes in index file processed */ }; @@ -2030,8 +2028,9 @@ static void *load_cache_entries_thread(void *_data) int i; /* iterate across all ieot blocks assigned to this thread */ - for (i = p->ieot_offset; i < p->ieot_offset + p->ieot_work; i++) { - p->consumed += load_cache_entry_block(p->istate, p->ce_mem_pool, p->offset, p->ieot->entries[i].nr, p->mmap, p->ieot->entries[i].offset, NULL); + for (i = p->ieot_start; i < p->ieot_start + p->ieot_blocks; i++) { + p->consumed += load_cache_entry_block(p->istate, p->ce_mem_pool, + p->offset, p->ieot->entries[i].nr, p->mmap, p->ieot->entries[i].offset, NULL); p->offset += p->ieot->entries[i].nr; } return NULL; @@ -2040,48 +2039,45 @@ static void *load_cache_entries_thread(void *_data) static unsigned long load_cache_entries_threaded(struct index_state *istate, const char *mmap, size_t mmap_size, unsigned long src_offset, int nr_threads, struct index_entry_offset_table *ieot) { - int i, offset, ieot_work, ieot_offset, err; + int i, offset, ieot_blocks, ieot_start, err; struct load_cache_entries_thread_data *data; unsigned long consumed = 0; - int nr; /* a little sanity checking */ if (istate->name_hash_initialized) BUG("the name hash isn't thread safe"); mem_pool_init(&istate->ce_mem_pool, 0); - data = xcalloc(nr_threads, sizeof(struct load_cache_entries_thread_data)); /* ensure we have no more threads than we have blocks to process */ if (nr_threads > ieot->nr) nr_threads = ieot->nr; - data = xcalloc(nr_threads, sizeof(struct load_cache_entries_thread_data)); + data = xcalloc(nr_threads, sizeof(*data)); - offset = ieot_offset = 0; - ieot_work = DIV_ROUND_UP(ieot->nr, nr_threads); + offset = ieot_start = 0; + ieot_blocks = DIV_ROUND_UP(ieot->nr, nr_threads); for (i = 0; i < nr_threads; i++) { struct load_cache_entries_thread_data *p = &data[i]; - int j; + int nr, j; - if (ieot_offset + ieot_work > ieot->nr) - ieot_work = ieot->nr - ieot_offset; + if (ieot_start + ieot_blocks > ieot->nr) + ieot_blocks = ieot->nr - ieot_start; p->istate = istate; p->offset = offset; p->mmap = mmap; p->ieot = ieot; - p->ieot_offset = ieot_offset; - p->ieot_work = ieot_work; + p->ieot_start = ieot_start; + p->ieot_blocks = ieot_blocks;
Re: [PATCH] unpack-trees: allow missing CE_SKIP_WORKTREE objs
On 10/9/2018 5:30 AM, Junio C Hamano wrote: Jonathan Tan writes: @@ -1635,6 +1635,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options o->result.cache_tree = cache_tree(); if (!cache_tree_fully_valid(o->result.cache_tree)) cache_tree_update(&o->result, + WRITE_TREE_SKIP_WORKTREE_MISSING_OK | WRITE_TREE_SILENT | WRITE_TREE_REPAIR); } H. Should this be passing the bit unconditionally? Shouldn't it be set only when we are doing lazy clone? A non-lazy clone that merely uses sparse checkout has nowhere else to turn to if it loses a blob object that currently is not necessary to complete a checkout, unlike a repository with promisor. I agree. I believe this logic should only be triggered when running in a partial clone repo. Otherwise, we're potentially changing the behavior of 'normal' repos unnecessarily.
Re: [PATCH] unpack-trees: allow missing CE_SKIP_WORKTREE objs
On 10/8/2018 5:48 PM, Jonathan Tan wrote: Whenever a sparse checkout occurs, the existence of all blobs in the index is verified, whether or not they are included or excluded by the .git/info/sparse-checkout specification. This degrades performance, significantly in the case of a partial clone, because a lazy fetch occurs whenever the existence of a missing blob is checked. At the point of invoking cache_tree_update() in unpack_trees(), CE_SKIP_WORKTREE is already set on all excluded blobs (mark_new_skip_worktree() is called twice to set CE_NEW_SKIP_WORKTREE, then apply_sparse_checkout() is called which copies over CE_NEW_SKIP_WORKTREE to CE_SKIP_WORKTREE). cache_tree_update() can use this information to know which blobs are excluded, and thus skip the checking of these. Because cache_tree_update() is used from multiple places, this new behavior is guarded by a new flag WRITE_TREE_SKIP_WORKTREE_MISSING_OK. Implement this new flag, and teach unpack_trees() to invoke cache_tree_update() with this new flag. I wonder if preventing the download of all missing blobs should be limited to only the checkout command. When you looked at the other places that call cache_tree_update(), does it make sense that they trigger the download of all the missing blobs? For example, I suspect you would not want them all downloaded just to do a merge-recursive. In full disclosure, we implemented this a couple of years ago [1] for GVFS and opted to do the logic a little differently. We found that we didn't want to trigger the download of all missing blobs in cache_tree_update() for _any_ command that was executing in a partially cloned repo. This is safe because if any individual blob is actually needed, it will get downloaded "on demand" already. [1] https://github.com/Microsoft/git/commit/ec865500d98 Signed-off-by: Jonathan Tan --- cache-tree.c | 6 +- cache-tree.h | 4 t/t1090-sparse-checkout-scope.sh | 33 unpack-trees.c | 1 + 4 files changed, 43 insertions(+), 1 deletion(-) diff --git a/cache-tree.c b/cache-tree.c index 5ce51468f0..340caf2d34 100644 --- a/cache-tree.c +++ b/cache-tree.c @@ -246,6 +246,8 @@ static int update_one(struct cache_tree *it, int missing_ok = flags & WRITE_TREE_MISSING_OK; int dryrun = flags & WRITE_TREE_DRY_RUN; int repair = flags & WRITE_TREE_REPAIR; + int skip_worktree_missing_ok = + flags & WRITE_TREE_SKIP_WORKTREE_MISSING_OK; int to_invalidate = 0; int i; @@ -356,7 +358,9 @@ static int update_one(struct cache_tree *it, } if (is_null_oid(oid) || - (mode != S_IFGITLINK && !missing_ok && !has_object_file(oid))) { + (mode != S_IFGITLINK && !missing_ok && +!(skip_worktree_missing_ok && ce_skip_worktree(ce)) && +!has_object_file(oid))) { strbuf_release(&buffer); if (expected_missing) return -1; diff --git a/cache-tree.h b/cache-tree.h index 0ab6784ffe..655d487619 100644 --- a/cache-tree.h +++ b/cache-tree.h @@ -40,6 +40,10 @@ void cache_tree_verify(struct index_state *); #define WRITE_TREE_DRY_RUN 4 #define WRITE_TREE_SILENT 8 #define WRITE_TREE_REPAIR 16 +/* + * Do not check for the presence of cache entry objects with CE_SKIP_WORKTREE. + */ +#define WRITE_TREE_SKIP_WORKTREE_MISSING_OK 32 /* error return codes */ #define WRITE_TREE_UNREADABLE_INDEX (-1) diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh index 25d7c700f6..090b7fc3d3 100755 --- a/t/t1090-sparse-checkout-scope.sh +++ b/t/t1090-sparse-checkout-scope.sh @@ -63,4 +63,37 @@ test_expect_success 'return to full checkout of master' ' test "$(cat b)" = "modified" ' +test_expect_success 'in partial clone, sparse checkout only fetches needed blobs' ' + test_create_repo server && + git clone "file://$(pwd)/server" client && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + echo a >server/a && + echo bb >server/b && + mkdir server/c && + echo ccc >server/c/c && + git -C server add a b c/c && + git -C server commit -m message && + + test_config -C client core.sparsecheckout 1 && + test_config -C client extensions.partialclone origin && + echo "!/*" >client/.git/info/sparse-checkout && + echo "/a" >>client/.git/info/sparse-checkout && + git -C client fetch --filter=blob:none origin && + git -C client checkout FETCH_HEAD && + + git -C client rev-list HEAD \ + --quiet --objects --missing=print >unsorted_actual && + ( + printf "?" && + git hash-object server/b && + printf "?" && +