Re: Git 2.8.1 fails test 32 of t7300-clean.sh, breaks profile build
Thanks for fixing the missing SANITY prerequisite Stefan. As for the error handling logic in setup.c: is_nonbare_repository_dir (was clean.c: is_git_repository) my reasoning is as follows: READ_GITFILE_ERR_STAT_FAILED READ_GITFILE_ERR_NOT_A_FILE: When checking random paths for .git files these are the common error modes, file is not there or it is a directory. This should not be interpreted as a valid .git file. READ_GITFILE_ERR_OPEN_FAILED READ_GITFILE_ERR_READ_FAILED: Here we found a .git file but could not open and read it to verify that it is valid. Treating it as valid is the safest option for clean. For resolve_gitlink_ref I think it maybe leads to the creation of a redundant ref cache entries but I don't think this is a problem unless someone has a huge amount of unreadable .git files lying around. READ_GITFILE_ERR_TOO_LARGE: File is absurdly large (1MB), very unlikely to be a valid .git file. READ_GITFILE_ERR_INVALID_FORMAT READ_GITFILE_ERR_NO_PATH: File is malformed in some way, either the "gitdir:" prefix is missing or the path is missing. Could theoretically be a corrupted .git file but seems unlikely. READ_GITFILE_ERR_NOT_A_REPO: This is maybe a bit more suspicious. Here we have found a .git file, it has the format "gitdir: some/path" but is_git_directory("some/path") returned false, meaning that "some/path" does not fulfill: /* * Test if it looks like we're at a git directory. * We want to see: * * - either an objects/ directory _or_ the proper *GIT_OBJECT_DIRECTORY environment variable * - a refs/ directory * - either a HEAD symlink or a HEAD file that is formatted as *a proper "ref:", or a regular file HEAD that has a properly *formatted sha1 object name. */ Technically we don't have a valid .git file here but we have something that really tries to be. I guess it is debatable what the correct conservative choice is here and if it is the same for all callers. /Erik -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t7300: fix broken && chains
On Mon, Aug 31, 2015 at 8:54 PM, Jeff King <p...@peff.net> wrote: > On Sun, Aug 30, 2015 at 11:18:09AM +0200, Erik Elfström wrote: > > Unfortunately CHAIN_LINT cannot reach inside a nested subshell. I cannot > think of a way to make it do so, and besides, that is also the way to > override the &&-chain precedence. So I think it is not really possible > to get better coverage here. And such cases as these are not really very > interesting (e.g., here we exclude only minor minor setup steps from the > &&-chain). > > -Peff I actually hacked together an ugly script to check for more broken chains. It's not working very well and I haven't checked the output thoroughly but I did see quite a few broken chains. Many are of the (mostly harmless) form: ( echo "100644 $o5 0a" echo "100644 $o0 0c" echo "16 $c1 0d" ) >expected && I'd estimate that there are hundreds of these (see t3030 for examples). I'm not sure if you care about these? As you say they are not really very interesting cases. /Erik -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t7300: fix broken && chains
On Mon, Aug 31, 2015 at 6:58 PM, Junio C Hamanowrote: > > Many of the constructs we see here shows clearly that this is an > ancient part of the codebase ;-), as we would be using the one > parameter form of "git init" and more test_* helpers if we were > writing this script in today's Git codebase. It may have been > better if you didn't do "while we are here" and corrected only the > &&-chain in patch 1/2 and then updated the style of the tests to > take advantage of the newer facilities recent test-lib has in a > separate patch 2/2, but this will do at least for now. > > Will queue. > > Thanks. I can do a re-roll with the chain fix in the first patch and a more thorough modernization of t7300 in separate patches if you'd like? I almost went this way for v1 but decided to limit the scope for the first version. (Forgot to include the list in my first reply, sorry... And forgot to turn off HTML in the second... sigh, sorry for the spam) /Erik -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] t7300: fix broken chains
While we are here, remove some boilerplate by using test_commit. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 16 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 27557d6..86ceb38 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -432,9 +432,7 @@ test_expect_success 'nested git work tree' ' ( cd foo git init - hello.world - git add . - git commit -a -m nested + test_commit nested hello.world ) ( cd bar @@ -443,9 +441,7 @@ test_expect_success 'nested git work tree' ' ( cd baz/boo git init - deeper.world - git add . - git commit -a -m deeply.nested + test_commit deeply.nested deeper.world ) git clean -f -d test -f foo/.git/index @@ -601,9 +597,7 @@ test_expect_success 'force removal of nested git work tree' ' ( cd foo git init - hello.world - git add . - git commit -a -m nested + test_commit nested hello.world ) ( cd bar @@ -612,9 +606,7 @@ test_expect_success 'force removal of nested git work tree' ' ( cd baz/boo git init - deeper.world - git add . - git commit -a -m deeply.nested + test_commit deeply.nested deeper.world ) git clean -f -f -d ! test -d foo -- 2.4.4.410.gd6567e3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v8 1/5] setup: add gentle version of read_gitfile
On Fri, Jun 26, 2015 at 11:03 AM, Jeff King p...@peff.net wrote: I happened to be playing with clang's static analyzer today, and it noticed that there is a subtle use-after-free here. Doh, sorry about that. Thanks for fixing my bug. /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 3/5] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 142 +++ 1 file changed, 142 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..fbfdf2d 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,148 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'should clean things that almost look like git but are not' ' + rm -fr almost_git almost_bare_git almost_submodule + mkdir -p almost_git/.git/objects + mkdir -p almost_git/.git/refs + cat almost_git/.git/HEAD -\EOF + garbage + EOF + cp -r almost_git/.git/ almost_bare_git + mkdir almost_submodule/ + cat almost_submodule/.git -\EOF + garbage + EOF + test_when_finished rm -rf almost_* + ## This will fail due to die(Invalid gitfile format: %s, path); in + ## setup.c:read_gitfile. + git clean -f -d + test_path_is_missing almost_git + test_path_is_missing almost_bare_git + test_path_is_missing almost_submodule +' + +test_expect_success 'should not clean submodules' ' + rm -fr repo to_clean sub1 sub2 + mkdir repo to_clean + ( + cd repo + git init + test_commit msg hello.world + ) + git submodule add ./repo/.git sub1 + git commit -m sub1 + git branch before_sub2 + git submodule add ./repo/.git sub2 + git commit -m sub2 + git checkout before_sub2 + to_clean/should_clean.this + git clean -f -d + test_path_is_file repo/.git/index + test_path_is_file repo/hello.world + test_path_is_file sub1/.git + test_path_is_file sub1/hello.world + test_path_is_file sub2/.git + test_path_is_file sub2/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'should avoid cleaning possible submodules' ' + rm -fr to_clean possible_sub1 + mkdir to_clean possible_sub1 + test_when_finished rm -rf possible_sub* + echo gitdir: foo possible_sub1/.git + possible_sub1/hello.world + chmod 0 possible_sub1/.git + to_clean/should_clean.this + git clean -f -d + test_path_is_file possible_sub1/.git + test_path_is_file possible_sub1/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'nested (empty) git should be kept' ' + rm -fr empty_repo to_clean + git init empty_repo + mkdir to_clean + to_clean/should_clean.this + git clean -f -d + test_path_is_file empty_repo/.git/HEAD + test_path_is_missing to_clean +' + +test_expect_success 'nested bare repositories should be cleaned' ' + rm -fr bare1 bare2 subdir + git init --bare bare1 + git clone --local --bare . bare2 + mkdir subdir + cp -r bare2 subdir/bare3 + git clean -f -d + test_path_is_missing bare1 + test_path_is_missing bare2 + test_path_is_missing subdir +' + +test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git init --bare strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_failure 'nested (non-empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git clone --local --bare . strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr repo + mkdir repo + ( + cd repo + git init + mkdir -p bar/baz + test_commit msg bar/baz/hello.world + ) + git clean -f -d repo/bar/baz + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/bar/ + test_path_is_missing repo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr repo + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/.git/refs + test_path_is_dir repo/.git/objects + test_path_is_dir untracked/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr repo untracked + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git/ + test_path_is_dir repo/.git + test_dir_is_empty repo/.git + test_path_is_dir untracked/ +' + test_expect_success 'force removal
[PATCH v8 4/5] p7300: add performance tests for clean
The tests are run in dry-run mode to avoid having to restore the test directories for each timed iteration. Using dry-run is an acceptable compromise since we are mostly interested in the initial computation of what to clean and not so much in the cleaning it self. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 31 +++ 1 file changed, 31 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..ec94cdd --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,31 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_default_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir + mkdir 500_sub_dirs 10_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 200) + do + cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + git clean -n -q -f -d 10_sub_dirs/ +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + git clean -n -q -f -f -d 10_sub_dirs/ +' + +test_done -- 2.4.3.373.gc496bfb.dirty -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 0/5] Improving performance of git clean
This is v8 of this series. v7 can be found here: http://thread.gmane.org/gmane.comp.version-control.git/271220 Changes in v8: * change name and type of file size limit variable in read_git_file_gently Erik Elfström (5): setup: add gentle version of read_gitfile setup: sanity check file size in read_gitfile_gently t7300: add tests to document behavior of clean and nested git p7300: add performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 30 +-- cache.h | 12 - setup.c | 91 +--- t/perf/p7300-clean.sh | 31 +++ t/t7300-clean.sh | 140 ++ 5 files changed, 280 insertions(+), 24 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.3.373.gc496bfb.dirty -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 2/5] setup: sanity check file size in read_gitfile_gently
read_gitfile_gently will allocate a buffer to fit the entire file that should be read. Add a sanity check of the file size before opening to avoid allocating a potentially huge amount of memory if we come across a large file that someone happened to name .git. The limit is set to a sufficiently unreasonable size that should never be exceeded by a genuine .git file. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 1 + setup.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/cache.h b/cache.h index 25578cb..858d9b3 100644 --- a/cache.h +++ b/cache.h @@ -454,6 +454,7 @@ extern const char *get_git_work_tree(void); #define READ_GITFILE_ERR_INVALID_FORMAT 5 #define READ_GITFILE_ERR_NO_PATH 6 #define READ_GITFILE_ERR_NOT_A_REPO 7 +#define READ_GITFILE_ERR_TOO_LARGE 8 extern const char *read_gitfile_gently(const char *path, int *return_error_code); #define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); diff --git a/setup.c b/setup.c index 4748b63..a03ca94 100644 --- a/setup.c +++ b/setup.c @@ -414,6 +414,7 @@ static void update_linked_gitdir(const char *gitfile, const char *gitdir) */ const char *read_gitfile_gently(const char *path, int *return_error_code) { + const int max_file_size = 1 20; /* 1MB */ int error_code = 0; char *buf = NULL; char *dir = NULL; @@ -430,6 +431,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) error_code = READ_GITFILE_ERR_NOT_A_FILE; goto cleanup_return; } + if (st.st_size max_file_size) { + error_code = READ_GITFILE_ERR_TOO_LARGE; + goto cleanup_return; + } fd = open(path, O_RDONLY); if (fd 0) { error_code = READ_GITFILE_ERR_OPEN_FAILED; @@ -489,6 +494,8 @@ cleanup_return: return NULL; case READ_GITFILE_ERR_OPEN_FAILED: die_errno(Error opening '%s', path); + case READ_GITFILE_ERR_TOO_LARGE: + die(Too large to be a .git file: '%s', path); case READ_GITFILE_ERR_READ_FAILED: die(Error reading %s, path); case READ_GITFILE_ERR_INVALID_FORMAT: -- 2.4.3.373.gc496bfb.dirty -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 1/5] setup: add gentle version of read_gitfile
read_gitfile will die on most error cases. This makes it unsuitable for speculative calls. Extract the core logic and provide a gentle version that returns NULL on failure. The first usecase of the new gentle version will be to probe for submodules during git clean. Helped-by: Junio C Hamano gits...@pobox.com Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 11 - setup.c | 84 ++--- 2 files changed, 75 insertions(+), 20 deletions(-) diff --git a/cache.h b/cache.h index 571c98f..25578cb 100644 --- a/cache.h +++ b/cache.h @@ -446,7 +446,16 @@ extern int get_common_dir(struct strbuf *sb, const char *gitdir); extern const char *get_git_namespace(void); extern const char *strip_namespace(const char *namespaced_ref); extern const char *get_git_work_tree(void); -extern const char *read_gitfile(const char *path); + +#define READ_GITFILE_ERR_STAT_FAILED 1 +#define READ_GITFILE_ERR_NOT_A_FILE 2 +#define READ_GITFILE_ERR_OPEN_FAILED 3 +#define READ_GITFILE_ERR_READ_FAILED 4 +#define READ_GITFILE_ERR_INVALID_FORMAT 5 +#define READ_GITFILE_ERR_NO_PATH 6 +#define READ_GITFILE_ERR_NOT_A_REPO 7 +extern const char *read_gitfile_gently(const char *path, int *return_error_code); +#define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); extern void set_git_work_tree(const char *tree); diff --git a/setup.c b/setup.c index 863ddfd..4748b63 100644 --- a/setup.c +++ b/setup.c @@ -406,35 +406,53 @@ static void update_linked_gitdir(const char *gitfile, const char *gitdir) /* * Try to read the location of the git directory from the .git file, * return path to git directory if found. + * + * On failure, if return_error_code is not NULL, return_error_code + * will be set to an error code and NULL will be returned. If + * return_error_code is NULL the function will die instead (for most + * cases). */ -const char *read_gitfile(const char *path) +const char *read_gitfile_gently(const char *path, int *return_error_code) { - char *buf; - char *dir; + int error_code = 0; + char *buf = NULL; + char *dir = NULL; const char *slash; struct stat st; int fd; ssize_t len; - if (stat(path, st)) - return NULL; - if (!S_ISREG(st.st_mode)) - return NULL; + if (stat(path, st)) { + error_code = READ_GITFILE_ERR_STAT_FAILED; + goto cleanup_return; + } + if (!S_ISREG(st.st_mode)) { + error_code = READ_GITFILE_ERR_NOT_A_FILE; + goto cleanup_return; + } fd = open(path, O_RDONLY); - if (fd 0) - die_errno(Error opening '%s', path); + if (fd 0) { + error_code = READ_GITFILE_ERR_OPEN_FAILED; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); - if (len != st.st_size) - die(Error reading %s, path); + if (len != st.st_size) { + error_code = READ_GITFILE_ERR_READ_FAILED; + goto cleanup_return; + } buf[len] = '\0'; - if (!starts_with(buf, gitdir: )) - die(Invalid gitfile format: %s, path); + if (!starts_with(buf, gitdir: )) { + error_code = READ_GITFILE_ERR_INVALID_FORMAT; + goto cleanup_return; + } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; - if (len 9) - die(No path in gitfile: %s, path); + if (len 9) { + error_code = READ_GITFILE_ERR_NO_PATH; + goto cleanup_return; + } buf[len] = '\0'; dir = buf + 8; @@ -448,14 +466,42 @@ const char *read_gitfile(const char *path) free(buf); buf = dir; } - - if (!is_git_directory(dir)) - die(Not a git repository: %s, dir); - + if (!is_git_directory(dir)) { + error_code = READ_GITFILE_ERR_NOT_A_REPO; + goto cleanup_return; + } update_linked_gitdir(path, dir); path = real_path(dir); +cleanup_return: free(buf); + + if (return_error_code) + *return_error_code = error_code; + + if (error_code) { + if (return_error_code) + return NULL; + + switch (error_code) { + case READ_GITFILE_ERR_STAT_FAILED: + case READ_GITFILE_ERR_NOT_A_FILE: + return NULL; + case READ_GITFILE_ERR_OPEN_FAILED: + die_errno(Error opening '%s', path); + case READ_GITFILE_ERR_READ_FAILED: + die(Error reading %s, path); + case READ_GITFILE_ERR_INVALID_FORMAT
[PATCH v7 1/5] setup: add gentle version of read_gitfile
read_gitfile will die on most error cases. This makes it unsuitable for speculative calls. Extract the core logic and provide a gentle version that returns NULL on failure. The first usecase of the new gentle version will be to probe for submodules during git clean. Helped-by: Junio C Hamano gits...@pobox.com Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 11 - setup.c | 84 ++--- 2 files changed, 75 insertions(+), 20 deletions(-) diff --git a/cache.h b/cache.h index 571c98f..25578cb 100644 --- a/cache.h +++ b/cache.h @@ -446,7 +446,16 @@ extern int get_common_dir(struct strbuf *sb, const char *gitdir); extern const char *get_git_namespace(void); extern const char *strip_namespace(const char *namespaced_ref); extern const char *get_git_work_tree(void); -extern const char *read_gitfile(const char *path); + +#define READ_GITFILE_ERR_STAT_FAILED 1 +#define READ_GITFILE_ERR_NOT_A_FILE 2 +#define READ_GITFILE_ERR_OPEN_FAILED 3 +#define READ_GITFILE_ERR_READ_FAILED 4 +#define READ_GITFILE_ERR_INVALID_FORMAT 5 +#define READ_GITFILE_ERR_NO_PATH 6 +#define READ_GITFILE_ERR_NOT_A_REPO 7 +extern const char *read_gitfile_gently(const char *path, int *return_error_code); +#define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); extern void set_git_work_tree(const char *tree); diff --git a/setup.c b/setup.c index 863ddfd..4748b63 100644 --- a/setup.c +++ b/setup.c @@ -406,35 +406,53 @@ static void update_linked_gitdir(const char *gitfile, const char *gitdir) /* * Try to read the location of the git directory from the .git file, * return path to git directory if found. + * + * On failure, if return_error_code is not NULL, return_error_code + * will be set to an error code and NULL will be returned. If + * return_error_code is NULL the function will die instead (for most + * cases). */ -const char *read_gitfile(const char *path) +const char *read_gitfile_gently(const char *path, int *return_error_code) { - char *buf; - char *dir; + int error_code = 0; + char *buf = NULL; + char *dir = NULL; const char *slash; struct stat st; int fd; ssize_t len; - if (stat(path, st)) - return NULL; - if (!S_ISREG(st.st_mode)) - return NULL; + if (stat(path, st)) { + error_code = READ_GITFILE_ERR_STAT_FAILED; + goto cleanup_return; + } + if (!S_ISREG(st.st_mode)) { + error_code = READ_GITFILE_ERR_NOT_A_FILE; + goto cleanup_return; + } fd = open(path, O_RDONLY); - if (fd 0) - die_errno(Error opening '%s', path); + if (fd 0) { + error_code = READ_GITFILE_ERR_OPEN_FAILED; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); - if (len != st.st_size) - die(Error reading %s, path); + if (len != st.st_size) { + error_code = READ_GITFILE_ERR_READ_FAILED; + goto cleanup_return; + } buf[len] = '\0'; - if (!starts_with(buf, gitdir: )) - die(Invalid gitfile format: %s, path); + if (!starts_with(buf, gitdir: )) { + error_code = READ_GITFILE_ERR_INVALID_FORMAT; + goto cleanup_return; + } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; - if (len 9) - die(No path in gitfile: %s, path); + if (len 9) { + error_code = READ_GITFILE_ERR_NO_PATH; + goto cleanup_return; + } buf[len] = '\0'; dir = buf + 8; @@ -448,14 +466,42 @@ const char *read_gitfile(const char *path) free(buf); buf = dir; } - - if (!is_git_directory(dir)) - die(Not a git repository: %s, dir); - + if (!is_git_directory(dir)) { + error_code = READ_GITFILE_ERR_NOT_A_REPO; + goto cleanup_return; + } update_linked_gitdir(path, dir); path = real_path(dir); +cleanup_return: free(buf); + + if (return_error_code) + *return_error_code = error_code; + + if (error_code) { + if (return_error_code) + return NULL; + + switch (error_code) { + case READ_GITFILE_ERR_STAT_FAILED: + case READ_GITFILE_ERR_NOT_A_FILE: + return NULL; + case READ_GITFILE_ERR_OPEN_FAILED: + die_errno(Error opening '%s', path); + case READ_GITFILE_ERR_READ_FAILED: + die(Error reading %s, path); + case READ_GITFILE_ERR_INVALID_FORMAT
[PATCH v7 3/5] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 142 +++ 1 file changed, 142 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..fbfdf2d 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,148 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'should clean things that almost look like git but are not' ' + rm -fr almost_git almost_bare_git almost_submodule + mkdir -p almost_git/.git/objects + mkdir -p almost_git/.git/refs + cat almost_git/.git/HEAD -\EOF + garbage + EOF + cp -r almost_git/.git/ almost_bare_git + mkdir almost_submodule/ + cat almost_submodule/.git -\EOF + garbage + EOF + test_when_finished rm -rf almost_* + ## This will fail due to die(Invalid gitfile format: %s, path); in + ## setup.c:read_gitfile. + git clean -f -d + test_path_is_missing almost_git + test_path_is_missing almost_bare_git + test_path_is_missing almost_submodule +' + +test_expect_success 'should not clean submodules' ' + rm -fr repo to_clean sub1 sub2 + mkdir repo to_clean + ( + cd repo + git init + test_commit msg hello.world + ) + git submodule add ./repo/.git sub1 + git commit -m sub1 + git branch before_sub2 + git submodule add ./repo/.git sub2 + git commit -m sub2 + git checkout before_sub2 + to_clean/should_clean.this + git clean -f -d + test_path_is_file repo/.git/index + test_path_is_file repo/hello.world + test_path_is_file sub1/.git + test_path_is_file sub1/hello.world + test_path_is_file sub2/.git + test_path_is_file sub2/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'should avoid cleaning possible submodules' ' + rm -fr to_clean possible_sub1 + mkdir to_clean possible_sub1 + test_when_finished rm -rf possible_sub* + echo gitdir: foo possible_sub1/.git + possible_sub1/hello.world + chmod 0 possible_sub1/.git + to_clean/should_clean.this + git clean -f -d + test_path_is_file possible_sub1/.git + test_path_is_file possible_sub1/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'nested (empty) git should be kept' ' + rm -fr empty_repo to_clean + git init empty_repo + mkdir to_clean + to_clean/should_clean.this + git clean -f -d + test_path_is_file empty_repo/.git/HEAD + test_path_is_missing to_clean +' + +test_expect_success 'nested bare repositories should be cleaned' ' + rm -fr bare1 bare2 subdir + git init --bare bare1 + git clone --local --bare . bare2 + mkdir subdir + cp -r bare2 subdir/bare3 + git clean -f -d + test_path_is_missing bare1 + test_path_is_missing bare2 + test_path_is_missing subdir +' + +test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git init --bare strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_failure 'nested (non-empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git clone --local --bare . strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr repo + mkdir repo + ( + cd repo + git init + mkdir -p bar/baz + test_commit msg bar/baz/hello.world + ) + git clean -f -d repo/bar/baz + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/bar/ + test_path_is_missing repo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr repo + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/.git/refs + test_path_is_dir repo/.git/objects + test_path_is_dir untracked/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr repo untracked + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git/ + test_path_is_dir repo/.git + test_dir_is_empty repo/.git + test_path_is_dir untracked/ +' + test_expect_success 'force removal
[PATCH v7 2/5] setup: sanity check file size in read_gitfile_gently
read_gitfile_gently will allocate a buffer to fit the entire file that should be read. Add a sanity check of the file size before opening to avoid allocating a potentially huge amount of memory if we come across a large file that someone happened to name .git. The limit is set to a sufficiently unreasonable size that should never be exceeded by a genuine .git file. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 1 + setup.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/cache.h b/cache.h index 25578cb..858d9b3 100644 --- a/cache.h +++ b/cache.h @@ -454,6 +454,7 @@ extern const char *get_git_work_tree(void); #define READ_GITFILE_ERR_INVALID_FORMAT 5 #define READ_GITFILE_ERR_NO_PATH 6 #define READ_GITFILE_ERR_NOT_A_REPO 7 +#define READ_GITFILE_ERR_TOO_LARGE 8 extern const char *read_gitfile_gently(const char *path, int *return_error_code); #define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); diff --git a/setup.c b/setup.c index 4748b63..e76955e 100644 --- a/setup.c +++ b/setup.c @@ -414,6 +414,7 @@ static void update_linked_gitdir(const char *gitfile, const char *gitdir) */ const char *read_gitfile_gently(const char *path, int *return_error_code) { + static const int one_MB = 1 20; int error_code = 0; char *buf = NULL; char *dir = NULL; @@ -430,6 +431,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) error_code = READ_GITFILE_ERR_NOT_A_FILE; goto cleanup_return; } + if (st.st_size one_MB) { + error_code = READ_GITFILE_ERR_TOO_LARGE; + goto cleanup_return; + } fd = open(path, O_RDONLY); if (fd 0) { error_code = READ_GITFILE_ERR_OPEN_FAILED; @@ -489,6 +494,8 @@ cleanup_return: return NULL; case READ_GITFILE_ERR_OPEN_FAILED: die_errno(Error opening '%s', path); + case READ_GITFILE_ERR_TOO_LARGE: + die(Too large to be a .git file: '%s', path); case READ_GITFILE_ERR_READ_FAILED: die(Error reading %s, path); case READ_GITFILE_ERR_INVALID_FORMAT: -- 2.4.3.373.gc496bfb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 4/5] p7300: add performance tests for clean
The tests are run in dry-run mode to avoid having to restore the test directories for each timed iteration. Using dry-run is an acceptable compromise since we are mostly interested in the initial computation of what to clean and not so much in the cleaning it self. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 31 +++ 1 file changed, 31 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..ec94cdd --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,31 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_default_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir + mkdir 500_sub_dirs 10_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 200) + do + cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + git clean -n -q -f -d 10_sub_dirs/ +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + git clean -n -q -f -f -d 10_sub_dirs/ +' + +test_done -- 2.4.3.373.gc496bfb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 0/5] Improving performance of git clean
Here is a reroll of this series (after much delay). Changes in v7: * changed order of file size and file open error check in read_gitfile * resolved conflicts with nd/multiple-work-trees. This removed the need for is_git_directory_gently that was added in v6 and simplified some error cases. Erik Elfström (5): setup: add gentle version of read_gitfile setup: sanity check file size in read_gitfile_gently t7300: add tests to document behavior of clean and nested git p7300: add performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 30 +-- cache.h | 12 - setup.c | 91 +--- t/perf/p7300-clean.sh | 31 +++ t/t7300-clean.sh | 140 ++ 5 files changed, 280 insertions(+), 24 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.3.373.gc496bfb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7 5/5] clean: improve performance when removing lots of directories
git clean uses resolve_gitlink_ref() to check for the presence of nested git repositories, but it has the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list causes a massive performance hit for large number of directories. Modify clean.c:remove_dirs to use setup.c:is_git_directory and setup.c:read_gitfile_gently instead. Both these functions will open files and parse contents when they find something that looks like a git repository. This is ok from a performance standpoint since finding repository candidates should be comparatively rare. Using is_git_directory and read_gitfile_gently should give a more standardized check for what is and what isn't a git repository but also gives three behavioral changes. The first change is that we will now detect and avoid cleaning empty nested git repositories (only init run). This is desirable. Second, we will no longer die when cleaning a file named .git with garbage content (it will be cleaned instead). This is also desirable. The last change is that we will detect and avoid cleaning empty bare repositories that have been placed in a directory named .git. This is not desirable but should have no real user impact since we already fail to clean non-empty bare repositories in the same scenario. This is thus deemed acceptable. On top of this we add some extra precautions. If read_gitfile_gently fails to open the git file, read the git file or verify the path in the git file we assume that the path with the git file is a valid repository and avoid cleaning. Update t7300 to reflect these changes in behavior. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- builtin/clean.c | 30 ++ t/t7300-clean.sh | 10 -- 2 files changed, 30 insertions(+), 10 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 6dcb72e..df53def 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,31 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +/* + * Return 1 if the given path is the root of a git repository or + * submodule else 0. Will not return 1 for bare repositories with the + * exception of creating a bare repository in foo/.git and calling + * is_git_repository(foo). + */ +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + int gitfile_error; + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (read_gitfile_gently(path-buf, gitfile_error) || is_git_directory(path-buf)) + ret = 1; + if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED || + gitfile_error == READ_GITFILE_ERR_READ_FAILED) + ret = 1; /* This could be a real .git file, take the + * safe option and avoid cleaning */ + strbuf_setlen(path, orig_path_len); + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +179,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index fbfdf2d..32e96da 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'should clean things that almost look like git but are not' ' +test_expect_success 'should clean things that almost look like git but are not' ' rm -fr almost_git almost_bare_git almost_submodule mkdir -p almost_git/.git/objects mkdir -p almost_git/.git/refs @@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look like git but are not' garbage EOF test_when_finished rm
[PATCH v6 6/7] clean: improve performance when removing lots of directories
git clean uses resolve_gitlink_ref() to check for the presence of nested git repositories, but it has the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list causes a massive performance hit for large number of directories. Modify clean.c:remove_dirs to use setup.c:is_git_directory and setup.c:read_gitfile_gently instead. Both these functions will open files and parse contents when they find something that looks like a git repository. This is ok from a performance standpoint since finding repository candidates should be comparatively rare. Using is_git_directory and read_gitfile_gently should give a more standardized check for what is and what isn't a git repository but also gives three behavioral changes. The first change is that we will now detect and avoid cleaning empty nested git repositories (only init run). This is desirable. Second, we will no longer die when cleaning a file named .git with garbage content (it will be cleaned instead). This is also desirable. The last change is that we will detect and avoid cleaning empty bare repositories that have been placed in a directory named .git. This is not desirable but should have no real user impact since we already fail to clean non-empty bare repositories in the same scenario. This is thus deemed acceptable. On top of this we add some extra precautions. If read_gitfile_gently fails to open the git file, read the git file or verify the path in the git file we assume that the path with the git file is a valid repository and avoid cleaning. Update t7300 to reflect these changes in behavior. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- builtin/clean.c | 31 +++ t/t7300-clean.sh | 10 -- 2 files changed, 31 insertions(+), 10 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..d739dcf 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,32 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +/* + * Return 1 if the given path is the root of a git repository or + * submodule else 0. Will not return 1 for bare repositories with the + * exception of creating a bare repository in foo/.git and calling + * is_git_repository(foo). + */ +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + int gitfile_error; + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (read_gitfile_gently(path-buf, gitfile_error) || is_git_directory(path-buf)) + ret = 1; + if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED || + gitfile_error == READ_GITFILE_ERR_READ_FAILED || + gitfile_error == READ_GITFILE_ERR_CANT_VERIFY_PATH) + ret = 1; /* This could be a real .git file, take the + * safe option and avoid cleaning */ + strbuf_setlen(path, orig_path_len); + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +180,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 23962e4..fbab888 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'should clean things that almost look like git but are not' ' +test_expect_success 'should clean things that almost look like git but are not' ' rm -fr almost_git almost_bare_git almost_submodule mkdir -p almost_git/.git/objects mkdir -p almost_git/.git/refs @@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look like git
Re: [PATCH v5 1/5] setup: add gentle version of read_gitfile
On Tue, Apr 28, 2015 at 8:17 AM, Jeff King p...@peff.net wrote: There was a discussion not too long ago on strategies for returning errors, and one of the suggestions was to return an error strbuf rather than a code[1]. That's less flexible, as the caller can't react differently based on the type of error. But for cases like this, where the only fate for the code is to get converted back into a message, it can reduce the boilerplate. What you have here is OK to me, and I don't want to hold up your patch series in a flamewar about error-reporting techniques. But I think it's an interesting case study. -Peff Thanks. I haven't had time to look through that thread yet, I'll try to get to that later. My initial reaction is a bit skeptical though. For this case we currently don't want any error reporting, the NULL return is sufficient and even allocating and sending in the int* is pure noise. Allocating and releasing a strbuf seems like a lot more overhead for this type of caller? The one other potential candidate caller for read_gitfile_gently that I have seen (clone.c:get_repo_path) don't want any error code or message either as far as i can tell. Also if it turns out that we actually need to treat the file too large error differently in clean (as discussed in thread on the file size check) then we can no longer communicate that back using the strbuf interface. Overall it seems like a less attractive solution to me but I can be persuaded if there is more support expressed for the strbuf direction. /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 5/5] clean: improve performance when removing lots of directories
On Tue, Apr 28, 2015 at 8:24 AM, Jeff King p...@peff.net wrote: This iteration looks reasonable overall to me. Should this is_git_repository() helper be available to other files? I think there are other calls to resolve_gitlink_ref() that would want the same treatment (e.g., I think git status may have a similar issue). -Peff Yes that is probably the direction to go in but I think the current version is too tailored to the clean case to be generally useful (although I haven't check the status case). Right now this is more of a is_this_a_repo_clean_should_care_about_ignoring_some_cornercase_quirks than a true is_git_repository check. I would think a general version would look more like this: static int is_git_repository(struct strbuf *path) { if (is_bare_repository(path)) return BARE_REPO; if (is_submodule(path)) return SUBMODULE; if (is_git_director(path)) return REPO; return 0; } probably also communicating any errors back to the caller. To sum it up I think you are correct in the direction but I suspect that the current version is not sufficient for the general case and that it would be best to leave the generalization and making it public for future work. /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] p7300: add performance tests for clean
On Tue, Apr 28, 2015 at 8:33 AM, Jeff King p...@peff.net wrote: Do we actually need a large repo here? The real cost is coming from the directories we create. We could actually start with a totally empty repository if we wanted (though I don't think the t/perf system handles that right now). But if there's not a reason to use the large repo, I think using test_perf_default_repo is better, as it works out of the box without specifying extra environment variables (well, it works either way, but you get a nasty warning from perf-lib.sh). -Peff Sure, I'll change this to the default. I wasn't sure about the performance characteristics of clean (does working tree size matter?) when I started so it felt safest to go with a large repo. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/5] setup: sanity check file size in read_gitfile_gently
On Tue, Apr 28, 2015 at 8:02 AM, Jeff King p...@peff.net wrote: My understanding is that PATH_MAX is set absurdly low on Windows systems (and doesn't actually represent the real limit of a path!). Since the value is picked arbitrarily anyway, could use something more independent (like 100K or something, which is large enough to be beyond absurd and small enough that a malloc isn't a big deal)? -Peff I'm happy to set the limit to anything that makes everybody feel safe. I'll set it to 1MB to be on the safe side. I'm not sure though how the code (in general) is supposed to keep working if a path can exceed PATH_MAX? A cursory search for PATH_MAX comes up with char array sizes and check-and-die kind of things. If a path is longer then surely we will be unable to handle it and abort in all sorts of places? Are you only worried we might have a submodule with a too long path (that will create various other problems in different codepaths) that we may mistakenly clean (if it doesn't trigger any other abort earlier in the clean call chain) or do you want clean to keep working and do the right thing even in this case? While digging around looking at this I also noticed that there is another problem I have overlooked previously. read_gitfile_gently will call is_git_directory at the very end and it contains the following check at the very beginning: if (PATH_MAX = len + strlen(/objects)) die(Too long path: %.*s, 60, suspect); Now, this is good in the way that we will avoid mistakenly cleaning stuff because the path is too long but also bad because it makes read_gitfile_gently behave very ungently in this case. I suspect I should make a gentle version of this also. The question is what to do in clean if the path is reported as too long? Abort? Avoid cleaning it to be safe? Ignore and clean it? is_git_directory is also called from the new is_git_repository directly but here I think dying is ok since this path is a path in the working tree and if we can't handle the paths in the tree then there seem to be little point in trying to go on (as opposed to when some string in a file is too large for a path) Thoughts? /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 2/5] setup: sanity check file size in read_gitfile_gently
read_gitfile_gently will allocate a buffer to fit the entire file that should be read. Add a sanity check of the file size before opening to avoid allocating a potentially huge amount of memory if we come across a large file that someone happened to name .git. The limit is set to a sufficiently unreasonable size that should never be exceeded by a genuine .git file. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 1 + setup.c | 6 ++ 2 files changed, 7 insertions(+) diff --git a/cache.h b/cache.h index 868e4d3..c9f1f8e 100644 --- a/cache.h +++ b/cache.h @@ -439,6 +439,7 @@ extern const char *get_git_work_tree(void); #define READ_GITFILE_ERR_INVALID_FORMAT 5 #define READ_GITFILE_ERR_NO_PATH 6 #define READ_GITFILE_ERR_NOT_A_REPO 7 +#define READ_GITFILE_ERR_TOO_LARGE 8 extern const char *read_gitfile_gently(const char *path, int *return_error_code); #define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); diff --git a/setup.c b/setup.c index c4538ca..792c37b 100644 --- a/setup.c +++ b/setup.c @@ -364,6 +364,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) error_code = READ_GITFILE_ERR_OPEN_FAILED; goto cleanup_return; } + if (st.st_size PATH_MAX * 4) { + error_code = READ_GITFILE_ERR_TOO_LARGE; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); @@ -418,6 +422,8 @@ cleanup_return: return NULL; case READ_GITFILE_ERR_OPEN_FAILED: die_errno(Error opening '%s', path); + case READ_GITFILE_ERR_TOO_LARGE: + die(Too large to be a .git file: '%s', path); case READ_GITFILE_ERR_READ_FAILED: die(Error reading %s, path); case READ_GITFILE_ERR_INVALID_FORMAT: -- 2.4.0.rc3.8.gbb31afb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 4/5] p7300: add performance tests for clean
The tests are run in dry-run mode to avoid having to restore the test directories for each timed iteration. Using dry-run is an acceptable compromise since we are mostly interested in the initial computation of what to clean and not so much in the cleaning it self. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 31 +++ 1 file changed, 31 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..6ae55ec --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,31 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir + mkdir 500_sub_dirs 10_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 200) + do + cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + git clean -n -q -f -d 10_sub_dirs/ +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + git clean -n -q -f -f -d 10_sub_dirs/ +' + +test_done -- 2.4.0.rc3.8.gbb31afb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 3/5] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 128 +++ 1 file changed, 128 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..11f3a6d 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,134 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'should clean things that almost look like git but are not' ' + rm -fr almost_git almost_bare_git almost_submodule + mkdir -p almost_git/.git/objects + mkdir -p almost_git/.git/refs + cat almost_git/.git/HEAD -\EOF + garbage + EOF + cp -r almost_git/.git/ almost_bare_git + mkdir almost_submodule/ + cat almost_submodule/.git -\EOF + garbage + EOF + test_when_finished rm -rf almost_* + ## This will fail due to die(Invalid gitfile format: %s, path); in + ## setup.c:read_gitfile. + git clean -f -d + test_path_is_missing almost_git + test_path_is_missing almost_bare_git + test_path_is_missing almost_submodule +' + +test_expect_success 'should not clean submodules' ' + rm -fr repo to_clean sub1 sub2 + mkdir repo to_clean + ( + cd repo + git init + test_commit msg hello.world + ) + git submodule add ./repo/.git sub1 + git commit -m sub1 + git branch before_sub2 + git submodule add ./repo/.git sub2 + git commit -m sub2 + git checkout before_sub2 + to_clean/should_clean.this + git clean -f -d + test_path_is_file repo/.git/index + test_path_is_file repo/hello.world + test_path_is_file sub1/.git + test_path_is_file sub1/hello.world + test_path_is_file sub2/.git + test_path_is_file sub2/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'nested (empty) git should be kept' ' + rm -fr empty_repo to_clean + git init empty_repo + mkdir to_clean + to_clean/should_clean.this + git clean -f -d + test_path_is_file empty_repo/.git/HEAD + test_path_is_missing to_clean +' + +test_expect_success 'nested bare repositories should be cleaned' ' + rm -fr bare1 bare2 subdir + git init --bare bare1 + git clone --local --bare . bare2 + mkdir subdir + cp -r bare2 subdir/bare3 + git clean -f -d + test_path_is_missing bare1 + test_path_is_missing bare2 + test_path_is_missing subdir +' + +test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git init --bare strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_failure 'nested (non-empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git clone --local --bare . strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr repo + mkdir repo + ( + cd repo + git init + mkdir -p bar/baz + test_commit msg bar/baz/hello.world + ) + git clean -f -d repo/bar/baz + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/bar/ + test_path_is_missing repo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr repo + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/.git/refs + test_path_is_dir repo/.git/objects + test_path_is_dir untracked/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr repo untracked + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git/ + test_path_is_dir repo/.git + test_dir_is_empty repo/.git + test_path_is_dir untracked/ +' + test_expect_success 'force removal of nested git work tree' ' rm -fr foo bar baz mkdir -p foo bar baz/boo -- 2.4.0.rc3.8.gbb31afb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 1/5] setup: add gentle version of read_gitfile
read_gitfile will die on most error cases. This makes it unsuitable for speculative calls. Extract the core logic and provide a gentle version that returns NULL on failure. The first usecase of the new gentle version will be to probe for submodules during git clean. Helped-by: Junio C Hamano gits...@pobox.com Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 11 - setup.c | 82 +++-- 2 files changed, 75 insertions(+), 18 deletions(-) diff --git a/cache.h b/cache.h index 3d3244b..868e4d3 100644 --- a/cache.h +++ b/cache.h @@ -431,7 +431,16 @@ extern int set_git_dir(const char *path); extern const char *get_git_namespace(void); extern const char *strip_namespace(const char *namespaced_ref); extern const char *get_git_work_tree(void); -extern const char *read_gitfile(const char *path); + +#define READ_GITFILE_ERR_STAT_FAILED 1 +#define READ_GITFILE_ERR_NOT_A_FILE 2 +#define READ_GITFILE_ERR_OPEN_FAILED 3 +#define READ_GITFILE_ERR_READ_FAILED 4 +#define READ_GITFILE_ERR_INVALID_FORMAT 5 +#define READ_GITFILE_ERR_NO_PATH 6 +#define READ_GITFILE_ERR_NOT_A_REPO 7 +extern const char *read_gitfile_gently(const char *path, int *return_error_code); +#define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); extern void set_git_work_tree(const char *tree); diff --git a/setup.c b/setup.c index 979b13f..c4538ca 100644 --- a/setup.c +++ b/setup.c @@ -335,35 +335,53 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok) /* * Try to read the location of the git directory from the .git file, * return path to git directory if found. + * + * On failure, if return_error_code is not NULL, return_error_code + * will be set to an error code and NULL will be returned. If + * return_error_code is NULL the function will die instead (for most + * cases). */ -const char *read_gitfile(const char *path) +const char *read_gitfile_gently(const char *path, int *return_error_code) { - char *buf; - char *dir; + int error_code = 0; + char *buf = NULL; + char *dir = NULL; const char *slash; struct stat st; int fd; ssize_t len; - if (stat(path, st)) - return NULL; - if (!S_ISREG(st.st_mode)) - return NULL; + if (stat(path, st)) { + error_code = READ_GITFILE_ERR_STAT_FAILED; + goto cleanup_return; + } + if (!S_ISREG(st.st_mode)) { + error_code = READ_GITFILE_ERR_NOT_A_FILE; + goto cleanup_return; + } fd = open(path, O_RDONLY); - if (fd 0) - die_errno(Error opening '%s', path); + if (fd 0) { + error_code = READ_GITFILE_ERR_OPEN_FAILED; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); - if (len != st.st_size) - die(Error reading %s, path); + if (len != st.st_size) { + error_code = READ_GITFILE_ERR_READ_FAILED; + goto cleanup_return; + } buf[len] = '\0'; - if (!starts_with(buf, gitdir: )) - die(Invalid gitfile format: %s, path); + if (!starts_with(buf, gitdir: )) { + error_code = READ_GITFILE_ERR_INVALID_FORMAT; + goto cleanup_return; + } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; - if (len 9) - die(No path in gitfile: %s, path); + if (len 9) { + error_code = READ_GITFILE_ERR_NO_PATH; + goto cleanup_return; + } buf[len] = '\0'; dir = buf + 8; @@ -378,11 +396,41 @@ const char *read_gitfile(const char *path) buf = dir; } - if (!is_git_directory(dir)) - die(Not a git repository: %s, dir); + if (!is_git_directory(dir)) { + error_code = READ_GITFILE_ERR_NOT_A_REPO; + goto cleanup_return; + } path = real_path(dir); +cleanup_return: free(buf); + + if (return_error_code) + *return_error_code = error_code; + + if (error_code) { + if (return_error_code) + return NULL; + + switch (error_code) { + case READ_GITFILE_ERR_STAT_FAILED: + case READ_GITFILE_ERR_NOT_A_FILE: + return NULL; + case READ_GITFILE_ERR_OPEN_FAILED: + die_errno(Error opening '%s', path); + case READ_GITFILE_ERR_READ_FAILED: + die(Error reading %s, path); + case READ_GITFILE_ERR_INVALID_FORMAT: + die(Invalid gitfile format: %s, path); + case
[PATCH v5 5/5] clean: improve performance when removing lots of directories
git clean uses resolve_gitlink_ref() to check for the presence of nested git repositories, but it has the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list causes a massive performance hit for large number of directories. Modify clean.c:remove_dirs to use setup.c:is_git_directory and setup.c:read_gitfile_gently instead. Both these functions will open files and parse contents when they find something that looks like a git repository. This is ok from a performance standpoint since finding repository candidates should be comparatively rare. Using is_git_directory and read_gitfile_gently should give a more standardized check for what is and what isn't a git repository but also gives three behavioral changes. The first change is that we will now detect and avoid cleaning empty nested git repositories (only init run). This is desirable. Second, we will no longer die when cleaning a file named .git with garbage content (it will be cleaned instead). This is also desirable. The last change is that we will detect and avoid cleaning empty bare repositories that have been placed in a directory named .git. This is not desirable but should have no real user impact since we already fail to clean non-empty bare repositories in the same scenario. This is thus deemed acceptable. Update t7300 to reflect these changes in behavior. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- builtin/clean.c | 26 ++ t/t7300-clean.sh | 8 +++- 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..17a1c13 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,27 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +/* + * Return 1 if the given path is the root of a git repository or + * submodule else 0. Will not return 1 for bare repositories with the + * exception of creating a bare repository in foo/.git and calling + * is_git_repository(foo). + */ +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + int unused_error_code; + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (read_gitfile_gently(path-buf, unused_error_code) || is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +175,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 11f3a6d..4d07a4a 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'should clean things that almost look like git but are not' ' +test_expect_success 'should clean things that almost look like git but are not' ' rm -fr almost_git almost_bare_git almost_submodule mkdir -p almost_git/.git/objects mkdir -p almost_git/.git/refs @@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look like git but are not' garbage EOF test_when_finished rm -rf almost_* - ## This will fail due to die(Invalid gitfile format: %s, path); in - ## setup.c:read_gitfile. git clean -f -d test_path_is_missing almost_git test_path_is_missing almost_bare_git @@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' ' test_path_is_missing to_clean ' -test_expect_failure 'nested (empty) git should be kept' ' +test_expect_success 'nested (empty) git should be kept' ' rm -fr
[PATCH v5 0/5] Improving performance of git clean
Changes in v5: * Added defines for read_gitfile_gently error codes. This was a silly mistake, sorry about that. Erik Elfström (5): setup: add gentle version of read_gitfile setup: sanity check file size in read_gitfile_gently t7300: add tests to document behavior of clean and nested git p7300: add performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 26 +-- cache.h | 12 - setup.c | 88 --- t/perf/p7300-clean.sh | 31 + t/t7300-clean.sh | 126 ++ 5 files changed, 261 insertions(+), 22 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.0.rc3.8.gbb31afb -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 2/5] setup: sanity check file size in read_gitfile_gently
On Sat, Apr 25, 2015 at 6:47 PM, Junio C Hamano gits...@pobox.com wrote: I do not think it is wrong per-se, but the changes in this patch shows why hardcoded values assigned to error_code without #define is not a good idea, as these values are now exposed to the callers of the new function. After we gain a new caller that does care about the exact error code (e.g. to react differently to the reason why we failed to read by giving different error messages) if we decide to revert this step, or if we decide to add a new error type, for whatever reason, this organization forces the caller to be updated. Yes, it was a bit silly of me not to add that. Would something like this be ok? --- cache.h | 9 + setup.c | 32 2 files changed, 25 insertions(+), 16 deletions(-) diff --git a/cache.h b/cache.h index 6e29068..177337a 100644 --- a/cache.h +++ b/cache.h @@ -431,6 +431,15 @@ extern int set_git_dir(const char *path); extern const char *get_git_namespace(void); extern const char *strip_namespace(const char *namespaced_ref); extern const char *get_git_work_tree(void); + +#define READ_GITFILE_ERR_STAT_FAILED 1 +#define READ_GITFILE_ERR_NOT_A_FILE 2 +#define READ_GITFILE_ERR_OPEN_FAILED 3 +#define READ_GITFILE_ERR_TOO_LARGE 4 +#define READ_GITFILE_ERR_READ_FAILED 5 +#define READ_GITFILE_ERR_INVALID_FORMAT 6 +#define READ_GITFILE_ERR_NO_PATH 7 +#define READ_GITFILE_ERR_NOT_A_REPO 8 extern const char *read_gitfile_gently(const char *path, int *return_error_code); #define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); diff --git a/setup.c b/setup.c index ed87334..792c37b 100644 --- a/setup.c +++ b/setup.c @@ -352,38 +352,38 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) ssize_t len; if (stat(path, st)) { - error_code = 1; + error_code = READ_GITFILE_ERR_STAT_FAILED; goto cleanup_return; } if (!S_ISREG(st.st_mode)) { - error_code = 2; + error_code = READ_GITFILE_ERR_NOT_A_FILE; goto cleanup_return; } fd = open(path, O_RDONLY); if (fd 0) { - error_code = 3; + error_code = READ_GITFILE_ERR_OPEN_FAILED; goto cleanup_return; } if (st.st_size PATH_MAX * 4) { - error_code = 4; + error_code = READ_GITFILE_ERR_TOO_LARGE; goto cleanup_return; } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); if (len != st.st_size) { - error_code = 5; + error_code = READ_GITFILE_ERR_READ_FAILED; goto cleanup_return; } buf[len] = '\0'; if (!starts_with(buf, gitdir: )) { - error_code = 6; + error_code = READ_GITFILE_ERR_INVALID_FORMAT; goto cleanup_return; } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; if (len 9) { - error_code = 7; + error_code = READ_GITFILE_ERR_NO_PATH; goto cleanup_return; } buf[len] = '\0'; @@ -401,7 +401,7 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) } if (!is_git_directory(dir)) { - error_code = 8; + error_code = READ_GITFILE_ERR_NOT_A_REPO; goto cleanup_return; } path = real_path(dir); @@ -417,20 +417,20 @@ cleanup_return: return NULL; switch (error_code) { - case 1: // failed to stat - case 2: // not regular file + case READ_GITFILE_ERR_STAT_FAILED: + case READ_GITFILE_ERR_NOT_A_FILE: return NULL; - case 3: + case READ_GITFILE_ERR_OPEN_FAILED: die_errno(Error opening '%s', path); - case 4: + case READ_GITFILE_ERR_TOO_LARGE: die(Too large to be a .git file: '%s', path); - case 5: + case READ_GITFILE_ERR_READ_FAILED: die(Error reading %s, path); - case 6: + case READ_GITFILE_ERR_INVALID_FORMAT: die(Invalid gitfile format: %s, path); - case 7: + case READ_GITFILE_ERR_NO_PATH: die(No path in gitfile: %s, path); - case 8: + case READ_GITFILE_ERR_NOT_A_REPO: die(Not a git repository: %s, dir); default: assert(0); -- 2.4.0.rc3.8.g86acfbd.dirty -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org
[PATCH v4 3/5] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 128 +++ 1 file changed, 128 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..11f3a6d 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,134 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'should clean things that almost look like git but are not' ' + rm -fr almost_git almost_bare_git almost_submodule + mkdir -p almost_git/.git/objects + mkdir -p almost_git/.git/refs + cat almost_git/.git/HEAD -\EOF + garbage + EOF + cp -r almost_git/.git/ almost_bare_git + mkdir almost_submodule/ + cat almost_submodule/.git -\EOF + garbage + EOF + test_when_finished rm -rf almost_* + ## This will fail due to die(Invalid gitfile format: %s, path); in + ## setup.c:read_gitfile. + git clean -f -d + test_path_is_missing almost_git + test_path_is_missing almost_bare_git + test_path_is_missing almost_submodule +' + +test_expect_success 'should not clean submodules' ' + rm -fr repo to_clean sub1 sub2 + mkdir repo to_clean + ( + cd repo + git init + test_commit msg hello.world + ) + git submodule add ./repo/.git sub1 + git commit -m sub1 + git branch before_sub2 + git submodule add ./repo/.git sub2 + git commit -m sub2 + git checkout before_sub2 + to_clean/should_clean.this + git clean -f -d + test_path_is_file repo/.git/index + test_path_is_file repo/hello.world + test_path_is_file sub1/.git + test_path_is_file sub1/hello.world + test_path_is_file sub2/.git + test_path_is_file sub2/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'nested (empty) git should be kept' ' + rm -fr empty_repo to_clean + git init empty_repo + mkdir to_clean + to_clean/should_clean.this + git clean -f -d + test_path_is_file empty_repo/.git/HEAD + test_path_is_missing to_clean +' + +test_expect_success 'nested bare repositories should be cleaned' ' + rm -fr bare1 bare2 subdir + git init --bare bare1 + git clone --local --bare . bare2 + mkdir subdir + cp -r bare2 subdir/bare3 + git clean -f -d + test_path_is_missing bare1 + test_path_is_missing bare2 + test_path_is_missing subdir +' + +test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git init --bare strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_failure 'nested (non-empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git clone --local --bare . strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr repo + mkdir repo + ( + cd repo + git init + mkdir -p bar/baz + test_commit msg bar/baz/hello.world + ) + git clean -f -d repo/bar/baz + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/bar/ + test_path_is_missing repo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr repo + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git + test_path_is_file repo/.git/HEAD + test_path_is_dir repo/.git/refs + test_path_is_dir repo/.git/objects + test_path_is_dir untracked/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr repo untracked + mkdir repo untracked + ( + cd repo + git init + test_commit msg hello.world + ) + git clean -f -d repo/.git/ + test_path_is_dir repo/.git + test_dir_is_empty repo/.git + test_path_is_dir untracked/ +' + test_expect_success 'force removal of nested git work tree' ' rm -fr foo bar baz mkdir -p foo bar baz/boo -- 2.4.0.rc3.8.g4ebd28d -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 1/5] setup: add gentle version of read_gitfile
read_gitfile will die on most error cases. This makes it unsuitable for speculative calls. Extract the core logic and provide a gentle version that returns NULL on failure. The first usecase of the new gentle version will be to probe for submodules during git clean. Helped-by: Junio C Hamano gits...@pobox.com Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- cache.h | 3 ++- setup.c | 82 +++-- 2 files changed, 67 insertions(+), 18 deletions(-) diff --git a/cache.h b/cache.h index 3d3244b..6e29068 100644 --- a/cache.h +++ b/cache.h @@ -431,7 +431,8 @@ extern int set_git_dir(const char *path); extern const char *get_git_namespace(void); extern const char *strip_namespace(const char *namespaced_ref); extern const char *get_git_work_tree(void); -extern const char *read_gitfile(const char *path); +extern const char *read_gitfile_gently(const char *path, int *return_error_code); +#define read_gitfile(path) read_gitfile_gently((path), NULL) extern const char *resolve_gitdir(const char *suspect); extern void set_git_work_tree(const char *tree); diff --git a/setup.c b/setup.c index 979b13f..e1897cc 100644 --- a/setup.c +++ b/setup.c @@ -335,35 +335,53 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok) /* * Try to read the location of the git directory from the .git file, * return path to git directory if found. + * + * On failure, if return_error_code is not NULL, return_error_code + * will be set to an error code and NULL will be returned. If + * return_error_code is NULL the function will die instead (for most + * cases). */ -const char *read_gitfile(const char *path) +const char *read_gitfile_gently(const char *path, int *return_error_code) { - char *buf; - char *dir; + int error_code = 0; + char *buf = NULL; + char *dir = NULL; const char *slash; struct stat st; int fd; ssize_t len; - if (stat(path, st)) - return NULL; - if (!S_ISREG(st.st_mode)) - return NULL; + if (stat(path, st)) { + error_code = 1; + goto cleanup_return; + } + if (!S_ISREG(st.st_mode)) { + error_code = 2; + goto cleanup_return; + } fd = open(path, O_RDONLY); - if (fd 0) - die_errno(Error opening '%s', path); + if (fd 0) { + error_code = 3; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); - if (len != st.st_size) - die(Error reading %s, path); + if (len != st.st_size) { + error_code = 4; + goto cleanup_return; + } buf[len] = '\0'; - if (!starts_with(buf, gitdir: )) - die(Invalid gitfile format: %s, path); + if (!starts_with(buf, gitdir: )) { + error_code = 5; + goto cleanup_return; + } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; - if (len 9) - die(No path in gitfile: %s, path); + if (len 9) { + error_code = 6; + goto cleanup_return; + } buf[len] = '\0'; dir = buf + 8; @@ -378,11 +396,41 @@ const char *read_gitfile(const char *path) buf = dir; } - if (!is_git_directory(dir)) - die(Not a git repository: %s, dir); + if (!is_git_directory(dir)) { + error_code = 7; + goto cleanup_return; + } path = real_path(dir); +cleanup_return: free(buf); + + if (return_error_code) + *return_error_code = error_code; + + if (error_code) { + if (return_error_code) + return NULL; + + switch (error_code) { + case 1: // failed to stat + case 2: // not regular file + return NULL; + case 3: + die_errno(Error opening '%s', path); + case 4: + die(Error reading %s, path); + case 5: + die(Invalid gitfile format: %s, path); + case 6: + die(No path in gitfile: %s, path); + case 7: + die(Not a git repository: %s, dir); + default: + assert(0); + } + } + return path; } -- 2.4.0.rc3.8.g4ebd28d -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 5/5] clean: improve performance when removing lots of directories
git clean uses resolve_gitlink_ref() to check for the presence of nested git repositories, but it has the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list causes a massive performance hit for large number of directories. Modify clean.c:remove_dirs to use setup.c:is_git_directory and setup.c:read_gitfile_gently instead. Both these functions will open files and parse contents when they find something that looks like a git repository. This is ok from a performance standpoint since finding repository candidates should be comparatively rare. Using is_git_directory and read_gitfile_gently should give a more standardized check for what is and what isn't a git repository but also gives three behavioral changes. The first change is that we will now detect and avoid cleaning empty nested git repositories (only init run). This is desirable. Second, we will no longer die when cleaning a file named .git with garbage content (it will be cleaned instead). This is also desirable. The last change is that we will detect and avoid cleaning empty bare repositories that have been placed in a directory named .git. This is not desirable but should have no real user impact since we already fail to clean non-empty bare repositories in the same scenario. This is thus deemed acceptable. Update t7300 to reflect these changes in behavior. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- builtin/clean.c | 26 ++ t/t7300-clean.sh | 8 +++- 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..17a1c13 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,27 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +/* + * Return 1 if the given path is the root of a git repository or + * submodule else 0. Will not return 1 for bare repositories with the + * exception of creating a bare repository in foo/.git and calling + * is_git_repository(foo). + */ +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + int unused_error_code; + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (read_gitfile_gently(path-buf, unused_error_code) || is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +175,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 11f3a6d..4d07a4a 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'should clean things that almost look like git but are not' ' +test_expect_success 'should clean things that almost look like git but are not' ' rm -fr almost_git almost_bare_git almost_submodule mkdir -p almost_git/.git/objects mkdir -p almost_git/.git/refs @@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look like git but are not' garbage EOF test_when_finished rm -rf almost_* - ## This will fail due to die(Invalid gitfile format: %s, path); in - ## setup.c:read_gitfile. git clean -f -d test_path_is_missing almost_git test_path_is_missing almost_bare_git @@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' ' test_path_is_missing to_clean ' -test_expect_failure 'nested (empty) git should be kept' ' +test_expect_success 'nested (empty) git should be kept' ' rm -fr
[PATCH v4 4/5] p7300: add performance tests for clean
The tests are run in dry-run mode to avoid having to restore the test directories for each timed iteration. Using dry-run is an acceptable compromise since we are mostly interested in the initial computation of what to clean and not so much in the cleaning it self. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 31 +++ 1 file changed, 31 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..6ae55ec --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,31 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir + mkdir 500_sub_dirs 10_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 200) + do + cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + git clean -n -q -f -d 10_sub_dirs/ +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + git clean -n -q -f -f -d 10_sub_dirs/ +' + +test_done -- 2.4.0.rc3.8.g4ebd28d -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 2/5] setup: sanity check file size in read_gitfile_gently
read_gitfile_gently will allocate a buffer to fit the entire file that should be read. Add a sanity check of the file size before opening to avoid allocating a potentially huge amount of memory if we come across a large file that someone happened to name .git. The limit is set to a sufficiently unreasonable size that should never be exceeded by a genuine .git file. Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- I'm not sure about this one but it felt like the safe thing to do. This patch can be dropped if it is not desired. I considered testing it using mkdir foo truncate -s 200G foo/.git git clean -f -d but that feels like a pretty evil test that is likely to cause lots of problems and not fail in any good way. setup.c | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/setup.c b/setup.c index e1897cc..ed87334 100644 --- a/setup.c +++ b/setup.c @@ -364,22 +364,26 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) error_code = 3; goto cleanup_return; } + if (st.st_size PATH_MAX * 4) { + error_code = 4; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); if (len != st.st_size) { - error_code = 4; + error_code = 5; goto cleanup_return; } buf[len] = '\0'; if (!starts_with(buf, gitdir: )) { - error_code = 5; + error_code = 6; goto cleanup_return; } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; if (len 9) { - error_code = 6; + error_code = 7; goto cleanup_return; } buf[len] = '\0'; @@ -397,7 +401,7 @@ const char *read_gitfile_gently(const char *path, int *return_error_code) } if (!is_git_directory(dir)) { - error_code = 7; + error_code = 8; goto cleanup_return; } path = real_path(dir); @@ -419,12 +423,14 @@ cleanup_return: case 3: die_errno(Error opening '%s', path); case 4: - die(Error reading %s, path); + die(Too large to be a .git file: '%s', path); case 5: - die(Invalid gitfile format: %s, path); + die(Error reading %s, path); case 6: - die(No path in gitfile: %s, path); + die(Invalid gitfile format: %s, path); case 7: + die(No path in gitfile: %s, path); + case 8: die(Not a git repository: %s, dir); default: assert(0); -- 2.4.0.rc3.8.g4ebd28d -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 0/5] Improving performance of git clean
v3 of the patch can be found here: http://thread.gmane.org/gmane.comp.version-control.git/267422 Changes in v4: * changed some tests to use more meaningful dir names. * fixed performance test by doing git clean -n to avoid timing setup code. Increased test size to 10 directories (~0.5s runtime). * changed interface of read_gitfile_gently to be able to return error code. * fixed a compiler warning in read_gitfile_gently (warning: ‘dir’ may be used uninitialized in this function). * added sanity check of git file size in read_gitfile_gently * updated commit message in [5/5] to more clearly motivate remaining behavioral changes of git clean. Thanks to Junio C Hamano and Jeff King for comments and help on v3. Erik Elfström (5): setup: add gentle version of read_gitfile setup: sanity check file size in read_gitfile_gently t7300: add tests to document behavior of clean and nested git p7300: add performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 26 +-- cache.h | 3 +- setup.c | 88 --- t/perf/p7300-clean.sh | 31 + t/t7300-clean.sh | 126 ++ 5 files changed, 252 insertions(+), 22 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.0.rc3.8.g4ebd28d -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC v3 0/4] Improving performance of git clean
On Tue, Apr 21, 2015 at 11:24 PM, Jeff King p...@peff.net wrote: If I understand correctly, the reason that you need per-run setup is that your git clean command actually cleans things, and you need to restore the original state for each time-trial. Can you instead use git clean -n to do a dry-run? I think what you are timing is really the figure out what to clean step, and not the cleaning itself. -Peff Yes, that is the problem. A dry run will spot this particular performance issue but maybe we lose some value as a general performance test if we only do half the clean? Admittedly we clearly lose some value in the current state as well due to the copying taking more time than the cleaning. I could go either way here. /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC v3 0/4] Improving performance of git clean
On Wed, Apr 22, 2015 at 9:46 PM, Jeff King p...@peff.net wrote: On Wed, Apr 22, 2015 at 09:30:20PM +0200, erik elfström wrote: Yes, that is the problem. A dry run will spot this particular performance issue but maybe we lose some value as a general performance test if we only do half the clean? Admittedly we clearly lose some value in the current state as well due to the copying taking more time than the cleaning. I could go either way here. I guess it is a matter of opinion. I think testing only the find out what to clean half separately is actually beneficial, because it helps us isolate any slowdown. If we want to add a test for the other half, we can, but I do not actually think it is currently that interesting (it is just calling unlink() in a loop). So even leaving the practical matters aside, I do not think it is a bad thing to split it up. When you add in the fact that it is practically much easier to test the first half, it seems to me that testing just that is a good first step. -Peff Sounds reasonable to me. I'll make this change in v4, thanks! (Sorry for the duplicate email Jeff, I'm bad at this mailing list thing...) /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC v3 0/4] Improving performance of git clean
On Sun, Apr 19, 2015 at 3:14 AM, Junio C Hamano gits...@pobox.com wrote: Erik Elfström erik.elfst...@gmail.com writes: Known Problems: * Unsure about the setup.c:read_gitfile refactor, feels a bit messy? The interface indeed feels somewhat messy. I suspect that a better interface might be more like setup_git_directory_gently() that is a gentler version of setup_git_directory(). The function takes an optional pointer to an int, and it behaves not-gently-at-all when the optional argument is NULL. The location the optional pointer points at, when it is not NULL, can be used to return the error state to the caller. That function pair only uses the optional argument to convey one bit of information (i.e. are we in any git repository or not?) back to the caller, but the interface could be used to tell the caller a lot more if we wanted to. If you model read_gitfile_gently() after that pattern, I would expect that - The extra pattern would be int *error; - The implementation of read_gitfile() would be #define read_gitfile(path) read_gitfile_gently((path), NULL) and the _gently() version will die when error parameter is set to NULL and finds any error. - The caller of the gentle variant can use the error code to determine what went wrong, not just the fact that it failed. I do not think your caller does not have an immediate need to tell between invalid gitfile format and No path in gitfile, but such an interface leaves that possibility open. Thanks. Ok, I'll try that approach, thanks. (and apologies for the late reply) /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC v3 0/4] Improving performance of git clean
Ok, thanks for looking into this. I have no well founded opinions on the implementation but I do think the performance tests would be more meaningful if the setup/cleanup code could be removed from the timed section. If the community agrees on an implementation I would be happy to convert the new tests, either directly in this series or as a follow up if that is preferred. /Erik On Tue, Apr 21, 2015 at 12:14 AM, Thomas Gummerer t.gumme...@gmail.com wrote: On 04/18, Erik Elfström wrote: * Still have issues in the performance tests, see comments from Thomas Gummerer on v2 I've looked at the modern style tests again, and I don't the code churn is worth it just for using them for the performance tests. If anyone wants to take a look at the code, it's at github.com/tgummerer/git tg/perf-lib. I think adding the test_perf_setup_cleanup command would make more sense in this case. If you want I can send a patch for that. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC v3 4/4] clean: improve performance when removing lots of directories
git clean uses resolve_gitlink_ref() to check for the presence of nested git repositories, but it has the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list causes a massive performance hit for large number of directories. Modify clean.c:remove_dirs to use setup.c:is_git_directory and setup.c:read_gitfile_gently instead. Both these functions will open files and parse contents when they find something that looks like a git repository. This is ok from a performance standpoint since finding repository candidates should be comparatively rare. Using is_git_directory and read_gitfile_gently should give a more standardized check for what is and what isn't a git repository but also gives a slight behavioral change. We will now detect and respect empty nested git repositories (only init run) and empty bare repositories that have been placed in a .git directory. We will also no longer die when cleaning a file named .git with garbage content (it will be cleaned instead). Update t7300 to reflect this. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- builtin/clean.c | 25 + t/t7300-clean.sh | 8 +++- 2 files changed, 24 insertions(+), 9 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..5cda3c5 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,26 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +/* + * Return 1 if the given path is the root of a git repository or + * submodule else 0. Will not return 1 for bare repositories with the + * exception of creating a bare repository in foo/.git and calling + * is_git_repository(foo). + */ +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (read_gitfile_gently(path-buf) || is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +174,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 4b9a72a..1bbb8ef 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'should clean things that almost look like git but are not' ' +test_expect_success 'should clean things that almost look like git but are not' ' rm -fr almost_git almost_bare_git almost_submodule mkdir -p almost_git/.git/objects mkdir -p almost_git/.git/refs @@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look like git but are not' garbage EOF test_when_finished rm -rf almost_* - ## This will fail due to die(Invalid gitfile format: %s, path); in - ## setup.c:read_gitfile. git clean -f -d test_path_is_missing almost_git test_path_is_missing almost_bare_git @@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' ' test_path_is_missing to_clean ' -test_expect_failure 'nested (empty) git should be kept' ' +test_expect_success 'nested (empty) git should be kept' ' rm -fr foo bar git init foo mkdir bar @@ -523,7 +521,7 @@ test_expect_success 'nested bare repositories should be cleaned' ' test_path_is_missing subdir ' -test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' ' +test_expect_failure 'nested (empty) bare repositories should be cleaned even when in .git' ' rm -fr
[PATCH/RFC v3 2/4] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 127 +++ 1 file changed, 127 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..4b9a72a 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,133 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'should clean things that almost look like git but are not' ' + rm -fr almost_git almost_bare_git almost_submodule + mkdir -p almost_git/.git/objects + mkdir -p almost_git/.git/refs + cat almost_git/.git/HEAD -\EOF + garbage + EOF + cp -r almost_git/.git/ almost_bare_git + mkdir almost_submodule/ + cat almost_submodule/.git -\EOF + garbage + EOF + test_when_finished rm -rf almost_* + ## This will fail due to die(Invalid gitfile format: %s, path); in + ## setup.c:read_gitfile. + git clean -f -d + test_path_is_missing almost_git + test_path_is_missing almost_bare_git + test_path_is_missing almost_submodule +' + +test_expect_success 'should not clean submodules' ' + rm -fr repo to_clean sub1 sub2 + mkdir repo to_clean + ( + cd repo + git init + test_commit msg hello.world + ) + git submodule add ./repo/.git sub1 + git commit -m sub1 + git branch before_sub2 + git submodule add ./repo/.git sub2 + git commit -m sub2 + git checkout before_sub2 + to_clean/should_clean.this + git clean -f -d + test_path_is_file repo/.git/index + test_path_is_file repo/hello.world + test_path_is_file sub1/.git + test_path_is_file sub1/hello.world + test_path_is_file sub2/.git + test_path_is_file sub2/hello.world + test_path_is_missing to_clean +' + +test_expect_failure 'nested (empty) git should be kept' ' + rm -fr foo bar + git init foo + mkdir bar + bar/goodbye.people + git clean -f -d + test_path_is_file foo/.git/HEAD + test_path_is_missing bar +' + +test_expect_success 'nested bare repositories should be cleaned' ' + rm -fr bare1 bare2 subdir + git init --bare bare1 + git clone --local --bare . bare2 + mkdir subdir + cp -r bare2 subdir/bare3 + git clean -f -d + test_path_is_missing bare1 + test_path_is_missing bare2 + test_path_is_missing subdir +' + +test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git init --bare strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_failure 'nested (non-empty) bare repositories should be cleaned even when in .git' ' + rm -fr strange_bare + mkdir strange_bare + git clone --local --bare . strange_bare/.git + git clean -f -d + test_path_is_missing strange_bare +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr foo + mkdir foo + ( + cd foo + git init + mkdir -p bar/baz + test_commit msg bar/baz/hello.world + ) + git clean -f -d foo/bar/baz + test_path_is_file foo/.git/HEAD + test_path_is_dir foo/bar/ + test_path_is_missing foo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr foo + mkdir foo bar + ( + cd foo + git init + test_commit msg hello.world + ) + git clean -f -d foo/.git + test_path_is_file foo/.git/HEAD + test_path_is_dir foo/.git/refs + test_path_is_dir foo/.git/objects + test_path_is_dir bar/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init + test_commit msg hello.world + ) + git clean -f -d foo/.git/ + test_path_is_dir foo/.git + test_dir_is_empty foo/.git +' + test_expect_success 'force removal of nested git work tree' ' rm -fr foo bar baz mkdir -p foo bar baz/boo -- 2.4.0.rc2.5.g2871d5e -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC v3 0/4] Improving performance of git clean
I've marked this RFC since there are known problems here. v2 of the patch can be found here: http://thread.gmane.org/gmane.comp.version-control.git/267023/focus=267023 Changes in v3: * Created setup.c:read_gitfile_gently to use for submodule probing * Cleanup of some tests by use of test_commit helper * Added more tests of cleaning in the presence of submodules * Reversed expectation of test for cleaning nested bare repos. They are now expected to be cleaned. Added one more case. * Fixed bug where submodules could be cleaned by using new read_gitfile_gently for additional submodule check in clean.c:is_git_repository * Attempt to change behavior of patch implementation to clean bare repositories (only partially successful) * Reworded commit message of the performance fix commit Known Problems: * Unsure about the setup.c:read_gitfile refactor, feels a bit messy? * Potentially a missing sanity check of git file size in setup.c:read_gitfile_gently_or_non_gently * We still get a behavioral change for empty bare repositories placed in a .git directory. Currently we clean empty bare repos in a .git folder but not non-empty one. After this patch we won't clean either. How serious is this? Is there an easy fix (preferebly to clean all bare repositories)? * Still have issues in the performance tests, see comments from Thomas Gummerer on v2 Thanks to Junio C Hamano and Jeff King for spotting fundamental problems in v2 and suggesting a solution. Erik Elfström (4): setup: add gentle version of read_gitfile t7300: add tests to document behavior of clean and nested git p7300: add performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 25 -- cache.h | 1 + setup.c | 94 - t/perf/p7300-clean.sh | 37 +++ t/t7300-clean.sh | 125 ++ 5 files changed, 257 insertions(+), 25 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.0.rc2.5.g2871d5e -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC v3 1/4] setup: add gentle version of read_gitfile
read_gitfile will die on most error cases. This makes it unsuitable for speculative calls. Extract the core logic and provide a gentle version that returns NULL on failure. The first usecase of the new gentle version will be to probe for submodules during git clean. Helped-by: Junio C Hamano gits...@pobox.com Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- If this is going to be used for speculative probing should there be a sanity check before: buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); Something like: if (st.st_size PATH_MAX*2) { error = N; goto cleanup_return; { What do you think? cache.h | 1 + setup.c | 94 ++--- 2 files changed, 74 insertions(+), 21 deletions(-) diff --git a/cache.h b/cache.h index 3d3244b..9d9199d 100644 --- a/cache.h +++ b/cache.h @@ -432,6 +432,7 @@ extern const char *get_git_namespace(void); extern const char *strip_namespace(const char *namespaced_ref); extern const char *get_git_work_tree(void); extern const char *read_gitfile(const char *path); +extern const char *read_gitfile_gently(const char *path); extern const char *resolve_gitdir(const char *suspect); extern void set_git_work_tree(const char *tree); diff --git a/setup.c b/setup.c index 979b13f..a33b293 100644 --- a/setup.c +++ b/setup.c @@ -332,38 +332,46 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok) return 0; } -/* - * Try to read the location of the git directory from the .git file, - * return path to git directory if found. - */ -const char *read_gitfile(const char *path) +static const char *read_gitfile_gently_or_non_gently(const char *path, int gently) { - char *buf; + int error = 0; + char *buf = NULL; char *dir; const char *slash; struct stat st; int fd; ssize_t len; - - if (stat(path, st)) - return NULL; - if (!S_ISREG(st.st_mode)) - return NULL; + if (stat(path, st)) { + error = 1; + goto cleanup_return; + } + if (!S_ISREG(st.st_mode)) { + error = 2; + goto cleanup_return; + } fd = open(path, O_RDONLY); - if (fd 0) - die_errno(Error opening '%s', path); + if (fd 0) { + error = 3; + goto cleanup_return; + } buf = xmalloc(st.st_size + 1); len = read_in_full(fd, buf, st.st_size); close(fd); - if (len != st.st_size) - die(Error reading %s, path); + if (len != st.st_size) { + error = 4; + goto cleanup_return; + } buf[len] = '\0'; - if (!starts_with(buf, gitdir: )) - die(Invalid gitfile format: %s, path); + if (!starts_with(buf, gitdir: )) { + error = 5; + goto cleanup_return; + } while (buf[len - 1] == '\n' || buf[len - 1] == '\r') len--; - if (len 9) - die(No path in gitfile: %s, path); + if (len 9) { + error = 6; + goto cleanup_return; + } buf[len] = '\0'; dir = buf + 8; @@ -378,14 +386,58 @@ const char *read_gitfile(const char *path) buf = dir; } - if (!is_git_directory(dir)) - die(Not a git repository: %s, dir); + if (!is_git_directory(dir)) { + error = 7; + goto cleanup_return; + } path = real_path(dir); +cleanup_return: free(buf); + + if (error) { + if (gently) + return NULL; + + switch (error) { + case 1: // failed to stat + case 2: // not regular file + return NULL; + case 3: + die_errno(Error opening '%s', path); + case 4: + die(Error reading %s, path); + case 5: + die(Invalid gitfile format: %s, path); + case 6: + die(No path in gitfile: %s, path); + case 7: + die(Not a git repository: %s, dir); + default: + assert(0); + } + } + return path; } +/* + * Try to read the location of the git directory from the .git file, + * return path to git directory if found, die on (most) failures. + */ +const char *read_gitfile(const char *path) +{ + return read_gitfile_gently_or_non_gently(path, 0); +} + +/* + * Same as read_gitfile but return NULL on failure. + */ +const char *read_gitfile_gently(const char *path) +{ + return read_gitfile_gently_or_non_gently(path, 1); +} + static const char
[PATCH/RFC v3 3/4] p7300: add performance tests for clean
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 37 + 1 file changed, 37 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..af50d5d --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,37 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 100) + do + cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -d clean_test_dir/ + test_dir_is_empty clean_test_dir +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -f -d clean_test_dir/ + test_dir_is_empty clean_test_dir +' + +test_done -- 2.4.0.rc2.5.g2871d5e -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories
On Wed, Apr 15, 2015 at 7:56 PM, Junio C Hamano gits...@pobox.com wrote: Erik Elfström erik.elfst...@gmail.com writes: Before this change, clean used resolve_gitlink_ref to check for the presence of nested git repositories. This had the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list caused a massive performance hit for large number of directories. I'd prefer to see the current state described in the current tense, e.g. git clean uses resolve_gitlink_ref() to check for the presence of nested git repositories, but it has the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list causes a massive performance hit for large number of directories. Yes, that reads better. Teach clean.c:remove_dirs to use setup.c:is_git_directory instead. is_git_directory will actually open HEAD and parse the HEAD ref but this implies a nested git repository and should be rare when cleaning. I am not sure what you wanted to say in this paragraph. What does it being rare have to do with it? Even if it is not rare (i.e. the top-level project you are working with has many submodules checked out without using the more recent a file .git pointing into .git/modules/ via 'gitdir: $overThere' mechanism), if we found a nested git repository, we treat it as special and exclude it from cleaning it out, which is a good thing, no? I was trying to motivate that the performance of is_git_directory is not a problem for us even though it opens a file and parses it. I see now when I read it again that this is not very clear. Doesn't this implementation get confused by modern submodule checkouts and descend into and clean their working tree, though? Module M with path P would have a directory P in the working tree of the top-level project, and P/.git is a regular file that will fail is_git_directory() test but records the location of the real submodule repository i.e. .git/modules/M via the gitdir: mechanism. Yes, there is a problem here. I've added the test below and it fails after my change by cleaning sub2 (sub1 is not cleaned). Are there more cases here that I should test for? +test_expect_success 'should not clean submodules' ' + rm -fr repo to_clean sub1 sub2 + mkdir -p repo to_clean + ( + cd repo + git init + hello.world + git add . + git commit -a -m nested + ) + git submodule add ./repo/.git sub1 + git commit -m sub1 + git branch before_sub2 + git submodule add ./repo/.git sub2 + git commit -m sub2 + git checkout before_sub2 + to_clean/should_clean.this + git clean -f -d + test_path_is_file repo/.git/index + test_path_is_file repo/hello.world + test_path_is_file sub1/.git + test_path_is_file sub1/hello.world + test_path_is_file sub2/.git + test_path_is_file sub2/hello.world + test_path_is_missing to_clean +' Base on the previous discussion of the patch topic I can see 3 options for how to fix this: Option 1: Plug the hole in my new is_git_repository function. A quick and dirty fix that passes the above test would be: diff --git a/builtin/clean.c b/builtin/clean.c index b679913..4f2fe95 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -153,14 +153,21 @@ static int is_git_repository(struct strbuf *path) if (is_git_directory(path-buf)) ret = 1; else { - size_t orig_path_len = path-len; - assert(orig_path_len != 0); - if (path-buf[orig_path_len - 1] != '/') - strbuf_addch(path, '/'); - strbuf_addstr(path, .git); - if (is_git_directory(path-buf)) - ret = 1; - strbuf_setlen(path, orig_path_len); + struct stat st; + const char *submodule_git_dir = git_path_submodule(path-buf, ); + lstat(submodule_git_dir, st); + if (S_ISDIR(st.st_mode)) + ret = 1; + else { + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + } } return ret; There are probably more elegant solutions available here, suggestions welcome. Option 2: Go with the current solution of using resolve_gitlink_ref but either A) avoid placing non-submodule paths in the ref_cache list
Re: [PATCH v2 2/3] p7300: add performance tests for clean
On Sat, Apr 11, 2015 at 7:59 PM, Thomas Gummerer t.gumme...@gmail.com wrote: On 04/11, Erik Elfström wrote: Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 37 + 1 file changed, 37 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..af50d5d --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,37 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 100) + do + cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy Maybe this would be a good place to use test_perf_cleanup, which I introduced a while ago and you can find in the tg/perf-lib-test-perf-cleanup branch? It probably won't influence the performance a lot, but still better separate the code that actually needs to be tested from the cleanup/preparation code. Ditto in the other test. Yes, that would be a clear improvement. I was looking for something like this, the copy takes more time than the clean currently. The cleanup hook is maybe not exactly the right fit here though. I would need to do one initial copy in the setup test and then a copy in the cleanup, something like this: test_expect_success 'setup untracked directory with many sub dirs' ' ... cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy ' test_perf_cleanup 'clean many untracked sub dirs, check for nested git' ' git clean -q -f -d clean_test_dir/ ' ' test_dir_is_empty clean_test_dir rm -rf clean_test_dir/5_sub_dirs_cpy cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy ' This works better than my original code but maybe we can do even better with something like: test_setup_perf_cleanup 'clean many untracked sub dirs, check for nested git' ' rm -rf clean_test_dir/5_sub_dirs_cpy cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy ' ' git clean -q -f -d clean_test_dir/ ' ' test_dir_is_empty clean_test_dir ' Having a setup phase avoids the initial copy in the setup test making things a little easier to follow. I'm not sure its worth the extra complexity in perf-lib though (and I'm not sure I would be able to implement it either). Also, what would the implications be if I were to use your new cleanup function that is not yet on master? Should I rebase on top of your topic or make a follow up patch to switch over? + git clean -q -f -d clean_test_dir/ + test_dir_is_empty clean_test_dir +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -f -d clean_test_dir/ + test_dir_is_empty clean_test_dir +' + +test_done -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/3] Improving performance of git clean
v1 of the patch can be found here: http://thread.gmane.org/gmane.comp.version-control.git/266839/focus=266839 changes in v2: * fixed commit message, p7300: added performance tests for clean change to: p7300: add performance tests for clean * simplified test code * removed non portable ls -A in test * removed non portable $(seq ) in test * fixed missing || return $? in test * fixed missing sub shell for 'cd' command in test * fixed broken chains in test * added assert new clean.c:is_git_repository to guard against negative array index * use size_t instead of int for strbuf-len fixes held back for cleanup patches: * fixed existing broken chains * added assert in existing code to guard against negative array index Thanks to Eric Sunshine and Torsten Bögershausen for the very helpful review! Erik Elfström (3): t7300: add tests to document behavior of clean and nested git p7300: add performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 24 ++--- t/perf/p7300-clean.sh | 37 ++ t/t7300-clean.sh | 72 +++ 3 files changed, 129 insertions(+), 4 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/3] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/t7300-clean.sh | 72 1 file changed, 72 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..58e6b4a 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,78 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'nested git (only init) should be kept' ' + rm -fr foo bar + git init foo + mkdir bar + bar/goodbye.people + git clean -f -d + test_path_is_file foo/.git/HEAD + test_path_is_missing bar +' + +test_expect_failure 'nested git (bare) should be kept' ' + rm -fr foo bar + git init --bare foo + mkdir bar + bar/goodbye.people + git clean -f -d + test_path_is_file foo/HEAD + test_path_is_missing bar +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr foo + mkdir foo + ( + cd foo + git init + mkdir -p bar/baz + ( + cd bar/baz + hello.world + git add . + git commit -a -m nested + ) + ) + git clean -f -d foo/bar/baz + test_path_is_file foo/.git/HEAD + test_path_is_dir foo/bar/ + test_path_is_missing foo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr foo + mkdir foo bar + ( + cd foo + git init + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/.git + test_path_is_file foo/.git/HEAD + test_path_is_dir foo/.git/refs + test_path_is_dir foo/.git/objects + test_path_is_dir bar/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/.git/ + test_path_is_dir foo/.git + test_dir_is_empty foo/.git +' + test_expect_success 'force removal of nested git work tree' ' rm -fr foo bar baz mkdir -p foo bar baz/boo -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/3] clean: improve performance when removing lots of directories
Before this change, clean used resolve_gitlink_ref to check for the presence of nested git repositories. This had the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list caused a massive performance hit for large number of directories. Teach clean.c:remove_dirs to use setup.c:is_git_directory instead. is_git_directory will actually open HEAD and parse the HEAD ref but this implies a nested git repository and should be rare when cleaning. Using is_git_directory should give a more standardized check for what is and what isn't a git repository but also gives a slight behavioral change. We will now detect and respect bare and empty nested git repositories (only init run). Update t7300 to reflect this. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Helped-by: Jeff King p...@peff.net Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- builtin/clean.c | 24 t/t7300-clean.sh | 4 ++-- 2 files changed, 22 insertions(+), 6 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..b679913 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,25 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + if (is_git_directory(path-buf)) + ret = 1; + else { + size_t orig_path_len = path-len; + assert(orig_path_len != 0); + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + } + + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +173,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 58e6b4a..da294fe 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'nested git (only init) should be kept' ' +test_expect_success 'nested git (only init) should be kept' ' rm -fr foo bar git init foo mkdir bar @@ -465,7 +465,7 @@ test_expect_failure 'nested git (only init) should be kept' ' test_path_is_missing bar ' -test_expect_failure 'nested git (bare) should be kept' ' +test_expect_success 'nested git (bare) should be kept' ' rm -fr foo bar git init --bare foo mkdir bar -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/3] p7300: add performance tests for clean
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 37 + 1 file changed, 37 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..af50d5d --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,37 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir + for i in $(test_seq 1 500) + do + mkdir 500_sub_dirs/dir$i || return $? + done + for i in $(test_seq 1 100) + do + cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -d clean_test_dir/ + test_dir_is_empty clean_test_dir +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -f -d clean_test_dir/ + test_dir_is_empty clean_test_dir +' + +test_done -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] p7300: added performance tests for clean
Will fix! Also I forgot to ask, does anyone have a good way of moving the copy out of the performance timing? After the fix this test spends more time copying than cleaning and that is not so good. I'm not very good at shell scripting and the only way I could think of was to make multiple copies at the start and then in the test check and clean the first non empty directory. That feels kind of ugly and will fail if a different number of iterations per test is used. Any suggestions? I looked a bit at the framework code but I couldn't figure out if it was easy to add a setup option to be called be before each iteration. On Tue, Apr 7, 2015 at 12:09 AM, Eric Sunshine sunsh...@sunshineco.com wrote: On Mon, Apr 6, 2015 at 4:40 PM, Torsten Bögershausen tbo...@web.de wrote: On 2015-04-06 13.48, Erik Elfström wrote: Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..3f56fb2 --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,37 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir + for i in $(seq 1 500) I think seq is bash-only, and can be easily replaced by test_seq test_seq is definitely desirable. 'seq' is not present on some systems I use. + do + mkdir 500_sub_dirs/dir$i You may want: mkdir 500_sub_dirs/dir$i || return $? to catch failure of 'mkdir'. + done + for i in $(seq 1 100) + do + cp -r 500_sub_dirs 5_sub_dirs/dir$i Ditto: cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $? + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -d clean_test_dir/ + test 0 = $(ls -A clean_test_dir/ | wc -l) Is the ls -A portable on all systems: http://pubs.opengroup.org/onlinepubs/009695399/utilities/ls.html My feeling is that the emptiness of a directory can by tested by simply removing it: + git clean -q -f -d clean_test_dir/ + rmdir clean_test_dir (And similar below) There's also the test_dir_is_empty() function which makes the intent even clearer than 'rmdir'. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] clean: improve performance when removing lots of directories
On Tue, Apr 7, 2015 at 12:10 AM, Eric Sunshine sunsh...@sunshineco.com wrote: On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström erik.elfst...@gmail.com wrote: Before this change, clean used resolve_gitlink_ref to check for the presence of nested git repositories. This had the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list caused a massive performance hit for large number of directories. Teach clean.c:remove_dirs to use setup.c:is_git_directory instead. is_git_directory will actually open HEAD and parse the HEAD ref but this implies a nested git repository and should be rare when cleaning. Using is_git_directory should give a more standardized check for what is and what isn't a git repository but also gives a slight behavioral change. We will now detect and respect bare and empty nested git repositories (only init run). Update t7300 to reflect this. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Impressive. Signed-off-by: Erik Elfström erik.elfst...@gmail.com Helped-by: Jeff King p...@peff.net It is customary for your sign-off to be last. More below... --- diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..e951bd9 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + if (is_git_directory(path-buf)) + ret = 1; + else { + int orig_path_len = path-len; + if (path-buf[orig_path_len - 1] != '/') Minor: I don't know how others feel about it, but I always find it a bit disturbing to see a potential negative array access without a safety check that orig_path_len is not 0, either directly in the conditional or as a documenting assert(). I think I would prefer to accept empty input and return false rather than assert. What to you think about: static int is_git_repository(struct strbuf *path) { int ret = 0; size_t orig_path_len = path-len; if (orig_path_len == 0) ret = 0; else if (is_git_directory(path-buf)) ret = 1; else { if (path-buf[orig_path_len - 1] != '/') strbuf_addch(path, '/'); strbuf_addstr(path, .git); if (is_git_directory(path-buf)) ret = 1; strbuf_setlen(path, orig_path_len); } return ret; } Also I borrowed this pattern from remove_dirs and it has the same problem. Should I add something like this as a separate commit? diff --git a/builtin/clean.c b/builtin/clean.c index ccffd8a..88850e3 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -173,7 +173,8 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, DIR *dir; struct strbuf quoted = STRBUF_INIT; struct dirent *e; - int res = 0, ret = 0, gone = 1, original_len = path-len, len; + int res = 0, ret = 0, gone = 1; + size_t original_len = path-len, len; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; @@ -201,6 +202,7 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, return res; } + assert(original_len 0 expects non-empty path); if (path-buf[original_len - 1] != '/') strbuf_addch(path, '/'); + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + } + + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] t7300: add tests to document behavior of clean and nested git
will fix! On Tue, Apr 7, 2015 at 12:06 AM, Eric Sunshine sunsh...@sunshineco.com wrote: On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström erik.elfst...@gmail.com wrote: Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..cfdf6d4 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'nested git (only init) should be kept' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init + ) + ( + cd bar + goodbye.people + ) (minor; ignore if desired) The above could be simplified to: rm -fr foo bar git init foo mkdir bar bar/goodbye.people + git clean -f -d + test -f foo/.git/HEAD + ! test -d bar Here and elsewhere, you could instead use test_path_is_file() and test_path_is_missing(), respectively. +' + +test_expect_failure 'nested git (bare) should be kept' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init --bare + ) + ( + cd bar + goodbye.people + ) Simplified: rm -fr foo bar git init --bare foo mkdir bar bar/goodbye.people + git clean -f -d + test -f foo/HEAD + ! test -d bar +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr foo + mkdir foo + ( + cd foo + git init + mkdir -p bar/baz + cd bar/baz + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/bar/baz + test -f foo/.git/HEAD + test -d foo/bar/ Alternative, here and elsewhere: test_path_is_dir() + ! test -d foo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr foo + mkdir foo bar + ( + cd foo + git init + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/.git + test -f foo/.git/HEAD + test -d foo/.git/refs + test -d foo/.git/objects + test -d bar/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/.git/ + test 0 = $(ls -A foo/.git | wc -l) Although in the latest POSIX, -A may not be portable. Alternative: test_dir_is_empty() + test -d foo/.git +' + test_expect_success 'force removal of nested git work tree' ' rm -fr foo bar baz mkdir -p foo bar baz/boo -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] clean: improve performance when removing lots of directories
Before this change, clean used resolve_gitlink_ref to check for the presence of nested git repositories. This had the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list caused a massive performance hit for large number of directories. Teach clean.c:remove_dirs to use setup.c:is_git_directory instead. is_git_directory will actually open HEAD and parse the HEAD ref but this implies a nested git repository and should be rare when cleaning. Using is_git_directory should give a more standardized check for what is and what isn't a git repository but also gives a slight behavioral change. We will now detect and respect bare and empty nested git repositories (only init run). Update t7300 to reflect this. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Signed-off-by: Erik Elfström erik.elfst...@gmail.com Helped-by: Jeff King p...@peff.net --- builtin/clean.c | 23 +++ t/t7300-clean.sh | 4 ++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..e951bd9 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include cache.h #include dir.h #include parse-options.h -#include refs.h #include string-list.h #include quote.h #include column.h @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + if (is_git_directory(path-buf)) + ret = 1; + else { + int orig_path_len = path-len; + if (path-buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, .git); + if (is_git_directory(path-buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + } + + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +172,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path-len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) - !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { + if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) is_git_repository(path)) { if (!quiet) { quote_path_relative(path-buf, prefix, quoted); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index cfdf6d4..ada569d 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'nested git (only init) should be kept' ' +test_expect_success 'nested git (only init) should be kept' ' rm -fr foo bar mkdir foo bar ( @@ -471,7 +471,7 @@ test_expect_failure 'nested git (only init) should be kept' ' ! test -d bar ' -test_expect_failure 'nested git (bare) should be kept' ' +test_expect_success 'nested git (bare) should be kept' ' rm -fr foo bar mkdir foo bar ( -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] t7300: add tests to document behavior of clean and nested git
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- These tests were added so that I could understand the corner case behaviors of clean and nested repositories and document the changes in behavior that this series will cause. The ones marked as expect failure will be changed to expect success later in the series. t/t7300-clean.sh | 82 1 file changed, 82 insertions(+) diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index 99be5d9..cfdf6d4 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' +test_expect_failure 'nested git (only init) should be kept' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init + ) + ( + cd bar + goodbye.people + ) + git clean -f -d + test -f foo/.git/HEAD + ! test -d bar +' + +test_expect_failure 'nested git (bare) should be kept' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init --bare + ) + ( + cd bar + goodbye.people + ) + git clean -f -d + test -f foo/HEAD + ! test -d bar +' + +test_expect_success 'giving path in nested git work tree will remove it' ' + rm -fr foo + mkdir foo + ( + cd foo + git init + mkdir -p bar/baz + cd bar/baz + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/bar/baz + test -f foo/.git/HEAD + test -d foo/bar/ + ! test -d foo/bar/baz +' + +test_expect_success 'giving path to nested .git will not remove it' ' + rm -fr foo + mkdir foo bar + ( + cd foo + git init + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/.git + test -f foo/.git/HEAD + test -d foo/.git/refs + test -d foo/.git/objects + test -d bar/ +' + +test_expect_success 'giving path to nested .git/ will remove contents' ' + rm -fr foo bar + mkdir foo bar + ( + cd foo + git init + hello.world + git add . + git commit -a -m nested + ) + git clean -f -d foo/.git/ + test 0 = $(ls -A foo/.git | wc -l) + test -d foo/.git +' + test_expect_success 'force removal of nested git work tree' ' rm -fr foo bar baz mkdir -p foo bar baz/boo -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] p7300: added performance tests for clean
Signed-off-by: Erik Elfström erik.elfst...@gmail.com --- t/perf/p7300-clean.sh | 37 + 1 file changed, 37 insertions(+) create mode 100755 t/perf/p7300-clean.sh diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh new file mode 100755 index 000..3f56fb2 --- /dev/null +++ b/t/perf/p7300-clean.sh @@ -0,0 +1,37 @@ +#!/bin/sh + +test_description=Test git-clean performance + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +test_expect_success 'setup untracked directory with many sub dirs' ' + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir + for i in $(seq 1 500) + do + mkdir 500_sub_dirs/dir$i + done + for i in $(seq 1 100) + do + cp -r 500_sub_dirs 5_sub_dirs/dir$i + done +' + +test_perf 'clean many untracked sub dirs, check for nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -d clean_test_dir/ + test 0 = $(ls -A clean_test_dir/ | wc -l) +' + +test_perf 'clean many untracked sub dirs, ignore nested git' ' + rm -rf clean_test_dir/5_sub_dirs_cpy + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy + git clean -q -f -f -d clean_test_dir/ + test 0 = $(ls -A clean_test_dir/ | wc -l) +' + +test_done -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Improving performance of git clean
This series addresses a performance issue of git clean previously discussed here: http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265560 and here: http://thread.gmane.org/gmane.comp.version-control.git/266777/focus=266777 The issue manifests when trying to clean a large number of untracked directories. In my case this scenario triggered by a test suite running in the repository creating a directory for each test resulting in a build directory with ~10 sub directories that needs to be cleaned. For some extreme cases, clean times of more than 1h have been observed. With this series, the time to clean an untracked directory containing 10 sub directories goes from 61s to 1.7s. The main change is to switch the repository check in clean.c:remove_dirs from using refs.c:resolve_gitlink_ref to setup.c:is_git_directory. One potential issue that is_git_directory contains the following check: if (getenv(DB_ENVIRONMENT)) { if (access(getenv(DB_ENVIRONMENT), X_OK)) return 0; } I'm not sure how this will affect this usecase (checking for some other nested git repo). Please give some thought to this when reviewing. Jeff King also expressed concerns that we may have similar performance issues in other commands and that it could be good to unify these is this a repo?-checks. This series only attempts to solve the git-clean case. Erik Elfström (3): t7300: add tests to document behavior of clean and nested git p7300: added performance tests for clean clean: improve performance when removing lots of directories builtin/clean.c | 23 --- t/perf/p7300-clean.sh | 37 +++ t/t7300-clean.sh | 82 +++ 3 files changed, 138 insertions(+), 4 deletions(-) create mode 100755 t/perf/p7300-clean.sh -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
git clean performance issues
Hi, I'm having a performance issue with git clean -qxfd (note, not using -ff). The performance issue shows up when trying to clean untracked directories that themselves contain many sub directories. The performance is highly non linear with the number of sub directories. Some test numbers: DirsTime 1 0m0.754s 5 0m16.606s 10 1m31.418s When running git clean -qxffd (note, using -ff) the performance is fast and linear: DirsTime 1 0m0.158s 5 0m0.918s 10 0m1.639s After checking the source of git-clean my understanding of the problem is as follows: When clean.c:cmd_clean finds a directory and the -d flag is given it will call clean.c:remove_dirs to potentially remove the directory and all sub directories. Unless -ff is given remove_dirs tries to be nice and not remove directories containing other git repositories. To do this it does the following check: ... if ((force_flag REMOVE_DIR_KEEP_NESTED_GIT) !resolve_gitlink_ref(path-buf, HEAD, submodule_head)) { ... The problem is that refs.c:resolve_gitlink_ref will call refs.c:get_ref_cache that will linearly search a linked list of cache entries and create and insert a new ref_cache entry in the list for each path it is given if it fails to find an existing entry: for (refs = submodule_ref_caches; refs; refs = refs-next) if (!strcmp(submodule, refs-name)) return refs; refs = create_ref_cache(submodule); refs-next = submodule_ref_caches; submodule_ref_caches = refs; return refs; In my scenario get_ref_cache will be called 1+ times, each time with a new path. The final few calls will need to search through and compare 1+ entries before realizing that there is no existing entry. This quickly ads up to 100 million+ calls to strcmp(). From what I can understand, the calls to get_ref_cache in this scenario will never do any useful work. Is this correct? If so, would it be possible to bypass it, maybe by calling resolve_gitlink_ref_recursive directly or by using some other way of checking for the presence of a git repo in clean.c:remove_dirs? /Erik -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: git clean performance issues
That looks like the same issue. The use is_git_directory approach sounds good to me, is that the direction you would prefer? I can try to cobble something together although I must warn you I have zero previous experience with this code base so a few iterations will probably be needed. /Erik On Sat, Apr 4, 2015 at 9:55 PM, Jeff King p...@peff.net wrote: On Sat, Apr 04, 2015 at 08:32:45PM +0200, erik elfström wrote: In my scenario get_ref_cache will be called 1+ times, each time with a new path. The final few calls will need to search through and compare 1+ entries before realizing that there is no existing entry. This quickly ads up to 100 million+ calls to strcmp(). From what I can understand, the calls to get_ref_cache in this scenario will never do any useful work. Is this correct? If so, would it be possible to bypass it, maybe by calling resolve_gitlink_ref_recursive directly or by using some other way of checking for the presence of a git repo in clean.c:remove_dirs? I think this is the same issue that was discussed here: http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265585 There is some discussion of a possible fix in that thread. I was hoping that Andreas was going to look further and produce a patch, but I imagine he got busy with other things. Do you want to try picking it up? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html