Re: [PATCH 3/3] clean: improve performance when removing lots of directories
On Tue, Apr 7, 2015 at 3:55 PM, erik elfström wrote: > On Tue, Apr 7, 2015 at 12:10 AM, Eric Sunshine > wrote: >> On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström >> wrote: >>> diff --git a/builtin/clean.c b/builtin/clean.c >>> index 98c103f..e951bd9 100644 >>> --- a/builtin/clean.c >>> +++ b/builtin/clean.c >>> +static int is_git_repository(struct strbuf *path) >>> +{ >>> + int ret = 0; >>> + if (is_git_directory(path->buf)) >>> + ret = 1; >>> + else { >>> + int orig_path_len = path->len; >>> + if (path->buf[orig_path_len - 1] != '/') >> >> Minor: I don't know how others feel about it, but I always find it a >> bit disturbing to see a potential negative array access without a >> safety check that orig_path_len is not 0, either directly in the >> conditional or as a documenting assert(). > > I think I would prefer to accept empty input and return false rather > than assert. What to you think about: > > static int is_git_repository(struct strbuf *path) > { > int ret = 0; > size_t orig_path_len = path->len; > if (orig_path_len == 0) > ret = 0; My concern in raising the issue is that someone reviewing the patch or reading the code later won't necessarily know whether you took the potential negative array access into account and dismissed it as "can't happen", or if you overlooked the possibility altogether. Had there been an explicit check in the code (either assert() or other special handling such as returning 'false'), a comment in the code, or mention in the commit message, then it would have been clear that you took the case into consideration, and I wouldn't have worried about it. As for the this proposed version of is_git_repository(), I don't have strong feelings, and can formulate arguments either way. If it doesn't make sense for is_git_repository() ever to be called with empty input, then assert() may be the better choice for documenting that fact. However, if you foresee some need for allowing empty input, or if you audited the functionality and found that it can already be called with empty input, then returning 'false' makes sense. Use your best judgment. > else if (is_git_directory(path->buf)) > ret = 1; > else { > if (path->buf[orig_path_len - 1] != '/') > strbuf_addch(path, '/'); > strbuf_addstr(path, ".git"); > if (is_git_directory(path->buf)) > ret = 1; > strbuf_setlen(path, orig_path_len); > } > > return ret; > } > > > Also I borrowed this pattern from remove_dirs and it has the same > problem. Should I add something like this as a separate commit? > > diff --git a/builtin/clean.c b/builtin/clean.c > index ccffd8a..88850e3 100644 > --- a/builtin/clean.c > +++ b/builtin/clean.c > @@ -201,6 +202,7 @@ static int remove_dirs(struct strbuf *path, const > char *prefix, int force_flag, > return res; > } > > + assert(original_len > 0 && "expects non-empty path"); > if (path->buf[original_len - 1] != '/') > strbuf_addch(path, '/'); I personally wouldn't mind such a patch. (I'm not sure that the string within the assert() adds much value, and it's a not-much-used idiom within the Git source.) -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] clean: improve performance when removing lots of directories
On Tue, Apr 7, 2015 at 12:10 AM, Eric Sunshine wrote: > On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström wrote: >> Before this change, clean used resolve_gitlink_ref to check for the >> presence of nested git repositories. This had the drawback of creating >> a ref_cache entry for every directory that should potentially be >> cleaned. The linear search through the ref_cache list caused a massive >> performance hit for large number of directories. >> >> Teach clean.c:remove_dirs to use setup.c:is_git_directory >> instead. is_git_directory will actually open HEAD and parse the HEAD >> ref but this implies a nested git repository and should be rare when >> cleaning. >> >> Using is_git_directory should give a more standardized check for what >> is and what isn't a git repository but also gives a slight behavioral >> change. We will now detect and respect bare and empty nested git >> repositories (only init run). Update t7300 to reflect this. >> >> The time to clean an untracked directory containing 10 sub >> directories went from 61s to 1.7s after this change. > > Impressive. > >> Signed-off-by: Erik Elfström >> Helped-by: Jeff King > > It is customary for your sign-off to be last. > > More below... > >> --- >> diff --git a/builtin/clean.c b/builtin/clean.c >> index 98c103f..e951bd9 100644 >> --- a/builtin/clean.c >> +++ b/builtin/clean.c >> @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const >> char *arg, int unset) >> return 0; >> } >> >> +static int is_git_repository(struct strbuf *path) >> +{ >> + int ret = 0; >> + if (is_git_directory(path->buf)) >> + ret = 1; >> + else { >> + int orig_path_len = path->len; >> + if (path->buf[orig_path_len - 1] != '/') > > Minor: I don't know how others feel about it, but I always find it a > bit disturbing to see a potential negative array access without a > safety check that orig_path_len is not 0, either directly in the > conditional or as a documenting assert(). > I think I would prefer to accept empty input and return false rather than assert. What to you think about: static int is_git_repository(struct strbuf *path) { int ret = 0; size_t orig_path_len = path->len; if (orig_path_len == 0) ret = 0; else if (is_git_directory(path->buf)) ret = 1; else { if (path->buf[orig_path_len - 1] != '/') strbuf_addch(path, '/'); strbuf_addstr(path, ".git"); if (is_git_directory(path->buf)) ret = 1; strbuf_setlen(path, orig_path_len); } return ret; } Also I borrowed this pattern from remove_dirs and it has the same problem. Should I add something like this as a separate commit? diff --git a/builtin/clean.c b/builtin/clean.c index ccffd8a..88850e3 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -173,7 +173,8 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, DIR *dir; struct strbuf quoted = STRBUF_INIT; struct dirent *e; - int res = 0, ret = 0, gone = 1, original_len = path->len, len; + int res = 0, ret = 0, gone = 1; + size_t original_len = path->len, len; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; @@ -201,6 +202,7 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, return res; } + assert(original_len > 0 && "expects non-empty path"); if (path->buf[original_len - 1] != '/') strbuf_addch(path, '/'); >> + strbuf_addch(path, '/'); >> + strbuf_addstr(path, ".git"); >> + if (is_git_directory(path->buf)) >> + ret = 1; >> + strbuf_setlen(path, orig_path_len); >> + } >> + >> + return ret; >> +} >> + >> static int remove_dirs(struct strbuf *path, const char *prefix, int >> force_flag, >> int dry_run, int quiet, int *dir_gone) >> { -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] clean: improve performance when removing lots of directories
On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström wrote: > Before this change, clean used resolve_gitlink_ref to check for the > presence of nested git repositories. This had the drawback of creating > a ref_cache entry for every directory that should potentially be > cleaned. The linear search through the ref_cache list caused a massive > performance hit for large number of directories. > > Teach clean.c:remove_dirs to use setup.c:is_git_directory > instead. is_git_directory will actually open HEAD and parse the HEAD > ref but this implies a nested git repository and should be rare when > cleaning. > > Using is_git_directory should give a more standardized check for what > is and what isn't a git repository but also gives a slight behavioral > change. We will now detect and respect bare and empty nested git > repositories (only init run). Update t7300 to reflect this. > > The time to clean an untracked directory containing 10 sub > directories went from 61s to 1.7s after this change. Impressive. > Signed-off-by: Erik Elfström > Helped-by: Jeff King It is customary for your sign-off to be last. More below... > --- > diff --git a/builtin/clean.c b/builtin/clean.c > index 98c103f..e951bd9 100644 > --- a/builtin/clean.c > +++ b/builtin/clean.c > @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const > char *arg, int unset) > return 0; > } > > +static int is_git_repository(struct strbuf *path) > +{ > + int ret = 0; > + if (is_git_directory(path->buf)) > + ret = 1; > + else { > + int orig_path_len = path->len; > + if (path->buf[orig_path_len - 1] != '/') Minor: I don't know how others feel about it, but I always find it a bit disturbing to see a potential negative array access without a safety check that orig_path_len is not 0, either directly in the conditional or as a documenting assert(). > + strbuf_addch(path, '/'); > + strbuf_addstr(path, ".git"); > + if (is_git_directory(path->buf)) > + ret = 1; > + strbuf_setlen(path, orig_path_len); > + } > + > + return ret; > +} > + > static int remove_dirs(struct strbuf *path, const char *prefix, int > force_flag, > int dry_run, int quiet, int *dir_gone) > { -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] clean: improve performance when removing lots of directories
Before this change, clean used resolve_gitlink_ref to check for the presence of nested git repositories. This had the drawback of creating a ref_cache entry for every directory that should potentially be cleaned. The linear search through the ref_cache list caused a massive performance hit for large number of directories. Teach clean.c:remove_dirs to use setup.c:is_git_directory instead. is_git_directory will actually open HEAD and parse the HEAD ref but this implies a nested git repository and should be rare when cleaning. Using is_git_directory should give a more standardized check for what is and what isn't a git repository but also gives a slight behavioral change. We will now detect and respect bare and empty nested git repositories (only init run). Update t7300 to reflect this. The time to clean an untracked directory containing 10 sub directories went from 61s to 1.7s after this change. Signed-off-by: Erik Elfström Helped-by: Jeff King --- builtin/clean.c | 23 +++ t/t7300-clean.sh | 4 ++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/builtin/clean.c b/builtin/clean.c index 98c103f..e951bd9 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -10,7 +10,6 @@ #include "cache.h" #include "dir.h" #include "parse-options.h" -#include "refs.h" #include "string-list.h" #include "quote.h" #include "column.h" @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset) return 0; } +static int is_git_repository(struct strbuf *path) +{ + int ret = 0; + if (is_git_directory(path->buf)) + ret = 1; + else { + int orig_path_len = path->len; + if (path->buf[orig_path_len - 1] != '/') + strbuf_addch(path, '/'); + strbuf_addstr(path, ".git"); + if (is_git_directory(path->buf)) + ret = 1; + strbuf_setlen(path, orig_path_len); + } + + return ret; +} + static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, int dry_run, int quiet, int *dir_gone) { @@ -155,13 +172,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag, struct strbuf quoted = STRBUF_INIT; struct dirent *e; int res = 0, ret = 0, gone = 1, original_len = path->len, len; - unsigned char submodule_head[20]; struct string_list dels = STRING_LIST_INIT_DUP; *dir_gone = 1; - if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && - !resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) { + if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && is_git_repository(path)) { if (!quiet) { quote_path_relative(path->buf, prefix, "ed); printf(dry_run ? _(msg_would_skip_git_dir) : _(msg_skip_git_dir), diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh index cfdf6d4..ada569d 100755 --- a/t/t7300-clean.sh +++ b/t/t7300-clean.sh @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' ' ! test -d bar ' -test_expect_failure 'nested git (only init) should be kept' ' +test_expect_success 'nested git (only init) should be kept' ' rm -fr foo bar && mkdir foo bar && ( @@ -471,7 +471,7 @@ test_expect_failure 'nested git (only init) should be kept' ' ! test -d bar ' -test_expect_failure 'nested git (bare) should be kept' ' +test_expect_success 'nested git (bare) should be kept' ' rm -fr foo bar && mkdir foo bar && ( -- 2.4.0.rc0.37.ga3b75b3 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html