Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-17 Thread Junio C Hamano
Jeff King  writes:

>> Option 1:
>>  Plug the hole in my new is_git_repository function. A quick and dirty
>>  fix that passes the above test would be:
>
> I think that makes sense. It would be nice if you could just call
> read_gitfile, but that function is very anxious to die on error. So the
> prerequisite step would probably be to refactor that into a
> read_gitfile_gently that returns an error code.

I agree.

I was looking at the repository discovery loop to see if it makes
sense to update is-git-directory() to take a gitfile, but I do not
think it is a good idea (typically after is-git-directory() says
"yes", we would append paths e.g. "refs/heads/master" after it to
pass the result to system calls like open()).  I agree that adding
read-gitfile-gently and call it before running is-git-directory
would be a good solution for this change.

> PS Thank you for working on this.

That too.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-17 Thread Jeff King
On Fri, Apr 17, 2015 at 08:15:40PM +0200, erik elfström wrote:

> > Doesn't this implementation get confused by modern submodule
> > checkouts and descend into and clean their working tree, though?
> > Module M with path P would have a directory P in the working tree of
> > the top-level project, and P/.git is a regular file that will fail
> > "is_git_directory()" test but records the location of the real
> > submodule repository i.e. ".git/modules/M" via the "gitdir:"
> > mechanism.
> >
> 
> Yes, there is a problem here. I've added the test below and it fails after
> my change by cleaning sub2 (sub1 is not cleaned). Are there more cases
> here that I should test for?

I wonder about the opposite case, too (finding more repos than we used
to).

It looks like your patches will find bare repositories in the tree,
whereas the current code does not (it only cares about ".git"). AFAIK,
submodules will never exist as bare in the working tree. And I have seen
repositories which embed bare repos as test cases. Admittedly this is
because I work on projects that are related to git itself, but I don't
see a reason to regress this case if the submodule code doesn't get any
benefit.

> Base on the previous discussion of the patch topic I can see 3 options
> for how to fix this:
> 
> Option 1:
>  Plug the hole in my new is_git_repository function. A quick and dirty
>  fix that passes the above test would be:

I think that makes sense. It would be nice if you could just call
read_gitfile, but that function is very anxious to die on error. So the
prerequisite step would probably be to refactor that into a
read_gitfile_gently that returns an error code.

-Peff

PS Thank you for working on this. I have been quiet because I haven't
   had a chance to look over your patches carefully yet, but overall
   they look very promising.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-17 Thread erik elfström
On Wed, Apr 15, 2015 at 7:56 PM, Junio C Hamano  wrote:
> Erik Elfström  writes:
>
>> Before this change, clean used resolve_gitlink_ref to check for the
>> presence of nested git repositories. This had the drawback of creating
>> a ref_cache entry for every directory that should potentially be
>> cleaned. The linear search through the ref_cache list caused a massive
>> performance hit for large number of directories.
>
> I'd prefer to see the "current state" described in the current
> tense, e.g.
>
> "git clean" uses resolve_gitlink_ref() to check for the presence of
> nested git repositories, but it has the drawback of creating a
> ref_cache entry for every directory that should potentially be
> cleaned. The linear search through the ref_cache list causes a
> massive performance hit for large number of directories.
>

Yes, that reads better.

>> Teach clean.c:remove_dirs to use setup.c:is_git_directory
>> instead. is_git_directory will actually open HEAD and parse the HEAD
>> ref but this implies a nested git repository and should be rare when
>> cleaning.
>
> I am not sure what you wanted to say in this paragraph.  What does
> it being rare have to do with it?  Even if it is not rare (i.e. the
> top-level project you are working with has many submodules checked
> out without using the more recent "a file .git pointing into
> .git/modules/ via 'gitdir: $overThere'" mechanism), if we found a
> nested git repository, we treat it as special and exclude it from
> cleaning it out, which is a good thing, no?
>

I was trying to motivate that the performance of is_git_directory is not a
problem for us even though it opens a file and parses it. I see now when I
read it again that this is not very clear.

> Doesn't this implementation get confused by modern submodule
> checkouts and descend into and clean their working tree, though?
> Module M with path P would have a directory P in the working tree of
> the top-level project, and P/.git is a regular file that will fail
> "is_git_directory()" test but records the location of the real
> submodule repository i.e. ".git/modules/M" via the "gitdir:"
> mechanism.
>

Yes, there is a problem here. I've added the test below and it fails after
my change by cleaning sub2 (sub1 is not cleaned). Are there more cases
here that I should test for?

+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 &&
+   mkdir -p repo to_clean &&
+   (
+   cd repo &&
+   git init &&
+   >hello.world
+   git add . &&
+   git commit -a -m nested
+   ) &&
+   git submodule add ./repo/.git sub1 &&
+   git commit -m "sub1" &&
+   git branch before_sub2 &&
+   git submodule add ./repo/.git sub2 &&
+   git commit -m "sub2" &&
+   git checkout before_sub2 &&
+   >to_clean/should_clean.this &&
+   git clean -f -d &&
+   test_path_is_file repo/.git/index &&
+   test_path_is_file repo/hello.world &&
+   test_path_is_file sub1/.git &&
+   test_path_is_file sub1/hello.world &&
+   test_path_is_file sub2/.git &&
+   test_path_is_file sub2/hello.world &&
+   test_path_is_missing to_clean
+'

Base on the previous discussion of the patch topic I can see 3 options
for how to fix this:

Option 1:
 Plug the hole in my new is_git_repository function. A quick and dirty
 fix that passes the above test would be:

diff --git a/builtin/clean.c b/builtin/clean.c
index b679913..4f2fe95 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -153,14 +153,21 @@ static int is_git_repository(struct strbuf *path)
if (is_git_directory(path->buf))
ret = 1;
else {
-   size_t orig_path_len = path->len;
-   assert(orig_path_len != 0);
-   if (path->buf[orig_path_len - 1] != '/')
-   strbuf_addch(path, '/');
-   strbuf_addstr(path, ".git");
-   if (is_git_directory(path->buf))
-   ret = 1;
-   strbuf_setlen(path, orig_path_len);
+   struct stat st;
+   const char *submodule_git_dir =
git_path_submodule(path->buf, "");
+   lstat(submodule_git_dir, &st);
+   if (S_ISDIR(st.st_mode))
+   ret = 1;
+   else {
+   size_t orig_path_len = path->len;
+   assert(orig_path_len != 0);
+   if (path->buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, ".git");
+   if (is_git_directory(path->buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   }
}

return ret;

There are probably more elegant solutions available here, suggestions
welcome.

Option 2:
 Go with the current solution of using resolve_git

Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-15 Thread Junio C Hamano
Erik Elfström  writes:

> Before this change, clean used resolve_gitlink_ref to check for the
> presence of nested git repositories. This had the drawback of creating
> a ref_cache entry for every directory that should potentially be
> cleaned. The linear search through the ref_cache list caused a massive
> performance hit for large number of directories.

I'd prefer to see the "current state" described in the current
tense, e.g.

"git clean" uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a
massive performance hit for large number of directories.

> Teach clean.c:remove_dirs to use setup.c:is_git_directory
> instead. is_git_directory will actually open HEAD and parse the HEAD
> ref but this implies a nested git repository and should be rare when
> cleaning.

I am not sure what you wanted to say in this paragraph.  What does
it being rare have to do with it?  Even if it is not rare (i.e. the
top-level project you are working with has many submodules checked
out without using the more recent "a file .git pointing into
.git/modules/ via 'gitdir: $overThere'" mechanism), if we found a
nested git repository, we treat it as special and exclude it from
cleaning it out, which is a good thing, no?

Doesn't this implementation get confused by modern submodule
checkouts and descend into and clean their working tree, though?
Module M with path P would have a directory P in the working tree of
the top-level project, and P/.git is a regular file that will fail
"is_git_directory()" test but records the location of the real
submodule repository i.e. ".git/modules/M" via the "gitdir:"
mechanism.

> Using is_git_directory should give a more standardized check for what
> is and what isn't a git repository but also gives a slight behavioral
> change. We will now detect and respect bare and empty nested git
> repositories (only init run). Update t7300 to reflect this.
>
> The time to clean an untracked directory containing 10 sub
> directories went from 61s to 1.7s after this change.
>
> Helped-by: Jeff King 
> Signed-off-by: Erik Elfström 
> ---
>  builtin/clean.c  | 24 
>  t/t7300-clean.sh |  4 ++--
>  2 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/clean.c b/builtin/clean.c
> index 98c103f..b679913 100644
> --- a/builtin/clean.c
> +++ b/builtin/clean.c
> @@ -10,7 +10,6 @@
>  #include "cache.h"
>  #include "dir.h"
>  #include "parse-options.h"
> -#include "refs.h"
>  #include "string-list.h"
>  #include "quote.h"
>  #include "column.h"
> @@ -148,6 +147,25 @@ static int exclude_cb(const struct option *opt, const 
> char *arg, int unset)
>   return 0;
>  }
>  
> +static int is_git_repository(struct strbuf *path)
> +{
> + int ret = 0;
> + if (is_git_directory(path->buf))
> + ret = 1;
> + else {
> + size_t orig_path_len = path->len;
> + assert(orig_path_len != 0);
> + if (path->buf[orig_path_len - 1] != '/')
> + strbuf_addch(path, '/');
> + strbuf_addstr(path, ".git");
> + if (is_git_directory(path->buf))
> + ret = 1;
> + strbuf_setlen(path, orig_path_len);
> + }
> +
> + return ret;
> +}
> +
>  static int remove_dirs(struct strbuf *path, const char *prefix, int 
> force_flag,
>   int dry_run, int quiet, int *dir_gone)
>  {
> @@ -155,13 +173,11 @@ static int remove_dirs(struct strbuf *path, const char 
> *prefix, int force_flag,
>   struct strbuf quoted = STRBUF_INIT;
>   struct dirent *e;
>   int res = 0, ret = 0, gone = 1, original_len = path->len, len;
> - unsigned char submodule_head[20];
>   struct string_list dels = STRING_LIST_INIT_DUP;
>  
>   *dir_gone = 1;
>  
> - if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
> - !resolve_gitlink_ref(path->buf, "HEAD", 
> submodule_head)) {
> + if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && 
> is_git_repository(path)) {
>   if (!quiet) {
>   quote_path_relative(path->buf, prefix, "ed);
>   printf(dry_run ?  _(msg_would_skip_git_dir) : 
> _(msg_skip_git_dir),
> diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
> index 58e6b4a..da294fe 100755
> --- a/t/t7300-clean.sh
> +++ b/t/t7300-clean.sh
> @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
>   ! test -d bar
>  '
>  
> -test_expect_failure 'nested git (only init) should be kept' '
> +test_expect_success 'nested git (only init) should be kept' '
>   rm -fr foo bar &&
>   git init foo &&
>   mkdir bar &&
> @@ -465,7 +465,7 @@ test_expect_failure 'nested git (only init) should be 
> kept' '
>   test_path_is_missing bar
>  '
>  
> -test_expect_failure 'nested git (bare) should 

[PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-11 Thread Erik Elfström
Before this change, clean used resolve_gitlink_ref to check for the
presence of nested git repositories. This had the drawback of creating
a ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list caused a massive
performance hit for large number of directories.

Teach clean.c:remove_dirs to use setup.c:is_git_directory
instead. is_git_directory will actually open HEAD and parse the HEAD
ref but this implies a nested git repository and should be rare when
cleaning.

Using is_git_directory should give a more standardized check for what
is and what isn't a git repository but also gives a slight behavioral
change. We will now detect and respect bare and empty nested git
repositories (only init run). Update t7300 to reflect this.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King 
Signed-off-by: Erik Elfström 
---
 builtin/clean.c  | 24 
 t/t7300-clean.sh |  4 ++--
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..b679913 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include "cache.h"
 #include "dir.h"
 #include "parse-options.h"
-#include "refs.h"
 #include "string-list.h"
 #include "quote.h"
 #include "column.h"
@@ -148,6 +147,25 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   if (is_git_directory(path->buf))
+   ret = 1;
+   else {
+   size_t orig_path_len = path->len;
+   assert(orig_path_len != 0);
+   if (path->buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, ".git");
+   if (is_git_directory(path->buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   }
+
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +173,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path->len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
-   !resolve_gitlink_ref(path->buf, "HEAD", 
submodule_head)) {
+   if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && 
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path->buf, prefix, "ed);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 58e6b4a..da294fe 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'nested git (only init) should be kept' '
+test_expect_success 'nested git (only init) should be kept' '
rm -fr foo bar &&
git init foo &&
mkdir bar &&
@@ -465,7 +465,7 @@ test_expect_failure 'nested git (only init) should be kept' 
'
test_path_is_missing bar
 '
 
-test_expect_failure 'nested git (bare) should be kept' '
+test_expect_success 'nested git (bare) should be kept' '
rm -fr foo bar &&
git init --bare foo &&
mkdir bar &&
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html