Re: Git 2.8.1 fails test 32 of t7300-clean.sh, breaks profile build

2016-05-03 Thread erik elfström
Thanks for fixing the missing SANITY prerequisite Stefan.

As for the error handling logic in setup.c: is_nonbare_repository_dir
(was clean.c: is_git_repository) my reasoning is as follows:

READ_GITFILE_ERR_STAT_FAILED
READ_GITFILE_ERR_NOT_A_FILE:

When checking random paths for .git files these are the common error
modes, file is not there or it is a directory. This should not be
interpreted as a valid .git file.


READ_GITFILE_ERR_OPEN_FAILED
READ_GITFILE_ERR_READ_FAILED:

Here we found a .git file but could not open and read it to verify
that it is valid. Treating it as valid is the safest option for clean.
For resolve_gitlink_ref I think it maybe leads to the creation of a
redundant ref cache entries but I don't think this is a problem unless
someone has a huge amount of unreadable .git files lying around.


READ_GITFILE_ERR_TOO_LARGE:

File is absurdly large (1MB), very unlikely to be a valid .git file.


READ_GITFILE_ERR_INVALID_FORMAT
READ_GITFILE_ERR_NO_PATH:

File is malformed in some way, either the "gitdir:" prefix is missing
or the path is missing. Could theoretically be a corrupted .git file
but seems unlikely.


READ_GITFILE_ERR_NOT_A_REPO:

This is maybe a bit more suspicious. Here we have found a .git file,
it has the format "gitdir: some/path" but
is_git_directory("some/path") returned false, meaning that "some/path"
does not fulfill:

/*
 * Test if it looks like we're at a git directory.
 * We want to see:
 *
 *  - either an objects/ directory _or_ the proper
 *GIT_OBJECT_DIRECTORY environment variable
 *  - a refs/ directory
 *  - either a HEAD symlink or a HEAD file that is formatted as
 *a proper "ref:", or a regular file HEAD that has a properly
 *formatted sha1 object name.
*/

Technically we don't have a valid .git file here but we have something
that really tries to be. I guess it is debatable what the correct
conservative choice is here and if it is the same for all callers.

/Erik
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t7300: fix broken && chains

2015-09-01 Thread erik elfström
On Mon, Aug 31, 2015 at 8:54 PM, Jeff King <p...@peff.net> wrote:
> On Sun, Aug 30, 2015 at 11:18:09AM +0200, Erik Elfström wrote:
>
> Unfortunately CHAIN_LINT cannot reach inside a nested subshell. I cannot
> think of a way to make it do so, and besides, that is also the way to
> override the &&-chain precedence. So I think it is not really possible
> to get better coverage here. And such cases as these are not really very
> interesting (e.g., here we exclude only minor minor setup steps from the
> &&-chain).
>
> -Peff

I actually hacked together an ugly script to check for more broken
chains. It's not working very well and I haven't checked the output
thoroughly but I did see quite a few broken chains. Many are of the
(mostly harmless) form:

 (
echo "100644 $o5 0a"
echo "100644 $o0 0c"
echo "16 $c1 0d"
) >expected &&

I'd estimate that there are hundreds of these (see t3030 for
examples). I'm not sure if you care about these? As you say they are
not really very interesting cases.

/Erik
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t7300: fix broken && chains

2015-09-01 Thread erik elfström
On Mon, Aug 31, 2015 at 6:58 PM, Junio C Hamano  wrote:
>
> Many of the constructs we see here shows clearly that this is an
> ancient part of the codebase ;-), as we would be using the one
> parameter form of "git init" and more test_* helpers if we were
> writing this script in today's Git codebase.  It may have been
> better if you didn't do "while we are here" and corrected only the
> &&-chain in patch 1/2 and then updated the style of the tests to
> take advantage of the newer facilities recent test-lib has in a
> separate patch 2/2, but this will do at least for now.
>
> Will queue.
>
> Thanks.


I can do a re-roll with the chain fix in the first patch and a more
thorough modernization of t7300 in separate patches if you'd like? I
almost went this way for v1 but decided to limit the scope for the
first version.

(Forgot to include the list in my first reply, sorry... And forgot to
turn off HTML in the second... sigh, sorry for the spam)

/Erik
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] t7300: fix broken chains

2015-08-30 Thread Erik Elfström
While we are here, remove some boilerplate by using test_commit.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 27557d6..86ceb38 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -432,9 +432,7 @@ test_expect_success 'nested git work tree' '
(
cd foo 
git init 
-   hello.world
-   git add . 
-   git commit -a -m nested
+   test_commit nested hello.world
) 
(
cd bar 
@@ -443,9 +441,7 @@ test_expect_success 'nested git work tree' '
(
cd baz/boo 
git init 
-   deeper.world
-   git add . 
-   git commit -a -m deeply.nested
+   test_commit deeply.nested deeper.world
) 
git clean -f -d 
test -f foo/.git/index 
@@ -601,9 +597,7 @@ test_expect_success 'force removal of nested git work tree' 
'
(
cd foo 
git init 
-   hello.world
-   git add . 
-   git commit -a -m nested
+   test_commit nested hello.world
) 
(
cd bar 
@@ -612,9 +606,7 @@ test_expect_success 'force removal of nested git work tree' 
'
(
cd baz/boo 
git init 
-   deeper.world
-   git add . 
-   git commit -a -m deeply.nested
+   test_commit deeply.nested deeper.world
) 
git clean -f -f -d 
! test -d foo 
-- 
2.4.4.410.gd6567e3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/5] setup: add gentle version of read_gitfile

2015-06-26 Thread erik elfström
On Fri, Jun 26, 2015 at 11:03 AM, Jeff King p...@peff.net wrote:
 I happened to be playing with clang's static analyzer today, and it
 noticed that there is a subtle use-after-free here.

Doh, sorry about that. Thanks for fixing my bug.

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 3/5] t7300: add tests to document behavior of clean and nested git

2015-06-15 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 142 +++
 1 file changed, 142 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..fbfdf2d 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,148 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'should clean things that almost look like git but are 
not' '
+   rm -fr almost_git almost_bare_git almost_submodule 
+   mkdir -p almost_git/.git/objects 
+   mkdir -p almost_git/.git/refs 
+   cat almost_git/.git/HEAD -\EOF 
+   garbage
+   EOF
+   cp -r almost_git/.git/ almost_bare_git 
+   mkdir almost_submodule/ 
+   cat almost_submodule/.git -\EOF 
+   garbage
+   EOF
+   test_when_finished rm -rf almost_* 
+   ## This will fail due to die(Invalid gitfile format: %s, path); in
+   ## setup.c:read_gitfile.
+   git clean -f -d 
+   test_path_is_missing almost_git 
+   test_path_is_missing almost_bare_git 
+   test_path_is_missing almost_submodule
+'
+
+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 
+   mkdir repo to_clean 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git submodule add ./repo/.git sub1 
+   git commit -m sub1 
+   git branch before_sub2 
+   git submodule add ./repo/.git sub2 
+   git commit -m sub2 
+   git checkout before_sub2 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file repo/.git/index 
+   test_path_is_file repo/hello.world 
+   test_path_is_file sub1/.git 
+   test_path_is_file sub1/hello.world 
+   test_path_is_file sub2/.git 
+   test_path_is_file sub2/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'should avoid cleaning possible submodules' '
+   rm -fr to_clean possible_sub1 
+   mkdir to_clean possible_sub1 
+   test_when_finished rm -rf possible_sub* 
+   echo gitdir: foo possible_sub1/.git 
+   possible_sub1/hello.world 
+   chmod 0 possible_sub1/.git 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file possible_sub1/.git 
+   test_path_is_file possible_sub1/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'nested (empty) git should be kept' '
+   rm -fr empty_repo to_clean 
+   git init empty_repo 
+   mkdir to_clean 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file empty_repo/.git/HEAD 
+   test_path_is_missing to_clean
+'
+
+test_expect_success 'nested bare repositories should be cleaned' '
+   rm -fr bare1 bare2 subdir 
+   git init --bare bare1 
+   git clone --local --bare . bare2 
+   mkdir subdir 
+   cp -r bare2 subdir/bare3 
+   git clean -f -d 
+   test_path_is_missing bare1 
+   test_path_is_missing bare2 
+   test_path_is_missing subdir
+'
+
+test_expect_success 'nested (empty) bare repositories should be cleaned even 
when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git init --bare strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_failure 'nested (non-empty) bare repositories should be cleaned 
even when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git clone --local --bare . strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr repo 
+   mkdir repo 
+   (
+   cd repo 
+   git init 
+   mkdir -p bar/baz 
+   test_commit msg bar/baz/hello.world
+   ) 
+   git clean -f -d repo/bar/baz 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/bar/ 
+   test_path_is_missing repo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr repo 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/.git/refs 
+   test_path_is_dir repo/.git/objects 
+   test_path_is_dir untracked/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr repo untracked 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git/ 
+   test_path_is_dir repo/.git 
+   test_dir_is_empty repo/.git 
+   test_path_is_dir untracked/
+'
+
 test_expect_success 'force removal

[PATCH v8 4/5] p7300: add performance tests for clean

2015-06-15 Thread Erik Elfström
The tests are run in dry-run mode to avoid having to restore the test
directories for each timed iteration. Using dry-run is an acceptable
compromise since we are mostly interested in the initial computation
of what to clean and not so much in the cleaning it self.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 31 +++
 1 file changed, 31 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..ec94cdd
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 10_sub_dirs clean_test_dir 
+   for i in $(test_seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i || return $?
+   done 
+   for i in $(test_seq 1 200)
+   do
+   cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $?
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   git clean -n -q -f -d 10_sub_dirs/
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   git clean -n -q -f -f -d 10_sub_dirs/
+'
+
+test_done
-- 
2.4.3.373.gc496bfb.dirty

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 0/5] Improving performance of git clean

2015-06-15 Thread Erik Elfström
This is v8 of this series. v7 can be found here:

http://thread.gmane.org/gmane.comp.version-control.git/271220

Changes in v8:
 * change name and type of file size limit variable in read_git_file_gently


Erik Elfström (5):
  setup: add gentle version of read_gitfile
  setup: sanity check file size in read_gitfile_gently
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   |  30 +--
 cache.h   |  12 -
 setup.c   |  91 +---
 t/perf/p7300-clean.sh |  31 +++
 t/t7300-clean.sh  | 140 ++
 5 files changed, 280 insertions(+), 24 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.3.373.gc496bfb.dirty

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 2/5] setup: sanity check file size in read_gitfile_gently

2015-06-15 Thread Erik Elfström
read_gitfile_gently will allocate a buffer to fit the entire file that
should be read. Add a sanity check of the file size before opening to
avoid allocating a potentially huge amount of memory if we come across
a large file that someone happened to name .git. The limit is set to
a sufficiently unreasonable size that should never be exceeded by a
genuine .git file.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h | 1 +
 setup.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/cache.h b/cache.h
index 25578cb..858d9b3 100644
--- a/cache.h
+++ b/cache.h
@@ -454,6 +454,7 @@ extern const char *get_git_work_tree(void);
 #define READ_GITFILE_ERR_INVALID_FORMAT 5
 #define READ_GITFILE_ERR_NO_PATH 6
 #define READ_GITFILE_ERR_NOT_A_REPO 7
+#define READ_GITFILE_ERR_TOO_LARGE 8
 extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
 #define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
diff --git a/setup.c b/setup.c
index 4748b63..a03ca94 100644
--- a/setup.c
+++ b/setup.c
@@ -414,6 +414,7 @@ static void update_linked_gitdir(const char *gitfile, const 
char *gitdir)
  */
 const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
+   const int max_file_size = 1  20;  /* 1MB */
int error_code = 0;
char *buf = NULL;
char *dir = NULL;
@@ -430,6 +431,10 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
error_code = READ_GITFILE_ERR_NOT_A_FILE;
goto cleanup_return;
}
+   if (st.st_size  max_file_size) {
+   error_code = READ_GITFILE_ERR_TOO_LARGE;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
if (fd  0) {
error_code = READ_GITFILE_ERR_OPEN_FAILED;
@@ -489,6 +494,8 @@ cleanup_return:
return NULL;
case READ_GITFILE_ERR_OPEN_FAILED:
die_errno(Error opening '%s', path);
+   case READ_GITFILE_ERR_TOO_LARGE:
+   die(Too large to be a .git file: '%s', path);
case READ_GITFILE_ERR_READ_FAILED:
die(Error reading %s, path);
case READ_GITFILE_ERR_INVALID_FORMAT:
-- 
2.4.3.373.gc496bfb.dirty

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 1/5] setup: add gentle version of read_gitfile

2015-06-15 Thread Erik Elfström
read_gitfile will die on most error cases. This makes it unsuitable
for speculative calls. Extract the core logic and provide a gentle
version that returns NULL on failure.

The first usecase of the new gentle version will be to probe for
submodules during git clean.

Helped-by: Junio C Hamano gits...@pobox.com
Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h | 11 -
 setup.c | 84 ++---
 2 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/cache.h b/cache.h
index 571c98f..25578cb 100644
--- a/cache.h
+++ b/cache.h
@@ -446,7 +446,16 @@ extern int get_common_dir(struct strbuf *sb, const char 
*gitdir);
 extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
-extern const char *read_gitfile(const char *path);
+
+#define READ_GITFILE_ERR_STAT_FAILED 1
+#define READ_GITFILE_ERR_NOT_A_FILE 2
+#define READ_GITFILE_ERR_OPEN_FAILED 3
+#define READ_GITFILE_ERR_READ_FAILED 4
+#define READ_GITFILE_ERR_INVALID_FORMAT 5
+#define READ_GITFILE_ERR_NO_PATH 6
+#define READ_GITFILE_ERR_NOT_A_REPO 7
+extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
+#define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index 863ddfd..4748b63 100644
--- a/setup.c
+++ b/setup.c
@@ -406,35 +406,53 @@ static void update_linked_gitdir(const char *gitfile, 
const char *gitdir)
 /*
  * Try to read the location of the git directory from the .git file,
  * return path to git directory if found.
+ *
+ * On failure, if return_error_code is not NULL, return_error_code
+ * will be set to an error code and NULL will be returned. If
+ * return_error_code is NULL the function will die instead (for most
+ * cases).
  */
-const char *read_gitfile(const char *path)
+const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
-   char *buf;
-   char *dir;
+   int error_code = 0;
+   char *buf = NULL;
+   char *dir = NULL;
const char *slash;
struct stat st;
int fd;
ssize_t len;
 
-   if (stat(path, st))
-   return NULL;
-   if (!S_ISREG(st.st_mode))
-   return NULL;
+   if (stat(path, st)) {
+   error_code = READ_GITFILE_ERR_STAT_FAILED;
+   goto cleanup_return;
+   }
+   if (!S_ISREG(st.st_mode)) {
+   error_code = READ_GITFILE_ERR_NOT_A_FILE;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
-   if (fd  0)
-   die_errno(Error opening '%s', path);
+   if (fd  0) {
+   error_code = READ_GITFILE_ERR_OPEN_FAILED;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
-   if (len != st.st_size)
-   die(Error reading %s, path);
+   if (len != st.st_size) {
+   error_code = READ_GITFILE_ERR_READ_FAILED;
+   goto cleanup_return;
+   }
buf[len] = '\0';
-   if (!starts_with(buf, gitdir: ))
-   die(Invalid gitfile format: %s, path);
+   if (!starts_with(buf, gitdir: )) {
+   error_code = READ_GITFILE_ERR_INVALID_FORMAT;
+   goto cleanup_return;
+   }
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
-   if (len  9)
-   die(No path in gitfile: %s, path);
+   if (len  9) {
+   error_code = READ_GITFILE_ERR_NO_PATH;
+   goto cleanup_return;
+   }
buf[len] = '\0';
dir = buf + 8;
 
@@ -448,14 +466,42 @@ const char *read_gitfile(const char *path)
free(buf);
buf = dir;
}
-
-   if (!is_git_directory(dir))
-   die(Not a git repository: %s, dir);
-
+   if (!is_git_directory(dir)) {
+   error_code = READ_GITFILE_ERR_NOT_A_REPO;
+   goto cleanup_return;
+   }
update_linked_gitdir(path, dir);
path = real_path(dir);
 
+cleanup_return:
free(buf);
+
+   if (return_error_code)
+   *return_error_code = error_code;
+
+   if (error_code) {
+   if (return_error_code)
+   return NULL;
+
+   switch (error_code) {
+   case READ_GITFILE_ERR_STAT_FAILED:
+   case READ_GITFILE_ERR_NOT_A_FILE:
+   return NULL;
+   case READ_GITFILE_ERR_OPEN_FAILED:
+   die_errno(Error opening '%s', path);
+   case READ_GITFILE_ERR_READ_FAILED:
+   die(Error reading %s, path);
+   case READ_GITFILE_ERR_INVALID_FORMAT

[PATCH v7 1/5] setup: add gentle version of read_gitfile

2015-06-09 Thread Erik Elfström
read_gitfile will die on most error cases. This makes it unsuitable
for speculative calls. Extract the core logic and provide a gentle
version that returns NULL on failure.

The first usecase of the new gentle version will be to probe for
submodules during git clean.

Helped-by: Junio C Hamano gits...@pobox.com
Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h | 11 -
 setup.c | 84 ++---
 2 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/cache.h b/cache.h
index 571c98f..25578cb 100644
--- a/cache.h
+++ b/cache.h
@@ -446,7 +446,16 @@ extern int get_common_dir(struct strbuf *sb, const char 
*gitdir);
 extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
-extern const char *read_gitfile(const char *path);
+
+#define READ_GITFILE_ERR_STAT_FAILED 1
+#define READ_GITFILE_ERR_NOT_A_FILE 2
+#define READ_GITFILE_ERR_OPEN_FAILED 3
+#define READ_GITFILE_ERR_READ_FAILED 4
+#define READ_GITFILE_ERR_INVALID_FORMAT 5
+#define READ_GITFILE_ERR_NO_PATH 6
+#define READ_GITFILE_ERR_NOT_A_REPO 7
+extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
+#define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index 863ddfd..4748b63 100644
--- a/setup.c
+++ b/setup.c
@@ -406,35 +406,53 @@ static void update_linked_gitdir(const char *gitfile, 
const char *gitdir)
 /*
  * Try to read the location of the git directory from the .git file,
  * return path to git directory if found.
+ *
+ * On failure, if return_error_code is not NULL, return_error_code
+ * will be set to an error code and NULL will be returned. If
+ * return_error_code is NULL the function will die instead (for most
+ * cases).
  */
-const char *read_gitfile(const char *path)
+const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
-   char *buf;
-   char *dir;
+   int error_code = 0;
+   char *buf = NULL;
+   char *dir = NULL;
const char *slash;
struct stat st;
int fd;
ssize_t len;
 
-   if (stat(path, st))
-   return NULL;
-   if (!S_ISREG(st.st_mode))
-   return NULL;
+   if (stat(path, st)) {
+   error_code = READ_GITFILE_ERR_STAT_FAILED;
+   goto cleanup_return;
+   }
+   if (!S_ISREG(st.st_mode)) {
+   error_code = READ_GITFILE_ERR_NOT_A_FILE;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
-   if (fd  0)
-   die_errno(Error opening '%s', path);
+   if (fd  0) {
+   error_code = READ_GITFILE_ERR_OPEN_FAILED;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
-   if (len != st.st_size)
-   die(Error reading %s, path);
+   if (len != st.st_size) {
+   error_code = READ_GITFILE_ERR_READ_FAILED;
+   goto cleanup_return;
+   }
buf[len] = '\0';
-   if (!starts_with(buf, gitdir: ))
-   die(Invalid gitfile format: %s, path);
+   if (!starts_with(buf, gitdir: )) {
+   error_code = READ_GITFILE_ERR_INVALID_FORMAT;
+   goto cleanup_return;
+   }
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
-   if (len  9)
-   die(No path in gitfile: %s, path);
+   if (len  9) {
+   error_code = READ_GITFILE_ERR_NO_PATH;
+   goto cleanup_return;
+   }
buf[len] = '\0';
dir = buf + 8;
 
@@ -448,14 +466,42 @@ const char *read_gitfile(const char *path)
free(buf);
buf = dir;
}
-
-   if (!is_git_directory(dir))
-   die(Not a git repository: %s, dir);
-
+   if (!is_git_directory(dir)) {
+   error_code = READ_GITFILE_ERR_NOT_A_REPO;
+   goto cleanup_return;
+   }
update_linked_gitdir(path, dir);
path = real_path(dir);
 
+cleanup_return:
free(buf);
+
+   if (return_error_code)
+   *return_error_code = error_code;
+
+   if (error_code) {
+   if (return_error_code)
+   return NULL;
+
+   switch (error_code) {
+   case READ_GITFILE_ERR_STAT_FAILED:
+   case READ_GITFILE_ERR_NOT_A_FILE:
+   return NULL;
+   case READ_GITFILE_ERR_OPEN_FAILED:
+   die_errno(Error opening '%s', path);
+   case READ_GITFILE_ERR_READ_FAILED:
+   die(Error reading %s, path);
+   case READ_GITFILE_ERR_INVALID_FORMAT

[PATCH v7 3/5] t7300: add tests to document behavior of clean and nested git

2015-06-09 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 142 +++
 1 file changed, 142 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..fbfdf2d 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,148 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'should clean things that almost look like git but are 
not' '
+   rm -fr almost_git almost_bare_git almost_submodule 
+   mkdir -p almost_git/.git/objects 
+   mkdir -p almost_git/.git/refs 
+   cat almost_git/.git/HEAD -\EOF 
+   garbage
+   EOF
+   cp -r almost_git/.git/ almost_bare_git 
+   mkdir almost_submodule/ 
+   cat almost_submodule/.git -\EOF 
+   garbage
+   EOF
+   test_when_finished rm -rf almost_* 
+   ## This will fail due to die(Invalid gitfile format: %s, path); in
+   ## setup.c:read_gitfile.
+   git clean -f -d 
+   test_path_is_missing almost_git 
+   test_path_is_missing almost_bare_git 
+   test_path_is_missing almost_submodule
+'
+
+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 
+   mkdir repo to_clean 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git submodule add ./repo/.git sub1 
+   git commit -m sub1 
+   git branch before_sub2 
+   git submodule add ./repo/.git sub2 
+   git commit -m sub2 
+   git checkout before_sub2 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file repo/.git/index 
+   test_path_is_file repo/hello.world 
+   test_path_is_file sub1/.git 
+   test_path_is_file sub1/hello.world 
+   test_path_is_file sub2/.git 
+   test_path_is_file sub2/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'should avoid cleaning possible submodules' '
+   rm -fr to_clean possible_sub1 
+   mkdir to_clean possible_sub1 
+   test_when_finished rm -rf possible_sub* 
+   echo gitdir: foo possible_sub1/.git 
+   possible_sub1/hello.world 
+   chmod 0 possible_sub1/.git 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file possible_sub1/.git 
+   test_path_is_file possible_sub1/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'nested (empty) git should be kept' '
+   rm -fr empty_repo to_clean 
+   git init empty_repo 
+   mkdir to_clean 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file empty_repo/.git/HEAD 
+   test_path_is_missing to_clean
+'
+
+test_expect_success 'nested bare repositories should be cleaned' '
+   rm -fr bare1 bare2 subdir 
+   git init --bare bare1 
+   git clone --local --bare . bare2 
+   mkdir subdir 
+   cp -r bare2 subdir/bare3 
+   git clean -f -d 
+   test_path_is_missing bare1 
+   test_path_is_missing bare2 
+   test_path_is_missing subdir
+'
+
+test_expect_success 'nested (empty) bare repositories should be cleaned even 
when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git init --bare strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_failure 'nested (non-empty) bare repositories should be cleaned 
even when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git clone --local --bare . strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr repo 
+   mkdir repo 
+   (
+   cd repo 
+   git init 
+   mkdir -p bar/baz 
+   test_commit msg bar/baz/hello.world
+   ) 
+   git clean -f -d repo/bar/baz 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/bar/ 
+   test_path_is_missing repo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr repo 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/.git/refs 
+   test_path_is_dir repo/.git/objects 
+   test_path_is_dir untracked/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr repo untracked 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git/ 
+   test_path_is_dir repo/.git 
+   test_dir_is_empty repo/.git 
+   test_path_is_dir untracked/
+'
+
 test_expect_success 'force removal

[PATCH v7 2/5] setup: sanity check file size in read_gitfile_gently

2015-06-09 Thread Erik Elfström
read_gitfile_gently will allocate a buffer to fit the entire file that
should be read. Add a sanity check of the file size before opening to
avoid allocating a potentially huge amount of memory if we come across
a large file that someone happened to name .git. The limit is set to
a sufficiently unreasonable size that should never be exceeded by a
genuine .git file.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h | 1 +
 setup.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/cache.h b/cache.h
index 25578cb..858d9b3 100644
--- a/cache.h
+++ b/cache.h
@@ -454,6 +454,7 @@ extern const char *get_git_work_tree(void);
 #define READ_GITFILE_ERR_INVALID_FORMAT 5
 #define READ_GITFILE_ERR_NO_PATH 6
 #define READ_GITFILE_ERR_NOT_A_REPO 7
+#define READ_GITFILE_ERR_TOO_LARGE 8
 extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
 #define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
diff --git a/setup.c b/setup.c
index 4748b63..e76955e 100644
--- a/setup.c
+++ b/setup.c
@@ -414,6 +414,7 @@ static void update_linked_gitdir(const char *gitfile, const 
char *gitdir)
  */
 const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
+   static const int one_MB = 1  20;
int error_code = 0;
char *buf = NULL;
char *dir = NULL;
@@ -430,6 +431,10 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
error_code = READ_GITFILE_ERR_NOT_A_FILE;
goto cleanup_return;
}
+   if (st.st_size  one_MB) {
+   error_code = READ_GITFILE_ERR_TOO_LARGE;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
if (fd  0) {
error_code = READ_GITFILE_ERR_OPEN_FAILED;
@@ -489,6 +494,8 @@ cleanup_return:
return NULL;
case READ_GITFILE_ERR_OPEN_FAILED:
die_errno(Error opening '%s', path);
+   case READ_GITFILE_ERR_TOO_LARGE:
+   die(Too large to be a .git file: '%s', path);
case READ_GITFILE_ERR_READ_FAILED:
die(Error reading %s, path);
case READ_GITFILE_ERR_INVALID_FORMAT:
-- 
2.4.3.373.gc496bfb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 4/5] p7300: add performance tests for clean

2015-06-09 Thread Erik Elfström
The tests are run in dry-run mode to avoid having to restore the test
directories for each timed iteration. Using dry-run is an acceptable
compromise since we are mostly interested in the initial computation
of what to clean and not so much in the cleaning it self.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 31 +++
 1 file changed, 31 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..ec94cdd
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 10_sub_dirs clean_test_dir 
+   for i in $(test_seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i || return $?
+   done 
+   for i in $(test_seq 1 200)
+   do
+   cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $?
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   git clean -n -q -f -d 10_sub_dirs/
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   git clean -n -q -f -f -d 10_sub_dirs/
+'
+
+test_done
-- 
2.4.3.373.gc496bfb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 0/5] Improving performance of git clean

2015-06-09 Thread Erik Elfström
Here is a reroll of this series (after much delay).

Changes in v7:
* changed order of file size and file open error check in read_gitfile
* resolved conflicts with nd/multiple-work-trees. This removed the
  need for is_git_directory_gently that was added in v6 and simplified
  some error cases.

Erik Elfström (5):
  setup: add gentle version of read_gitfile
  setup: sanity check file size in read_gitfile_gently
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   |  30 +--
 cache.h   |  12 -
 setup.c   |  91 +---
 t/perf/p7300-clean.sh |  31 +++
 t/t7300-clean.sh  | 140 ++
 5 files changed, 280 insertions(+), 24 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.3.373.gc496bfb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 5/5] clean: improve performance when removing lots of directories

2015-06-09 Thread Erik Elfström
git clean uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a massive
performance hit for large number of directories.

Modify clean.c:remove_dirs to use setup.c:is_git_directory and
setup.c:read_gitfile_gently instead.

Both these functions will open files and parse contents when they find
something that looks like a git repository. This is ok from a
performance standpoint since finding repository candidates should be
comparatively rare.

Using is_git_directory and read_gitfile_gently should give a more
standardized check for what is and what isn't a git repository but
also gives three behavioral changes.

The first change is that we will now detect and avoid cleaning empty
nested git repositories (only init run). This is desirable.

Second, we will no longer die when cleaning a file named .git with
garbage content (it will be cleaned instead). This is also desirable.

The last change is that we will detect and avoid cleaning empty bare
repositories that have been placed in a directory named .git. This
is not desirable but should have no real user impact since we already
fail to clean non-empty bare repositories in the same scenario. This
is thus deemed acceptable.

On top of this we add some extra precautions. If read_gitfile_gently
fails to open the git file, read the git file or verify the path in
the git file we assume that the path with the git file is a valid
repository and avoid cleaning.

Update t7300 to reflect these changes in behavior.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 builtin/clean.c  | 30 ++
 t/t7300-clean.sh | 10 --
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 6dcb72e..df53def 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,31 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+/*
+ * Return 1 if the given path is the root of a git repository or
+ * submodule else 0. Will not return 1 for bare repositories with the
+ * exception of creating a bare repository in foo/.git and calling
+ * is_git_repository(foo).
+ */
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   int gitfile_error;
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (read_gitfile_gently(path-buf, gitfile_error) || 
is_git_directory(path-buf))
+   ret = 1;
+   if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED ||
+   gitfile_error == READ_GITFILE_ERR_READ_FAILED)
+   ret = 1;  /* This could be a real .git file, take the
+  * safe option and avoid cleaning */
+   strbuf_setlen(path, orig_path_len);
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +179,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index fbfdf2d..32e96da 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'should clean things that almost look like git but are 
not' '
+test_expect_success 'should clean things that almost look like git but are 
not' '
rm -fr almost_git almost_bare_git almost_submodule 
mkdir -p almost_git/.git/objects 
mkdir -p almost_git/.git/refs 
@@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look 
like git but are not'
garbage
EOF
test_when_finished rm

[PATCH v6 6/7] clean: improve performance when removing lots of directories

2015-05-10 Thread Erik Elfström
git clean uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a massive
performance hit for large number of directories.

Modify clean.c:remove_dirs to use setup.c:is_git_directory and
setup.c:read_gitfile_gently instead.

Both these functions will open files and parse contents when they find
something that looks like a git repository. This is ok from a
performance standpoint since finding repository candidates should be
comparatively rare.

Using is_git_directory and read_gitfile_gently should give a more
standardized check for what is and what isn't a git repository but
also gives three behavioral changes.

The first change is that we will now detect and avoid cleaning empty
nested git repositories (only init run). This is desirable.

Second, we will no longer die when cleaning a file named .git with
garbage content (it will be cleaned instead). This is also desirable.

The last change is that we will detect and avoid cleaning empty bare
repositories that have been placed in a directory named .git. This
is not desirable but should have no real user impact since we already
fail to clean non-empty bare repositories in the same scenario. This
is thus deemed acceptable.

On top of this we add some extra precautions. If read_gitfile_gently
fails to open the git file, read the git file or verify the path in
the git file we assume that the path with the git file is a valid
repository and avoid cleaning.

Update t7300 to reflect these changes in behavior.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 builtin/clean.c  | 31 +++
 t/t7300-clean.sh | 10 --
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..d739dcf 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,32 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+/*
+ * Return 1 if the given path is the root of a git repository or
+ * submodule else 0. Will not return 1 for bare repositories with the
+ * exception of creating a bare repository in foo/.git and calling
+ * is_git_repository(foo).
+ */
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   int gitfile_error;
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (read_gitfile_gently(path-buf, gitfile_error) || 
is_git_directory(path-buf))
+   ret = 1;
+   if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED ||
+   gitfile_error == READ_GITFILE_ERR_READ_FAILED ||
+   gitfile_error == READ_GITFILE_ERR_CANT_VERIFY_PATH)
+   ret = 1;  /* This could be a real .git file, take the
+  * safe option and avoid cleaning */
+   strbuf_setlen(path, orig_path_len);
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +180,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 23962e4..fbab888 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'should clean things that almost look like git but are 
not' '
+test_expect_success 'should clean things that almost look like git but are 
not' '
rm -fr almost_git almost_bare_git almost_submodule 
mkdir -p almost_git/.git/objects 
mkdir -p almost_git/.git/refs 
@@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look 
like git

Re: [PATCH v5 1/5] setup: add gentle version of read_gitfile

2015-04-28 Thread erik elfström
On Tue, Apr 28, 2015 at 8:17 AM, Jeff King p...@peff.net wrote:

 There was a discussion not too long ago on strategies for returning
 errors, and one of the suggestions was to return an error strbuf
 rather than a code[1]. That's less flexible, as the caller can't react
 differently based on the type of error. But for cases like this, where
 the only fate for the code is to get converted back into a message,
 it can reduce the boilerplate.

 What you have here is OK to me, and I don't want to hold up your patch
 series in a flamewar about error-reporting techniques. But I think it's
 an interesting case study.

 -Peff

Thanks. I haven't had time to look through that thread yet, I'll try
to get to that later.

My initial reaction is a bit skeptical though. For this case we
currently don't want any error reporting, the NULL return is
sufficient and even allocating and sending in the int* is pure noise.
Allocating and releasing a strbuf seems like a lot more overhead for
this type of caller? The one other potential candidate caller for
read_gitfile_gently that I have seen (clone.c:get_repo_path) don't
want any error code or message either as far as i can tell.

Also if it turns out that we actually need to treat the file too
large error differently in clean (as discussed in thread on the file
size check) then we can no longer communicate that back using the
strbuf interface.

Overall it seems like a less attractive solution to me but I can be
persuaded if there is more support expressed for the strbuf direction.

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 5/5] clean: improve performance when removing lots of directories

2015-04-28 Thread erik elfström
On Tue, Apr 28, 2015 at 8:24 AM, Jeff King p...@peff.net wrote:

 This iteration looks reasonable overall to me.

 Should this is_git_repository() helper be available to other files? I
 think there are other calls to resolve_gitlink_ref() that would want the
 same treatment (e.g., I think git status may have a similar issue).

 -Peff

Yes that is probably the direction to go in but I think the current
version is too tailored to the clean case to be generally useful
(although I haven't check the status case). Right now this is more of
a is_this_a_repo_clean_should_care_about_ignoring_some_cornercase_quirks
than a true is_git_repository check.

I would think a general version would look more like this:

static int is_git_repository(struct strbuf *path)
{
if (is_bare_repository(path))
return BARE_REPO;
if (is_submodule(path))
return SUBMODULE;
if (is_git_director(path))
return REPO;
return 0;
}

probably also communicating any errors back to the caller.

To sum it up I think you are correct in the direction but I suspect
that the current version is not sufficient for the general case and
that it would be best to leave the generalization and making it public
for future work.

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 4/5] p7300: add performance tests for clean

2015-04-28 Thread erik elfström
On Tue, Apr 28, 2015 at 8:33 AM, Jeff King p...@peff.net wrote:

 Do we actually need a large repo here? The real cost is coming from the
 directories we create. We could actually start with a totally empty
 repository if we wanted (though I don't think the t/perf system handles
 that right now). But if there's not a reason to use the large repo, I
 think using test_perf_default_repo is better, as it works out of the box
 without specifying extra environment variables (well, it works either
 way, but you get a nasty warning from perf-lib.sh).

 -Peff

Sure, I'll change this to the default. I wasn't sure about the
performance characteristics of clean (does working tree size matter?)
when I started so it felt safest to go with a large repo.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/5] setup: sanity check file size in read_gitfile_gently

2015-04-28 Thread erik elfström
On Tue, Apr 28, 2015 at 8:02 AM, Jeff King p...@peff.net wrote:

 My understanding is that PATH_MAX is set absurdly low on Windows
 systems (and doesn't actually represent the real limit of a path!).
 Since the value is picked arbitrarily anyway, could use something more
 independent (like 100K or something, which is large enough to be beyond
 absurd and small enough that a malloc isn't a big deal)?

 -Peff

I'm happy to set the limit to anything that makes everybody feel safe. I'll set
it to 1MB to be on the safe side.

I'm not sure though how the code (in general) is supposed to keep working if
a path can exceed PATH_MAX? A cursory search for PATH_MAX comes
up with char array sizes and check-and-die kind of things. If a path is longer
then surely we will be unable to handle it and abort in all sorts of places?

Are you only worried we might have a submodule with a too long path (that will
create various other problems in different codepaths) that we may mistakenly
clean (if it doesn't trigger any other abort earlier in the clean call
chain) or do
you want clean to keep working and do the right thing even in this case?

While digging around looking at this I also noticed that there is
another problem
I have overlooked previously.

read_gitfile_gently will call is_git_directory at the very end and it
contains the
following check at the very beginning:

if (PATH_MAX = len + strlen(/objects))
die(Too long path: %.*s, 60, suspect);

Now, this is good in the way that we will avoid mistakenly cleaning
stuff because
the path is too long but also bad because it makes read_gitfile_gently
behave very
ungently in this case.

I suspect I should make a gentle version of this also. The question is
what to do
in clean if the path is reported as too long? Abort? Avoid cleaning it
to be safe?
Ignore and clean it?

is_git_directory is also called from the new is_git_repository
directly but here I think
dying is ok since this path is a path in the working tree and if we
can't handle the paths
in the tree then there seem to be little point in trying to go on (as
opposed to when
some string in a file is too large for a path)

Thoughts?

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 2/5] setup: sanity check file size in read_gitfile_gently

2015-04-26 Thread Erik Elfström
read_gitfile_gently will allocate a buffer to fit the entire file that
should be read. Add a sanity check of the file size before opening to
avoid allocating a potentially huge amount of memory if we come across
a large file that someone happened to name .git. The limit is set to
a sufficiently unreasonable size that should never be exceeded by a
genuine .git file.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h | 1 +
 setup.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/cache.h b/cache.h
index 868e4d3..c9f1f8e 100644
--- a/cache.h
+++ b/cache.h
@@ -439,6 +439,7 @@ extern const char *get_git_work_tree(void);
 #define READ_GITFILE_ERR_INVALID_FORMAT 5
 #define READ_GITFILE_ERR_NO_PATH 6
 #define READ_GITFILE_ERR_NOT_A_REPO 7
+#define READ_GITFILE_ERR_TOO_LARGE 8
 extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
 #define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
diff --git a/setup.c b/setup.c
index c4538ca..792c37b 100644
--- a/setup.c
+++ b/setup.c
@@ -364,6 +364,10 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
error_code = READ_GITFILE_ERR_OPEN_FAILED;
goto cleanup_return;
}
+   if (st.st_size  PATH_MAX * 4) {
+   error_code = READ_GITFILE_ERR_TOO_LARGE;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
@@ -418,6 +422,8 @@ cleanup_return:
return NULL;
case READ_GITFILE_ERR_OPEN_FAILED:
die_errno(Error opening '%s', path);
+   case READ_GITFILE_ERR_TOO_LARGE:
+   die(Too large to be a .git file: '%s', path);
case READ_GITFILE_ERR_READ_FAILED:
die(Error reading %s, path);
case READ_GITFILE_ERR_INVALID_FORMAT:
-- 
2.4.0.rc3.8.gbb31afb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 4/5] p7300: add performance tests for clean

2015-04-26 Thread Erik Elfström
The tests are run in dry-run mode to avoid having to restore the test
directories for each timed iteration. Using dry-run is an acceptable
compromise since we are mostly interested in the initial computation
of what to clean and not so much in the cleaning it self.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 31 +++
 1 file changed, 31 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..6ae55ec
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 10_sub_dirs clean_test_dir 
+   for i in $(test_seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i || return $?
+   done 
+   for i in $(test_seq 1 200)
+   do
+   cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $?
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   git clean -n -q -f -d 10_sub_dirs/
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   git clean -n -q -f -f -d 10_sub_dirs/
+'
+
+test_done
-- 
2.4.0.rc3.8.gbb31afb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 3/5] t7300: add tests to document behavior of clean and nested git

2015-04-26 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 128 +++
 1 file changed, 128 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..11f3a6d 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,134 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'should clean things that almost look like git but are 
not' '
+   rm -fr almost_git almost_bare_git almost_submodule 
+   mkdir -p almost_git/.git/objects 
+   mkdir -p almost_git/.git/refs 
+   cat almost_git/.git/HEAD -\EOF 
+   garbage
+   EOF
+   cp -r almost_git/.git/ almost_bare_git 
+   mkdir almost_submodule/ 
+   cat almost_submodule/.git -\EOF 
+   garbage
+   EOF
+   test_when_finished rm -rf almost_* 
+   ## This will fail due to die(Invalid gitfile format: %s, path); in
+   ## setup.c:read_gitfile.
+   git clean -f -d 
+   test_path_is_missing almost_git 
+   test_path_is_missing almost_bare_git 
+   test_path_is_missing almost_submodule
+'
+
+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 
+   mkdir repo to_clean 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git submodule add ./repo/.git sub1 
+   git commit -m sub1 
+   git branch before_sub2 
+   git submodule add ./repo/.git sub2 
+   git commit -m sub2 
+   git checkout before_sub2 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file repo/.git/index 
+   test_path_is_file repo/hello.world 
+   test_path_is_file sub1/.git 
+   test_path_is_file sub1/hello.world 
+   test_path_is_file sub2/.git 
+   test_path_is_file sub2/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'nested (empty) git should be kept' '
+   rm -fr empty_repo to_clean 
+   git init empty_repo 
+   mkdir to_clean 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file empty_repo/.git/HEAD 
+   test_path_is_missing to_clean
+'
+
+test_expect_success 'nested bare repositories should be cleaned' '
+   rm -fr bare1 bare2 subdir 
+   git init --bare bare1 
+   git clone --local --bare . bare2 
+   mkdir subdir 
+   cp -r bare2 subdir/bare3 
+   git clean -f -d 
+   test_path_is_missing bare1 
+   test_path_is_missing bare2 
+   test_path_is_missing subdir
+'
+
+test_expect_success 'nested (empty) bare repositories should be cleaned even 
when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git init --bare strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_failure 'nested (non-empty) bare repositories should be cleaned 
even when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git clone --local --bare . strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr repo 
+   mkdir repo 
+   (
+   cd repo 
+   git init 
+   mkdir -p bar/baz 
+   test_commit msg bar/baz/hello.world
+   ) 
+   git clean -f -d repo/bar/baz 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/bar/ 
+   test_path_is_missing repo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr repo 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/.git/refs 
+   test_path_is_dir repo/.git/objects 
+   test_path_is_dir untracked/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr repo untracked 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git/ 
+   test_path_is_dir repo/.git 
+   test_dir_is_empty repo/.git 
+   test_path_is_dir untracked/
+'
+
 test_expect_success 'force removal of nested git work tree' '
rm -fr foo bar baz 
mkdir -p foo bar baz/boo 
-- 
2.4.0.rc3.8.gbb31afb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 1/5] setup: add gentle version of read_gitfile

2015-04-26 Thread Erik Elfström
read_gitfile will die on most error cases. This makes it unsuitable
for speculative calls. Extract the core logic and provide a gentle
version that returns NULL on failure.

The first usecase of the new gentle version will be to probe for
submodules during git clean.

Helped-by: Junio C Hamano gits...@pobox.com
Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h | 11 -
 setup.c | 82 +++--
 2 files changed, 75 insertions(+), 18 deletions(-)

diff --git a/cache.h b/cache.h
index 3d3244b..868e4d3 100644
--- a/cache.h
+++ b/cache.h
@@ -431,7 +431,16 @@ extern int set_git_dir(const char *path);
 extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
-extern const char *read_gitfile(const char *path);
+
+#define READ_GITFILE_ERR_STAT_FAILED 1
+#define READ_GITFILE_ERR_NOT_A_FILE 2
+#define READ_GITFILE_ERR_OPEN_FAILED 3
+#define READ_GITFILE_ERR_READ_FAILED 4
+#define READ_GITFILE_ERR_INVALID_FORMAT 5
+#define READ_GITFILE_ERR_NO_PATH 6
+#define READ_GITFILE_ERR_NOT_A_REPO 7
+extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
+#define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index 979b13f..c4538ca 100644
--- a/setup.c
+++ b/setup.c
@@ -335,35 +335,53 @@ static int check_repository_format_gently(const char 
*gitdir, int *nongit_ok)
 /*
  * Try to read the location of the git directory from the .git file,
  * return path to git directory if found.
+ *
+ * On failure, if return_error_code is not NULL, return_error_code
+ * will be set to an error code and NULL will be returned. If
+ * return_error_code is NULL the function will die instead (for most
+ * cases).
  */
-const char *read_gitfile(const char *path)
+const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
-   char *buf;
-   char *dir;
+   int error_code = 0;
+   char *buf = NULL;
+   char *dir = NULL;
const char *slash;
struct stat st;
int fd;
ssize_t len;
 
-   if (stat(path, st))
-   return NULL;
-   if (!S_ISREG(st.st_mode))
-   return NULL;
+   if (stat(path, st)) {
+   error_code = READ_GITFILE_ERR_STAT_FAILED;
+   goto cleanup_return;
+   }
+   if (!S_ISREG(st.st_mode)) {
+   error_code = READ_GITFILE_ERR_NOT_A_FILE;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
-   if (fd  0)
-   die_errno(Error opening '%s', path);
+   if (fd  0) {
+   error_code = READ_GITFILE_ERR_OPEN_FAILED;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
-   if (len != st.st_size)
-   die(Error reading %s, path);
+   if (len != st.st_size) {
+   error_code = READ_GITFILE_ERR_READ_FAILED;
+   goto cleanup_return;
+   }
buf[len] = '\0';
-   if (!starts_with(buf, gitdir: ))
-   die(Invalid gitfile format: %s, path);
+   if (!starts_with(buf, gitdir: )) {
+   error_code = READ_GITFILE_ERR_INVALID_FORMAT;
+   goto cleanup_return;
+   }
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
-   if (len  9)
-   die(No path in gitfile: %s, path);
+   if (len  9) {
+   error_code = READ_GITFILE_ERR_NO_PATH;
+   goto cleanup_return;
+   }
buf[len] = '\0';
dir = buf + 8;
 
@@ -378,11 +396,41 @@ const char *read_gitfile(const char *path)
buf = dir;
}
 
-   if (!is_git_directory(dir))
-   die(Not a git repository: %s, dir);
+   if (!is_git_directory(dir)) {
+   error_code = READ_GITFILE_ERR_NOT_A_REPO;
+   goto cleanup_return;
+   }
path = real_path(dir);
 
+cleanup_return:
free(buf);
+
+   if (return_error_code)
+   *return_error_code = error_code;
+
+   if (error_code) {
+   if (return_error_code)
+   return NULL;
+
+   switch (error_code) {
+   case READ_GITFILE_ERR_STAT_FAILED:
+   case READ_GITFILE_ERR_NOT_A_FILE:
+   return NULL;
+   case READ_GITFILE_ERR_OPEN_FAILED:
+   die_errno(Error opening '%s', path);
+   case READ_GITFILE_ERR_READ_FAILED:
+   die(Error reading %s, path);
+   case READ_GITFILE_ERR_INVALID_FORMAT:
+   die(Invalid gitfile format: %s, path);
+   case

[PATCH v5 5/5] clean: improve performance when removing lots of directories

2015-04-26 Thread Erik Elfström
git clean uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a massive
performance hit for large number of directories.

Modify clean.c:remove_dirs to use setup.c:is_git_directory and
setup.c:read_gitfile_gently instead.

Both these functions will open files and parse contents when they find
something that looks like a git repository. This is ok from a
performance standpoint since finding repository candidates should be
comparatively rare.

Using is_git_directory and read_gitfile_gently should give a more
standardized check for what is and what isn't a git repository but
also gives three behavioral changes.

The first change is that we will now detect and avoid cleaning empty
nested git repositories (only init run). This is desirable.

Second, we will no longer die when cleaning a file named .git with
garbage content (it will be cleaned instead). This is also desirable.

The last change is that we will detect and avoid cleaning empty bare
repositories that have been placed in a directory named .git. This
is not desirable but should have no real user impact since we already
fail to clean non-empty bare repositories in the same scenario. This
is thus deemed acceptable.

Update t7300 to reflect these changes in behavior.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 builtin/clean.c  | 26 ++
 t/t7300-clean.sh |  8 +++-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..17a1c13 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,27 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+/*
+ * Return 1 if the given path is the root of a git repository or
+ * submodule else 0. Will not return 1 for bare repositories with the
+ * exception of creating a bare repository in foo/.git and calling
+ * is_git_repository(foo).
+ */
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   int unused_error_code;
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (read_gitfile_gently(path-buf, unused_error_code) || 
is_git_directory(path-buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +175,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 11f3a6d..4d07a4a 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'should clean things that almost look like git but are 
not' '
+test_expect_success 'should clean things that almost look like git but are 
not' '
rm -fr almost_git almost_bare_git almost_submodule 
mkdir -p almost_git/.git/objects 
mkdir -p almost_git/.git/refs 
@@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look 
like git but are not'
garbage
EOF
test_when_finished rm -rf almost_* 
-   ## This will fail due to die(Invalid gitfile format: %s, path); in
-   ## setup.c:read_gitfile.
git clean -f -d 
test_path_is_missing almost_git 
test_path_is_missing almost_bare_git 
@@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' '
test_path_is_missing to_clean
 '
 
-test_expect_failure 'nested (empty) git should be kept' '
+test_expect_success 'nested (empty) git should be kept' '
rm -fr

[PATCH v5 0/5] Improving performance of git clean

2015-04-26 Thread Erik Elfström
Changes in v5:
* Added defines for read_gitfile_gently error codes. 
  This was a silly mistake, sorry about that.

Erik Elfström (5):
  setup: add gentle version of read_gitfile
  setup: sanity check file size in read_gitfile_gently
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   |  26 +--
 cache.h   |  12 -
 setup.c   |  88 ---
 t/perf/p7300-clean.sh |  31 +
 t/t7300-clean.sh  | 126 ++
 5 files changed, 261 insertions(+), 22 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.rc3.8.gbb31afb

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 2/5] setup: sanity check file size in read_gitfile_gently

2015-04-25 Thread Erik Elfström
On Sat, Apr 25, 2015 at 6:47 PM, Junio C Hamano gits...@pobox.com wrote:
 I do not think it is wrong per-se, but the changes in this patch
 shows why hardcoded values assigned to error_code without #define is
 not a good idea, as these values are now exposed to the callers of
 the new function.  After we gain a new caller that does care about
 the exact error code (e.g. to react differently to the reason why we
 failed to read by giving different error messages) if we decide to
 revert this step, or if we decide to add a new error type, for
 whatever reason, this organization forces the caller to be updated.

Yes, it was a bit silly of me not to add that. Would something like
this be ok?

---
 cache.h |  9 +
 setup.c | 32 
 2 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/cache.h b/cache.h
index 6e29068..177337a 100644
--- a/cache.h
+++ b/cache.h
@@ -431,6 +431,15 @@ extern int set_git_dir(const char *path);
 extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
+
+#define READ_GITFILE_ERR_STAT_FAILED 1
+#define READ_GITFILE_ERR_NOT_A_FILE 2
+#define READ_GITFILE_ERR_OPEN_FAILED 3
+#define READ_GITFILE_ERR_TOO_LARGE 4
+#define READ_GITFILE_ERR_READ_FAILED 5
+#define READ_GITFILE_ERR_INVALID_FORMAT 6
+#define READ_GITFILE_ERR_NO_PATH 7
+#define READ_GITFILE_ERR_NOT_A_REPO 8
 extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
 #define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
diff --git a/setup.c b/setup.c
index ed87334..792c37b 100644
--- a/setup.c
+++ b/setup.c
@@ -352,38 +352,38 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
ssize_t len;
 
if (stat(path, st)) {
-   error_code = 1;
+   error_code = READ_GITFILE_ERR_STAT_FAILED;
goto cleanup_return;
}
if (!S_ISREG(st.st_mode)) {
-   error_code = 2;
+   error_code = READ_GITFILE_ERR_NOT_A_FILE;
goto cleanup_return;
}
fd = open(path, O_RDONLY);
if (fd  0) {
-   error_code = 3;
+   error_code = READ_GITFILE_ERR_OPEN_FAILED;
goto cleanup_return;
}
if (st.st_size  PATH_MAX * 4) {
-   error_code = 4;
+   error_code = READ_GITFILE_ERR_TOO_LARGE;
goto cleanup_return;
}
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
if (len != st.st_size) {
-   error_code = 5;
+   error_code = READ_GITFILE_ERR_READ_FAILED;
goto cleanup_return;
}
buf[len] = '\0';
if (!starts_with(buf, gitdir: )) {
-   error_code = 6;
+   error_code = READ_GITFILE_ERR_INVALID_FORMAT;
goto cleanup_return;
}
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
if (len  9) {
-   error_code = 7;
+   error_code = READ_GITFILE_ERR_NO_PATH;
goto cleanup_return;
}
buf[len] = '\0';
@@ -401,7 +401,7 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
}
 
if (!is_git_directory(dir)) {
-   error_code = 8;
+   error_code = READ_GITFILE_ERR_NOT_A_REPO;
goto cleanup_return;
}
path = real_path(dir);
@@ -417,20 +417,20 @@ cleanup_return:
return NULL;
 
switch (error_code) {
-   case 1: // failed to stat
-   case 2: // not regular file
+   case READ_GITFILE_ERR_STAT_FAILED:
+   case READ_GITFILE_ERR_NOT_A_FILE:
return NULL;
-   case 3:
+   case READ_GITFILE_ERR_OPEN_FAILED:
die_errno(Error opening '%s', path);
-   case 4:
+   case READ_GITFILE_ERR_TOO_LARGE:
die(Too large to be a .git file: '%s', path);
-   case 5:
+   case READ_GITFILE_ERR_READ_FAILED:
die(Error reading %s, path);
-   case 6:
+   case READ_GITFILE_ERR_INVALID_FORMAT:
die(Invalid gitfile format: %s, path);
-   case 7:
+   case READ_GITFILE_ERR_NO_PATH:
die(No path in gitfile: %s, path);
-   case 8:
+   case READ_GITFILE_ERR_NOT_A_REPO:
die(Not a git repository: %s, dir);
default:
assert(0);
-- 
2.4.0.rc3.8.g86acfbd.dirty

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org

[PATCH v4 3/5] t7300: add tests to document behavior of clean and nested git

2015-04-25 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 128 +++
 1 file changed, 128 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..11f3a6d 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,134 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'should clean things that almost look like git but are 
not' '
+   rm -fr almost_git almost_bare_git almost_submodule 
+   mkdir -p almost_git/.git/objects 
+   mkdir -p almost_git/.git/refs 
+   cat almost_git/.git/HEAD -\EOF 
+   garbage
+   EOF
+   cp -r almost_git/.git/ almost_bare_git 
+   mkdir almost_submodule/ 
+   cat almost_submodule/.git -\EOF 
+   garbage
+   EOF
+   test_when_finished rm -rf almost_* 
+   ## This will fail due to die(Invalid gitfile format: %s, path); in
+   ## setup.c:read_gitfile.
+   git clean -f -d 
+   test_path_is_missing almost_git 
+   test_path_is_missing almost_bare_git 
+   test_path_is_missing almost_submodule
+'
+
+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 
+   mkdir repo to_clean 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git submodule add ./repo/.git sub1 
+   git commit -m sub1 
+   git branch before_sub2 
+   git submodule add ./repo/.git sub2 
+   git commit -m sub2 
+   git checkout before_sub2 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file repo/.git/index 
+   test_path_is_file repo/hello.world 
+   test_path_is_file sub1/.git 
+   test_path_is_file sub1/hello.world 
+   test_path_is_file sub2/.git 
+   test_path_is_file sub2/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'nested (empty) git should be kept' '
+   rm -fr empty_repo to_clean 
+   git init empty_repo 
+   mkdir to_clean 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file empty_repo/.git/HEAD 
+   test_path_is_missing to_clean
+'
+
+test_expect_success 'nested bare repositories should be cleaned' '
+   rm -fr bare1 bare2 subdir 
+   git init --bare bare1 
+   git clone --local --bare . bare2 
+   mkdir subdir 
+   cp -r bare2 subdir/bare3 
+   git clean -f -d 
+   test_path_is_missing bare1 
+   test_path_is_missing bare2 
+   test_path_is_missing subdir
+'
+
+test_expect_success 'nested (empty) bare repositories should be cleaned even 
when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git init --bare strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_failure 'nested (non-empty) bare repositories should be cleaned 
even when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git clone --local --bare . strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr repo 
+   mkdir repo 
+   (
+   cd repo 
+   git init 
+   mkdir -p bar/baz 
+   test_commit msg bar/baz/hello.world
+   ) 
+   git clean -f -d repo/bar/baz 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/bar/ 
+   test_path_is_missing repo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr repo 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git 
+   test_path_is_file repo/.git/HEAD 
+   test_path_is_dir repo/.git/refs 
+   test_path_is_dir repo/.git/objects 
+   test_path_is_dir untracked/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr repo untracked 
+   mkdir repo untracked 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d repo/.git/ 
+   test_path_is_dir repo/.git 
+   test_dir_is_empty repo/.git 
+   test_path_is_dir untracked/
+'
+
 test_expect_success 'force removal of nested git work tree' '
rm -fr foo bar baz 
mkdir -p foo bar baz/boo 
-- 
2.4.0.rc3.8.g4ebd28d

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 1/5] setup: add gentle version of read_gitfile

2015-04-25 Thread Erik Elfström
read_gitfile will die on most error cases. This makes it unsuitable
for speculative calls. Extract the core logic and provide a gentle
version that returns NULL on failure.

The first usecase of the new gentle version will be to probe for
submodules during git clean.

Helped-by: Junio C Hamano gits...@pobox.com
Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 cache.h |  3 ++-
 setup.c | 82 +++--
 2 files changed, 67 insertions(+), 18 deletions(-)

diff --git a/cache.h b/cache.h
index 3d3244b..6e29068 100644
--- a/cache.h
+++ b/cache.h
@@ -431,7 +431,8 @@ extern int set_git_dir(const char *path);
 extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
-extern const char *read_gitfile(const char *path);
+extern const char *read_gitfile_gently(const char *path, int 
*return_error_code);
+#define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index 979b13f..e1897cc 100644
--- a/setup.c
+++ b/setup.c
@@ -335,35 +335,53 @@ static int check_repository_format_gently(const char 
*gitdir, int *nongit_ok)
 /*
  * Try to read the location of the git directory from the .git file,
  * return path to git directory if found.
+ *
+ * On failure, if return_error_code is not NULL, return_error_code
+ * will be set to an error code and NULL will be returned. If
+ * return_error_code is NULL the function will die instead (for most
+ * cases).
  */
-const char *read_gitfile(const char *path)
+const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
-   char *buf;
-   char *dir;
+   int error_code = 0;
+   char *buf = NULL;
+   char *dir = NULL;
const char *slash;
struct stat st;
int fd;
ssize_t len;
 
-   if (stat(path, st))
-   return NULL;
-   if (!S_ISREG(st.st_mode))
-   return NULL;
+   if (stat(path, st)) {
+   error_code = 1;
+   goto cleanup_return;
+   }
+   if (!S_ISREG(st.st_mode)) {
+   error_code = 2;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
-   if (fd  0)
-   die_errno(Error opening '%s', path);
+   if (fd  0) {
+   error_code = 3;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
-   if (len != st.st_size)
-   die(Error reading %s, path);
+   if (len != st.st_size) {
+   error_code = 4;
+   goto cleanup_return;
+   }
buf[len] = '\0';
-   if (!starts_with(buf, gitdir: ))
-   die(Invalid gitfile format: %s, path);
+   if (!starts_with(buf, gitdir: )) {
+   error_code = 5;
+   goto cleanup_return;
+   }
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
-   if (len  9)
-   die(No path in gitfile: %s, path);
+   if (len  9) {
+   error_code = 6;
+   goto cleanup_return;
+   }
buf[len] = '\0';
dir = buf + 8;
 
@@ -378,11 +396,41 @@ const char *read_gitfile(const char *path)
buf = dir;
}
 
-   if (!is_git_directory(dir))
-   die(Not a git repository: %s, dir);
+   if (!is_git_directory(dir)) {
+   error_code = 7;
+   goto cleanup_return;
+   }
path = real_path(dir);
 
+cleanup_return:
free(buf);
+
+   if (return_error_code)
+   *return_error_code = error_code;
+
+   if (error_code) {
+   if (return_error_code)
+   return NULL;
+
+   switch (error_code) {
+   case 1: // failed to stat
+   case 2: // not regular file
+   return NULL;
+   case 3:
+   die_errno(Error opening '%s', path);
+   case 4:
+   die(Error reading %s, path);
+   case 5:
+   die(Invalid gitfile format: %s, path);
+   case 6:
+   die(No path in gitfile: %s, path);
+   case 7:
+   die(Not a git repository: %s, dir);
+   default:
+   assert(0);
+   }
+   }
+
return path;
 }
 
-- 
2.4.0.rc3.8.g4ebd28d

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 5/5] clean: improve performance when removing lots of directories

2015-04-25 Thread Erik Elfström
git clean uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a massive
performance hit for large number of directories.

Modify clean.c:remove_dirs to use setup.c:is_git_directory and
setup.c:read_gitfile_gently instead.

Both these functions will open files and parse contents when they find
something that looks like a git repository. This is ok from a
performance standpoint since finding repository candidates should be
comparatively rare.

Using is_git_directory and read_gitfile_gently should give a more
standardized check for what is and what isn't a git repository but
also gives three behavioral changes.

The first change is that we will now detect and avoid cleaning empty
nested git repositories (only init run). This is desirable.

Second, we will no longer die when cleaning a file named .git with
garbage content (it will be cleaned instead). This is also desirable.

The last change is that we will detect and avoid cleaning empty bare
repositories that have been placed in a directory named .git. This
is not desirable but should have no real user impact since we already
fail to clean non-empty bare repositories in the same scenario. This
is thus deemed acceptable.

Update t7300 to reflect these changes in behavior.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 builtin/clean.c  | 26 ++
 t/t7300-clean.sh |  8 +++-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..17a1c13 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,27 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+/*
+ * Return 1 if the given path is the root of a git repository or
+ * submodule else 0. Will not return 1 for bare repositories with the
+ * exception of creating a bare repository in foo/.git and calling
+ * is_git_repository(foo).
+ */
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   int unused_error_code;
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (read_gitfile_gently(path-buf, unused_error_code) || 
is_git_directory(path-buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +175,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 11f3a6d..4d07a4a 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'should clean things that almost look like git but are 
not' '
+test_expect_success 'should clean things that almost look like git but are 
not' '
rm -fr almost_git almost_bare_git almost_submodule 
mkdir -p almost_git/.git/objects 
mkdir -p almost_git/.git/refs 
@@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look 
like git but are not'
garbage
EOF
test_when_finished rm -rf almost_* 
-   ## This will fail due to die(Invalid gitfile format: %s, path); in
-   ## setup.c:read_gitfile.
git clean -f -d 
test_path_is_missing almost_git 
test_path_is_missing almost_bare_git 
@@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' '
test_path_is_missing to_clean
 '
 
-test_expect_failure 'nested (empty) git should be kept' '
+test_expect_success 'nested (empty) git should be kept' '
rm -fr

[PATCH v4 4/5] p7300: add performance tests for clean

2015-04-25 Thread Erik Elfström
The tests are run in dry-run mode to avoid having to restore the test
directories for each timed iteration. Using dry-run is an acceptable
compromise since we are mostly interested in the initial computation
of what to clean and not so much in the cleaning it self.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 31 +++
 1 file changed, 31 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..6ae55ec
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 10_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 10_sub_dirs clean_test_dir 
+   for i in $(test_seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i || return $?
+   done 
+   for i in $(test_seq 1 200)
+   do
+   cp -r 500_sub_dirs 10_sub_dirs/dir$i || return $?
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   git clean -n -q -f -d 10_sub_dirs/
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   git clean -n -q -f -f -d 10_sub_dirs/
+'
+
+test_done
-- 
2.4.0.rc3.8.g4ebd28d

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 2/5] setup: sanity check file size in read_gitfile_gently

2015-04-25 Thread Erik Elfström
read_gitfile_gently will allocate a buffer to fit the entire file that
should be read. Add a sanity check of the file size before opening to
avoid allocating a potentially huge amount of memory if we come across
a large file that someone happened to name .git. The limit is set to
a sufficiently unreasonable size that should never be exceeded by a
genuine .git file.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---

I'm not sure about this one but it felt like the safe thing to do.
This patch can be dropped if it is not desired.

I considered testing it using
 mkdir foo  truncate -s 200G foo/.git  git clean -f -d
but that feels like a pretty evil test that is likely to cause lots
of problems and not fail in any good way.

 setup.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/setup.c b/setup.c
index e1897cc..ed87334 100644
--- a/setup.c
+++ b/setup.c
@@ -364,22 +364,26 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
error_code = 3;
goto cleanup_return;
}
+   if (st.st_size  PATH_MAX * 4) {
+   error_code = 4;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
if (len != st.st_size) {
-   error_code = 4;
+   error_code = 5;
goto cleanup_return;
}
buf[len] = '\0';
if (!starts_with(buf, gitdir: )) {
-   error_code = 5;
+   error_code = 6;
goto cleanup_return;
}
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
if (len  9) {
-   error_code = 6;
+   error_code = 7;
goto cleanup_return;
}
buf[len] = '\0';
@@ -397,7 +401,7 @@ const char *read_gitfile_gently(const char *path, int 
*return_error_code)
}
 
if (!is_git_directory(dir)) {
-   error_code = 7;
+   error_code = 8;
goto cleanup_return;
}
path = real_path(dir);
@@ -419,12 +423,14 @@ cleanup_return:
case 3:
die_errno(Error opening '%s', path);
case 4:
-   die(Error reading %s, path);
+   die(Too large to be a .git file: '%s', path);
case 5:
-   die(Invalid gitfile format: %s, path);
+   die(Error reading %s, path);
case 6:
-   die(No path in gitfile: %s, path);
+   die(Invalid gitfile format: %s, path);
case 7:
+   die(No path in gitfile: %s, path);
+   case 8:
die(Not a git repository: %s, dir);
default:
assert(0);
-- 
2.4.0.rc3.8.g4ebd28d

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/5] Improving performance of git clean

2015-04-25 Thread Erik Elfström
v3 of the patch can be found here:
http://thread.gmane.org/gmane.comp.version-control.git/267422

Changes in v4:
* changed some tests to use more meaningful dir names.
* fixed performance test by doing git clean -n to avoid
  timing setup code. Increased test size to 10 directories
  (~0.5s runtime).
* changed interface of read_gitfile_gently to be able to
  return error code.
* fixed a compiler warning in read_gitfile_gently (warning:
  ‘dir’ may be used uninitialized in this function).
* added sanity check of git file size in read_gitfile_gently
* updated commit message in [5/5] to more clearly motivate
  remaining behavioral changes of git clean.

Thanks to Junio C Hamano and Jeff King for comments and help on v3.

Erik Elfström (5):
  setup: add gentle version of read_gitfile
  setup: sanity check file size in read_gitfile_gently
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   |  26 +--
 cache.h   |   3 +-
 setup.c   |  88 ---
 t/perf/p7300-clean.sh |  31 +
 t/t7300-clean.sh  | 126 ++
 5 files changed, 252 insertions(+), 22 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.rc3.8.g4ebd28d

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v3 0/4] Improving performance of git clean

2015-04-22 Thread erik elfström
On Tue, Apr 21, 2015 at 11:24 PM, Jeff King p...@peff.net wrote:

 If I understand correctly, the reason that you need per-run setup is
 that your git clean command actually cleans things, and you need to
 restore the original state for each time-trial. Can you instead use git
 clean -n to do a dry-run? I think what you are timing is really the
 figure out what to clean step, and not the cleaning itself.

 -Peff


Yes, that is the problem. A dry run will spot this particular performance
issue but maybe we lose some value as a general performance test if
we only do half the clean? Admittedly we clearly lose some value in
the current state as well due to the copying taking more time than the
cleaning. I could go either way here.

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v3 0/4] Improving performance of git clean

2015-04-22 Thread erik elfström
On Wed, Apr 22, 2015 at 9:46 PM, Jeff King p...@peff.net wrote:
 On Wed, Apr 22, 2015 at 09:30:20PM +0200, erik elfström wrote:

 Yes, that is the problem. A dry run will spot this particular performance
 issue but maybe we lose some value as a general performance test if
 we only do half the clean? Admittedly we clearly lose some value in
 the current state as well due to the copying taking more time than the
 cleaning. I could go either way here.

 I guess it is a matter of opinion. I think testing only the find out
 what to clean half separately is actually beneficial, because it helps
 us isolate any slowdown. If we want to add a test for the other half, we
 can, but I do not actually think it is currently that interesting (it is
 just calling unlink() in a loop).

 So even leaving the practical matters aside, I do not think it is a bad
 thing to split it up. When you add in the fact that it is practically
 much easier to test the first half, it seems to me that testing just
 that is a good first step.

 -Peff

Sounds reasonable to me. I'll make this change in v4, thanks!

(Sorry for the duplicate email Jeff, I'm bad at this mailing list thing...)

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v3 0/4] Improving performance of git clean

2015-04-21 Thread erik elfström
On Sun, Apr 19, 2015 at 3:14 AM, Junio C Hamano gits...@pobox.com wrote:
 Erik Elfström erik.elfst...@gmail.com writes:

 Known Problems:
 * Unsure about the setup.c:read_gitfile refactor, feels a bit
   messy?

 The interface indeed feels somewhat messy.  I suspect that a better
 interface might be more like setup_git_directory_gently() that is a
 gentler version of setup_git_directory().  The function takes an
 optional pointer to an int, and it behaves not-gently-at-all when
 the optional argument is NULL.  The location the optional pointer
 points at, when it is not NULL, can be used to return the error
 state to the caller.  That function pair only uses the optional
 argument to convey one bit of information (i.e. are we in any git
 repository or not?) back to the caller, but the interface could be
 used to tell the caller a lot more if we wanted to.

 If you model read_gitfile_gently() after that pattern, I would
 expect that

  - The extra pattern would be int *error;
  - The implementation of read_gitfile() would be

#define read_gitfile(path) read_gitfile_gently((path), NULL)

and the _gently() version will die when error parameter is set
to NULL and finds any error.

  - The caller of the gentle variant can use the error code to
determine what went wrong, not just the fact that it failed.  I
do not think your caller does not have an immediate need to tell
between invalid gitfile format and No path in gitfile, but
such an interface leaves that possibility open.

 Thanks.


Ok, I'll try that approach, thanks. (and apologies for the late reply)

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v3 0/4] Improving performance of git clean

2015-04-21 Thread erik elfström
Ok, thanks for looking into this.

I have no well founded opinions on the implementation but I do
think the performance tests would be more meaningful if the
setup/cleanup code could be removed from the timed section.
If the community agrees on an implementation I would be happy
to convert the new tests, either directly in this series or as a follow
up if that is preferred.

/Erik

On Tue, Apr 21, 2015 at 12:14 AM, Thomas Gummerer t.gumme...@gmail.com wrote:
 On 04/18, Erik Elfström wrote:
 * Still have issues in the performance tests, see comments
   from Thomas Gummerer on v2

 I've looked at the modern style tests again, and I don't the code
 churn is worth it just for using them for the performance tests.  If
 anyone wants to take a look at the code, it's at
 github.com/tgummerer/git tg/perf-lib.

 I think adding the test_perf_setup_cleanup command would make more
 sense in this case.  If you want I can send a patch for that.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH/RFC v3 4/4] clean: improve performance when removing lots of directories

2015-04-18 Thread Erik Elfström
git clean uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a massive
performance hit for large number of directories.

Modify clean.c:remove_dirs to use setup.c:is_git_directory and
setup.c:read_gitfile_gently instead.

Both these functions will open files and parse contents when they find
something that looks like a git repository. This is ok from a
performance standpoint since finding repository candidates should be
comparatively rare.

Using is_git_directory and read_gitfile_gently should give a more
standardized check for what is and what isn't a git repository but
also gives a slight behavioral change. We will now detect and respect
empty nested git repositories (only init run) and empty bare
repositories that have been placed in a .git directory. We will also
no longer die when cleaning a file named .git with garbage content
(it will be cleaned instead). Update t7300 to reflect this.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 builtin/clean.c  | 25 +
 t/t7300-clean.sh |  8 +++-
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..5cda3c5 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,26 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+/*
+ * Return 1 if the given path is the root of a git repository or
+ * submodule else 0. Will not return 1 for bare repositories with the
+ * exception of creating a bare repository in foo/.git and calling
+ * is_git_repository(foo).
+ */
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (read_gitfile_gently(path-buf) || is_git_directory(path-buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +174,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 4b9a72a..1bbb8ef 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'should clean things that almost look like git but are 
not' '
+test_expect_success 'should clean things that almost look like git but are 
not' '
rm -fr almost_git almost_bare_git almost_submodule 
mkdir -p almost_git/.git/objects 
mkdir -p almost_git/.git/refs 
@@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look 
like git but are not'
garbage
EOF
test_when_finished rm -rf almost_* 
-   ## This will fail due to die(Invalid gitfile format: %s, path); in
-   ## setup.c:read_gitfile.
git clean -f -d 
test_path_is_missing almost_git 
test_path_is_missing almost_bare_git 
@@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' '
test_path_is_missing to_clean
 '
 
-test_expect_failure 'nested (empty) git should be kept' '
+test_expect_success 'nested (empty) git should be kept' '
rm -fr foo bar 
git init foo 
mkdir bar 
@@ -523,7 +521,7 @@ test_expect_success 'nested bare repositories should be 
cleaned' '
test_path_is_missing subdir
 '
 
-test_expect_success 'nested (empty) bare repositories should be cleaned even 
when in .git' '
+test_expect_failure 'nested (empty) bare repositories should be cleaned even 
when in .git' '
rm -fr

[PATCH/RFC v3 2/4] t7300: add tests to document behavior of clean and nested git

2015-04-18 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 127 +++
 1 file changed, 127 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..4b9a72a 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,133 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'should clean things that almost look like git but are 
not' '
+   rm -fr almost_git almost_bare_git almost_submodule 
+   mkdir -p almost_git/.git/objects 
+   mkdir -p almost_git/.git/refs 
+   cat almost_git/.git/HEAD -\EOF 
+   garbage
+   EOF
+   cp -r almost_git/.git/ almost_bare_git 
+   mkdir almost_submodule/ 
+   cat almost_submodule/.git -\EOF 
+   garbage
+   EOF
+   test_when_finished rm -rf almost_* 
+   ## This will fail due to die(Invalid gitfile format: %s, path); in
+   ## setup.c:read_gitfile.
+   git clean -f -d 
+   test_path_is_missing almost_git 
+   test_path_is_missing almost_bare_git 
+   test_path_is_missing almost_submodule
+'
+
+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 
+   mkdir repo to_clean 
+   (
+   cd repo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git submodule add ./repo/.git sub1 
+   git commit -m sub1 
+   git branch before_sub2 
+   git submodule add ./repo/.git sub2 
+   git commit -m sub2 
+   git checkout before_sub2 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file repo/.git/index 
+   test_path_is_file repo/hello.world 
+   test_path_is_file sub1/.git 
+   test_path_is_file sub1/hello.world 
+   test_path_is_file sub2/.git 
+   test_path_is_file sub2/hello.world 
+   test_path_is_missing to_clean
+'
+
+test_expect_failure 'nested (empty) git should be kept' '
+   rm -fr foo bar 
+   git init foo 
+   mkdir bar 
+   bar/goodbye.people 
+   git clean -f -d 
+   test_path_is_file foo/.git/HEAD 
+   test_path_is_missing bar
+'
+
+test_expect_success 'nested bare repositories should be cleaned' '
+   rm -fr bare1 bare2 subdir 
+   git init --bare bare1 
+   git clone --local --bare . bare2 
+   mkdir subdir 
+   cp -r bare2 subdir/bare3 
+   git clean -f -d 
+   test_path_is_missing bare1 
+   test_path_is_missing bare2 
+   test_path_is_missing subdir
+'
+
+test_expect_success 'nested (empty) bare repositories should be cleaned even 
when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git init --bare strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_failure 'nested (non-empty) bare repositories should be cleaned 
even when in .git' '
+   rm -fr strange_bare 
+   mkdir strange_bare 
+   git clone --local --bare . strange_bare/.git 
+   git clean -f -d 
+   test_path_is_missing strange_bare
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr foo 
+   mkdir foo 
+   (
+   cd foo 
+   git init 
+   mkdir -p bar/baz 
+   test_commit msg bar/baz/hello.world
+   ) 
+   git clean -f -d foo/bar/baz 
+   test_path_is_file foo/.git/HEAD 
+   test_path_is_dir foo/bar/ 
+   test_path_is_missing foo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr foo 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d foo/.git 
+   test_path_is_file foo/.git/HEAD 
+   test_path_is_dir foo/.git/refs 
+   test_path_is_dir foo/.git/objects 
+   test_path_is_dir bar/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr foo bar 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init 
+   test_commit msg hello.world
+   ) 
+   git clean -f -d foo/.git/ 
+   test_path_is_dir foo/.git 
+   test_dir_is_empty foo/.git
+'
+
 test_expect_success 'force removal of nested git work tree' '
rm -fr foo bar baz 
mkdir -p foo bar baz/boo 
-- 
2.4.0.rc2.5.g2871d5e

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH/RFC v3 0/4] Improving performance of git clean

2015-04-18 Thread Erik Elfström
I've marked this RFC since there are known problems here.

v2 of the patch can be found here:
http://thread.gmane.org/gmane.comp.version-control.git/267023/focus=267023

Changes in v3:
* Created setup.c:read_gitfile_gently to use for submodule
  probing
* Cleanup of some tests by use of test_commit helper
* Added more tests of cleaning in the presence of submodules
* Reversed expectation of test for cleaning nested bare repos.
  They are now expected to be cleaned. Added one more case.
* Fixed bug where submodules could be cleaned by using new
  read_gitfile_gently for additional submodule check in
  clean.c:is_git_repository
* Attempt to change behavior of patch implementation to clean
  bare repositories (only partially successful)
* Reworded commit message of the performance fix commit

Known Problems:
* Unsure about the setup.c:read_gitfile refactor, feels a bit
  messy?
* Potentially a missing sanity check of git file size in
  setup.c:read_gitfile_gently_or_non_gently
* We still get a behavioral change for empty bare repositories
  placed in a .git directory. Currently we clean empty bare
  repos in a .git folder but not non-empty one. After this
  patch we won't clean either. How serious is this? Is there
  an easy fix (preferebly to clean all bare repositories)?
* Still have issues in the performance tests, see comments
  from Thomas Gummerer on v2

Thanks to Junio C Hamano and Jeff King for spotting fundamental
problems in v2 and suggesting a solution.

Erik Elfström (4):
  setup: add gentle version of read_gitfile
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   |  25 --
 cache.h   |   1 +
 setup.c   |  94 -
 t/perf/p7300-clean.sh |  37 +++
 t/t7300-clean.sh  | 125 ++
 5 files changed, 257 insertions(+), 25 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.rc2.5.g2871d5e

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH/RFC v3 1/4] setup: add gentle version of read_gitfile

2015-04-18 Thread Erik Elfström
read_gitfile will die on most error cases. This makes it unsuitable
for speculative calls. Extract the core logic and provide a gentle
version that returns NULL on failure.

The first usecase of the new gentle version will be to probe for
submodules during git clean.

Helped-by: Junio C Hamano gits...@pobox.com
Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
If this is going to be used for speculative probing should there
be a sanity check before:
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
Something like:
if (st.st_size  PATH_MAX*2) {
error = N;
goto cleanup_return;
{
What do you think?

 cache.h |  1 +
 setup.c | 94 ++---
 2 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/cache.h b/cache.h
index 3d3244b..9d9199d 100644
--- a/cache.h
+++ b/cache.h
@@ -432,6 +432,7 @@ extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
 extern const char *read_gitfile(const char *path);
+extern const char *read_gitfile_gently(const char *path);
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index 979b13f..a33b293 100644
--- a/setup.c
+++ b/setup.c
@@ -332,38 +332,46 @@ static int check_repository_format_gently(const char 
*gitdir, int *nongit_ok)
return 0;
 }
 
-/*
- * Try to read the location of the git directory from the .git file,
- * return path to git directory if found.
- */
-const char *read_gitfile(const char *path)
+static const char *read_gitfile_gently_or_non_gently(const char *path, int 
gently)
 {
-   char *buf;
+   int error = 0;
+   char *buf = NULL;
char *dir;
const char *slash;
struct stat st;
int fd;
ssize_t len;
-
-   if (stat(path, st))
-   return NULL;
-   if (!S_ISREG(st.st_mode))
-   return NULL;
+   if (stat(path, st)) {
+   error = 1;
+   goto cleanup_return;
+   }
+   if (!S_ISREG(st.st_mode)) {
+   error = 2;
+   goto cleanup_return;
+   }
fd = open(path, O_RDONLY);
-   if (fd  0)
-   die_errno(Error opening '%s', path);
+   if (fd  0) {
+   error = 3;
+   goto cleanup_return;
+   }
buf = xmalloc(st.st_size + 1);
len = read_in_full(fd, buf, st.st_size);
close(fd);
-   if (len != st.st_size)
-   die(Error reading %s, path);
+   if (len != st.st_size) {
+   error = 4;
+   goto cleanup_return;
+   }
buf[len] = '\0';
-   if (!starts_with(buf, gitdir: ))
-   die(Invalid gitfile format: %s, path);
+   if (!starts_with(buf, gitdir: )) {
+   error = 5;
+   goto cleanup_return;
+   }
while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
len--;
-   if (len  9)
-   die(No path in gitfile: %s, path);
+   if (len  9) {
+   error = 6;
+   goto cleanup_return;
+   }
buf[len] = '\0';
dir = buf + 8;
 
@@ -378,14 +386,58 @@ const char *read_gitfile(const char *path)
buf = dir;
}
 
-   if (!is_git_directory(dir))
-   die(Not a git repository: %s, dir);
+   if (!is_git_directory(dir)) {
+   error = 7;
+   goto cleanup_return;
+   }
path = real_path(dir);
 
+cleanup_return:
free(buf);
+
+   if (error) {
+   if (gently)
+   return NULL;
+
+   switch (error) {
+   case 1: // failed to stat
+   case 2: // not regular file
+   return NULL;
+   case 3:
+   die_errno(Error opening '%s', path);
+   case 4:
+   die(Error reading %s, path);
+   case 5:
+   die(Invalid gitfile format: %s, path);
+   case 6:
+   die(No path in gitfile: %s, path);
+   case 7:
+   die(Not a git repository: %s, dir);
+   default:
+   assert(0);
+   }
+   }
+
return path;
 }
 
+/*
+ * Try to read the location of the git directory from the .git file,
+ * return path to git directory if found, die on (most) failures.
+ */
+const char *read_gitfile(const char *path)
+{
+   return read_gitfile_gently_or_non_gently(path, 0);
+}
+
+/*
+ * Same as read_gitfile but return NULL on failure.
+ */
+const char *read_gitfile_gently(const char *path)
+{
+   return read_gitfile_gently_or_non_gently(path, 1);
+}
+
 static const char

[PATCH/RFC v3 3/4] p7300: add performance tests for clean

2015-04-18 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 37 +
 1 file changed, 37 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..af50d5d
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 5_sub_dirs clean_test_dir 
+   for i in $(test_seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i || return $?
+   done 
+   for i in $(test_seq 1 100)
+   do
+   cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $?
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   rm -rf clean_test_dir/5_sub_dirs_cpy 
+   cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
+   git clean -q -f -d  clean_test_dir/ 
+   test_dir_is_empty clean_test_dir
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   rm -rf clean_test_dir/5_sub_dirs_cpy 
+   cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
+   git clean -q -f -f -d  clean_test_dir/ 
+   test_dir_is_empty clean_test_dir
+'
+
+test_done
-- 
2.4.0.rc2.5.g2871d5e

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-17 Thread erik elfström
On Wed, Apr 15, 2015 at 7:56 PM, Junio C Hamano gits...@pobox.com wrote:
 Erik Elfström erik.elfst...@gmail.com writes:

 Before this change, clean used resolve_gitlink_ref to check for the
 presence of nested git repositories. This had the drawback of creating
 a ref_cache entry for every directory that should potentially be
 cleaned. The linear search through the ref_cache list caused a massive
 performance hit for large number of directories.

 I'd prefer to see the current state described in the current
 tense, e.g.

 git clean uses resolve_gitlink_ref() to check for the presence of
 nested git repositories, but it has the drawback of creating a
 ref_cache entry for every directory that should potentially be
 cleaned. The linear search through the ref_cache list causes a
 massive performance hit for large number of directories.


Yes, that reads better.

 Teach clean.c:remove_dirs to use setup.c:is_git_directory
 instead. is_git_directory will actually open HEAD and parse the HEAD
 ref but this implies a nested git repository and should be rare when
 cleaning.

 I am not sure what you wanted to say in this paragraph.  What does
 it being rare have to do with it?  Even if it is not rare (i.e. the
 top-level project you are working with has many submodules checked
 out without using the more recent a file .git pointing into
 .git/modules/ via 'gitdir: $overThere' mechanism), if we found a
 nested git repository, we treat it as special and exclude it from
 cleaning it out, which is a good thing, no?


I was trying to motivate that the performance of is_git_directory is not a
problem for us even though it opens a file and parses it. I see now when I
read it again that this is not very clear.

 Doesn't this implementation get confused by modern submodule
 checkouts and descend into and clean their working tree, though?
 Module M with path P would have a directory P in the working tree of
 the top-level project, and P/.git is a regular file that will fail
 is_git_directory() test but records the location of the real
 submodule repository i.e. .git/modules/M via the gitdir:
 mechanism.


Yes, there is a problem here. I've added the test below and it fails after
my change by cleaning sub2 (sub1 is not cleaned). Are there more cases
here that I should test for?

+test_expect_success 'should not clean submodules' '
+   rm -fr repo to_clean sub1 sub2 
+   mkdir -p repo to_clean 
+   (
+   cd repo 
+   git init 
+   hello.world
+   git add . 
+   git commit -a -m nested
+   ) 
+   git submodule add ./repo/.git sub1 
+   git commit -m sub1 
+   git branch before_sub2 
+   git submodule add ./repo/.git sub2 
+   git commit -m sub2 
+   git checkout before_sub2 
+   to_clean/should_clean.this 
+   git clean -f -d 
+   test_path_is_file repo/.git/index 
+   test_path_is_file repo/hello.world 
+   test_path_is_file sub1/.git 
+   test_path_is_file sub1/hello.world 
+   test_path_is_file sub2/.git 
+   test_path_is_file sub2/hello.world 
+   test_path_is_missing to_clean
+'

Base on the previous discussion of the patch topic I can see 3 options
for how to fix this:

Option 1:
 Plug the hole in my new is_git_repository function. A quick and dirty
 fix that passes the above test would be:

diff --git a/builtin/clean.c b/builtin/clean.c
index b679913..4f2fe95 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -153,14 +153,21 @@ static int is_git_repository(struct strbuf *path)
if (is_git_directory(path-buf))
ret = 1;
else {
-   size_t orig_path_len = path-len;
-   assert(orig_path_len != 0);
-   if (path-buf[orig_path_len - 1] != '/')
-   strbuf_addch(path, '/');
-   strbuf_addstr(path, .git);
-   if (is_git_directory(path-buf))
-   ret = 1;
-   strbuf_setlen(path, orig_path_len);
+   struct stat st;
+   const char *submodule_git_dir =
git_path_submodule(path-buf, );
+   lstat(submodule_git_dir, st);
+   if (S_ISDIR(st.st_mode))
+   ret = 1;
+   else {
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (is_git_directory(path-buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   }
}

return ret;

There are probably more elegant solutions available here, suggestions
welcome.

Option 2:
 Go with the current solution of using resolve_gitlink_ref but either
 A) avoid placing non-submodule paths in the ref_cache list

Re: [PATCH v2 2/3] p7300: add performance tests for clean

2015-04-12 Thread erik elfström
On Sat, Apr 11, 2015 at 7:59 PM, Thomas Gummerer t.gumme...@gmail.com wrote:
 On 04/11, Erik Elfström wrote:
 Signed-off-by: Erik Elfström erik.elfst...@gmail.com
 ---
  t/perf/p7300-clean.sh | 37 +
  1 file changed, 37 insertions(+)
  create mode 100755 t/perf/p7300-clean.sh

 diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
 new file mode 100755
 index 000..af50d5d
 --- /dev/null
 +++ b/t/perf/p7300-clean.sh
 @@ -0,0 +1,37 @@
 +#!/bin/sh
 +
 +test_description=Test git-clean performance
 +
 +. ./perf-lib.sh
 +
 +test_perf_large_repo
 +test_checkout_worktree
 +
 +test_expect_success 'setup untracked directory with many sub dirs' '
 + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir 
 + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir 
 + for i in $(test_seq 1 500)
 + do
 + mkdir 500_sub_dirs/dir$i || return $?
 + done 
 + for i in $(test_seq 1 100)
 + do
 + cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $?
 + done
 +'
 +
 +test_perf 'clean many untracked sub dirs, check for nested git' '
 + rm -rf clean_test_dir/5_sub_dirs_cpy 
 + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 

 Maybe this would be a good place to use test_perf_cleanup, which I
 introduced a while ago and you can find in the
 tg/perf-lib-test-perf-cleanup branch?  It probably won't influence the
 performance a lot, but still better separate the code that actually
 needs to be tested from the cleanup/preparation code.  Ditto in the
 other test.


Yes, that would be a clear improvement. I was looking for something like
this, the copy takes more time than the clean currently.

The cleanup hook is maybe not exactly the right fit here though. I would
need to do one initial copy in the setup test and then a copy in the
cleanup, something like this:

test_expect_success 'setup untracked directory with many sub dirs' '
...
cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy
'

test_perf_cleanup 'clean many untracked sub dirs, check for nested git' '
git clean -q -f -d  clean_test_dir/
' '
test_dir_is_empty clean_test_dir 
rm -rf clean_test_dir/5_sub_dirs_cpy 
cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy
'

This works better than my original code but maybe we can do even better
with something like:

test_setup_perf_cleanup 'clean many untracked sub dirs, check for nested git' '
rm -rf clean_test_dir/5_sub_dirs_cpy 
cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy
' '
git clean -q -f -d  clean_test_dir/
' '
test_dir_is_empty clean_test_dir
'

Having a setup phase avoids the initial copy in the setup test making
things a little easier to follow. I'm not sure its worth the extra complexity
in perf-lib though (and I'm not sure I would be able to implement it either).

Also, what would the implications be if I were to use your new cleanup
function that is not yet on master? Should I rebase on top of your topic
or make a follow up patch to switch over?

 + git clean -q -f -d  clean_test_dir/ 
 + test_dir_is_empty clean_test_dir
 +'
 +
 +test_perf 'clean many untracked sub dirs, ignore nested git' '
 + rm -rf clean_test_dir/5_sub_dirs_cpy 
 + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
 + git clean -q -f -f -d  clean_test_dir/ 
 + test_dir_is_empty clean_test_dir
 +'
 +
 +test_done
 --
 2.4.0.rc0.37.ga3b75b3

 --
 To unsubscribe from this list: send the line unsubscribe git in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/3] Improving performance of git clean

2015-04-11 Thread Erik Elfström
v1 of the patch can be found here:
http://thread.gmane.org/gmane.comp.version-control.git/266839/focus=266839

changes in v2:
* fixed commit message,
  p7300: added performance tests for clean
  change to:
  p7300: add performance tests for clean
* simplified test code
* removed non portable ls -A in test
* removed non portable $(seq ) in test
* fixed missing  || return $? in test
* fixed missing sub shell for 'cd' command in test
* fixed broken  chains in test
* added assert new clean.c:is_git_repository to guard against
  negative array index
* use size_t instead of int for strbuf-len

fixes held back for cleanup patches:
* fixed existing broken  chains
* added assert in existing code to guard against
  negative array index

Thanks to Eric Sunshine and Torsten Bögershausen for the very helpful
review!


Erik Elfström (3):
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   | 24 ++---
 t/perf/p7300-clean.sh | 37 ++
 t/t7300-clean.sh  | 72 +++
 3 files changed, 129 insertions(+), 4 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/3] t7300: add tests to document behavior of clean and nested git

2015-04-11 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/t7300-clean.sh | 72 
 1 file changed, 72 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..58e6b4a 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,78 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'nested git (only init) should be kept' '
+   rm -fr foo bar 
+   git init foo 
+   mkdir bar 
+   bar/goodbye.people 
+   git clean -f -d 
+   test_path_is_file foo/.git/HEAD 
+   test_path_is_missing bar
+'
+
+test_expect_failure 'nested git (bare) should be kept' '
+   rm -fr foo bar 
+   git init --bare foo 
+   mkdir bar 
+   bar/goodbye.people 
+   git clean -f -d 
+   test_path_is_file foo/HEAD 
+   test_path_is_missing bar
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr foo 
+   mkdir foo 
+   (
+   cd foo 
+   git init 
+   mkdir -p bar/baz 
+   (
+   cd bar/baz 
+   hello.world 
+   git add . 
+   git commit -a -m nested
+   )
+   ) 
+   git clean -f -d foo/bar/baz 
+   test_path_is_file foo/.git/HEAD 
+   test_path_is_dir foo/bar/ 
+   test_path_is_missing foo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr foo 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init 
+   hello.world 
+   git add . 
+   git commit -a -m nested
+   ) 
+   git clean -f -d foo/.git 
+   test_path_is_file foo/.git/HEAD 
+   test_path_is_dir foo/.git/refs 
+   test_path_is_dir foo/.git/objects 
+   test_path_is_dir bar/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr foo bar 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init 
+   hello.world 
+   git add . 
+   git commit -a -m nested
+   ) 
+   git clean -f -d foo/.git/ 
+   test_path_is_dir foo/.git 
+   test_dir_is_empty foo/.git
+'
+
 test_expect_success 'force removal of nested git work tree' '
rm -fr foo bar baz 
mkdir -p foo bar baz/boo 
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/3] clean: improve performance when removing lots of directories

2015-04-11 Thread Erik Elfström
Before this change, clean used resolve_gitlink_ref to check for the
presence of nested git repositories. This had the drawback of creating
a ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list caused a massive
performance hit for large number of directories.

Teach clean.c:remove_dirs to use setup.c:is_git_directory
instead. is_git_directory will actually open HEAD and parse the HEAD
ref but this implies a nested git repository and should be rare when
cleaning.

Using is_git_directory should give a more standardized check for what
is and what isn't a git repository but also gives a slight behavioral
change. We will now detect and respect bare and empty nested git
repositories (only init run). Update t7300 to reflect this.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King p...@peff.net
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 builtin/clean.c  | 24 
 t/t7300-clean.sh |  4 ++--
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..b679913 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,25 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   if (is_git_directory(path-buf))
+   ret = 1;
+   else {
+   size_t orig_path_len = path-len;
+   assert(orig_path_len != 0);
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (is_git_directory(path-buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   }
+
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +173,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 58e6b4a..da294fe 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'nested git (only init) should be kept' '
+test_expect_success 'nested git (only init) should be kept' '
rm -fr foo bar 
git init foo 
mkdir bar 
@@ -465,7 +465,7 @@ test_expect_failure 'nested git (only init) should be kept' 
'
test_path_is_missing bar
 '
 
-test_expect_failure 'nested git (bare) should be kept' '
+test_expect_success 'nested git (bare) should be kept' '
rm -fr foo bar 
git init --bare foo 
mkdir bar 
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/3] p7300: add performance tests for clean

2015-04-11 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 37 +
 1 file changed, 37 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..af50d5d
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 5_sub_dirs clean_test_dir 
+   for i in $(test_seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i || return $?
+   done 
+   for i in $(test_seq 1 100)
+   do
+   cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $?
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   rm -rf clean_test_dir/5_sub_dirs_cpy 
+   cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
+   git clean -q -f -d  clean_test_dir/ 
+   test_dir_is_empty clean_test_dir
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   rm -rf clean_test_dir/5_sub_dirs_cpy 
+   cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
+   git clean -q -f -f -d  clean_test_dir/ 
+   test_dir_is_empty clean_test_dir
+'
+
+test_done
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] p7300: added performance tests for clean

2015-04-07 Thread erik elfström
Will fix!

Also I forgot to ask, does anyone have a good way of moving the copy
out of the performance timing?

After the fix this test spends more time copying than cleaning and
that is not so good. I'm not very good at shell scripting and the only
way I could think of was to make multiple copies at the start and then
in the test check and clean the first non empty directory. That feels
kind of ugly and will fail if a different number of iterations per
test is used. Any suggestions?

I looked a bit at the framework code but I couldn't figure out if it
was easy to add a setup option to be called be before each
iteration.

On Tue, Apr 7, 2015 at 12:09 AM, Eric Sunshine sunsh...@sunshineco.com wrote:
 On Mon, Apr 6, 2015 at 4:40 PM, Torsten Bögershausen tbo...@web.de wrote:
 On 2015-04-06 13.48, Erik Elfström wrote:
 Signed-off-by: Erik Elfström erik.elfst...@gmail.com
 ---
 diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
 new file mode 100755
 index 000..3f56fb2
 --- /dev/null
 +++ b/t/perf/p7300-clean.sh
 @@ -0,0 +1,37 @@
 +#!/bin/sh
 +
 +test_description=Test git-clean performance
 +
 +. ./perf-lib.sh
 +
 +test_perf_large_repo
 +test_checkout_worktree
 +
 +test_expect_success 'setup untracked directory with many sub dirs' '
 + rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir 
 + mkdir 500_sub_dirs 5_sub_dirs clean_test_dir 
 + for i in $(seq 1 500)
 I think seq is bash-only, and can be easily replaced by test_seq

 test_seq is definitely desirable. 'seq' is not present on some systems I use.


 + do
 + mkdir 500_sub_dirs/dir$i

 You may want:

 mkdir 500_sub_dirs/dir$i || return $?

 to catch failure of 'mkdir'.

 + done 
 + for i in $(seq 1 100)
 + do
 + cp -r 500_sub_dirs 5_sub_dirs/dir$i

 Ditto:

 cp -r 500_sub_dirs 5_sub_dirs/dir$i || return $?

 + done
 +'
 +
 +test_perf 'clean many untracked sub dirs, check for nested git' '
 + rm -rf clean_test_dir/5_sub_dirs_cpy 
 + cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
 + git clean -q -f -d  clean_test_dir/ 
 + test 0 = $(ls -A clean_test_dir/ | wc -l)

 Is the ls -A portable on all systems:
 http://pubs.opengroup.org/onlinepubs/009695399/utilities/ls.html

 My feeling is that the emptiness of a directory can by tested by simply 
 removing it:
 + git clean -q -f -d  clean_test_dir/ 
 + rmdir clean_test_dir
 (And similar below)

 There's also the test_dir_is_empty() function which makes the intent
 even clearer than 'rmdir'.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] clean: improve performance when removing lots of directories

2015-04-07 Thread erik elfström
On Tue, Apr 7, 2015 at 12:10 AM, Eric Sunshine sunsh...@sunshineco.com wrote:
 On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström erik.elfst...@gmail.com wrote:
 Before this change, clean used resolve_gitlink_ref to check for the
 presence of nested git repositories. This had the drawback of creating
 a ref_cache entry for every directory that should potentially be
 cleaned. The linear search through the ref_cache list caused a massive
 performance hit for large number of directories.

 Teach clean.c:remove_dirs to use setup.c:is_git_directory
 instead. is_git_directory will actually open HEAD and parse the HEAD
 ref but this implies a nested git repository and should be rare when
 cleaning.

 Using is_git_directory should give a more standardized check for what
 is and what isn't a git repository but also gives a slight behavioral
 change. We will now detect and respect bare and empty nested git
 repositories (only init run). Update t7300 to reflect this.

 The time to clean an untracked directory containing 10 sub
 directories went from 61s to 1.7s after this change.

 Impressive.

 Signed-off-by: Erik Elfström erik.elfst...@gmail.com
 Helped-by: Jeff King p...@peff.net

 It is customary for your sign-off to be last.

 More below...

 ---
 diff --git a/builtin/clean.c b/builtin/clean.c
 index 98c103f..e951bd9 100644
 --- a/builtin/clean.c
 +++ b/builtin/clean.c
 @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const 
 char *arg, int unset)
 return 0;
  }

 +static int is_git_repository(struct strbuf *path)
 +{
 +   int ret = 0;
 +   if (is_git_directory(path-buf))
 +   ret = 1;
 +   else {
 +   int orig_path_len = path-len;
 +   if (path-buf[orig_path_len - 1] != '/')

 Minor: I don't know how others feel about it, but I always find it a
 bit disturbing to see a potential negative array access without a
 safety check that orig_path_len is not 0, either directly in the
 conditional or as a documenting assert().



I think I would prefer to accept empty input and return false rather
than assert. What to you think about:

static int is_git_repository(struct strbuf *path)
{
int ret = 0;
size_t orig_path_len = path-len;
if (orig_path_len == 0)
ret = 0;
else if (is_git_directory(path-buf))
ret = 1;
else {
if (path-buf[orig_path_len - 1] != '/')
strbuf_addch(path, '/');
strbuf_addstr(path, .git);
if (is_git_directory(path-buf))
ret = 1;
strbuf_setlen(path, orig_path_len);
}

return ret;
}


Also I borrowed this pattern from remove_dirs and it has the same
problem. Should I add something like this as a separate commit?

diff --git a/builtin/clean.c b/builtin/clean.c
index ccffd8a..88850e3 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -173,7 +173,8 @@ static int remove_dirs(struct strbuf *path, const
char *prefix, int force_flag,
DIR *dir;
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
-   int res = 0, ret = 0, gone = 1, original_len = path-len, len;
+   int res = 0, ret = 0, gone = 1;
+   size_t original_len = path-len, len;
struct string_list dels = STRING_LIST_INIT_DUP;

*dir_gone = 1;
@@ -201,6 +202,7 @@ static int remove_dirs(struct strbuf *path, const
char *prefix, int force_flag,
return res;
}

+   assert(original_len  0  expects non-empty path);
if (path-buf[original_len - 1] != '/')
strbuf_addch(path, '/');


 +   strbuf_addch(path, '/');
 +   strbuf_addstr(path, .git);
 +   if (is_git_directory(path-buf))
 +   ret = 1;
 +   strbuf_setlen(path, orig_path_len);
 +   }
 +
 +   return ret;
 +}
 +
  static int remove_dirs(struct strbuf *path, const char *prefix, int 
 force_flag,
 int dry_run, int quiet, int *dir_gone)
  {
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] t7300: add tests to document behavior of clean and nested git

2015-04-07 Thread erik elfström
will fix!

On Tue, Apr 7, 2015 at 12:06 AM, Eric Sunshine sunsh...@sunshineco.com wrote:
 On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström erik.elfst...@gmail.com wrote:
 Signed-off-by: Erik Elfström erik.elfst...@gmail.com
 ---
 diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
 index 99be5d9..cfdf6d4 100755
 --- a/t/t7300-clean.sh
 +++ b/t/t7300-clean.sh
 @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
 ! test -d bar
  '

 +test_expect_failure 'nested git (only init) should be kept' '
 +   rm -fr foo bar 
 +   mkdir foo bar 
 +   (
 +   cd foo 
 +   git init
 +   ) 
 +   (
 +   cd bar 
 +   goodbye.people
 +   ) 

 (minor; ignore if desired) The above could be simplified to:

 rm -fr foo bar 
 git init foo 
 mkdir bar 
 bar/goodbye.people 

 +   git clean -f -d 
 +   test -f foo/.git/HEAD 
 +   ! test -d bar

 Here and elsewhere, you could instead use test_path_is_file() and
 test_path_is_missing(), respectively.

 +'
 +
 +test_expect_failure 'nested git (bare) should be kept' '
 +   rm -fr foo bar 
 +   mkdir foo bar 
 +   (
 +   cd foo 
 +   git init --bare
 +   ) 
 +   (
 +   cd bar 
 +   goodbye.people
 +   ) 

 Simplified:

 rm -fr foo bar 
 git init --bare foo 
 mkdir bar 
 bar/goodbye.people 

 +   git clean -f -d 
 +   test -f foo/HEAD 
 +   ! test -d bar
 +'
 +
 +test_expect_success 'giving path in nested git work tree will remove it' '
 +   rm -fr foo 
 +   mkdir foo 
 +   (
 +   cd foo 
 +   git init 
 +   mkdir -p bar/baz 
 +   cd bar/baz 
 +   hello.world
 +   git add . 
 +   git commit -a -m nested
 +   ) 
 +   git clean -f -d foo/bar/baz 
 +   test -f foo/.git/HEAD 
 +   test -d foo/bar/ 

 Alternative, here and elsewhere: test_path_is_dir()

 +   ! test -d foo/bar/baz
 +'
 +
 +test_expect_success 'giving path to nested .git will not remove it' '
 +   rm -fr foo 
 +   mkdir foo bar 
 +   (
 +   cd foo 
 +   git init 
 +   hello.world
 +   git add . 
 +   git commit -a -m nested
 +   ) 
 +   git clean -f -d foo/.git 
 +   test -f foo/.git/HEAD 
 +   test -d foo/.git/refs 
 +   test -d foo/.git/objects 
 +   test -d bar/
 +'
 +
 +test_expect_success 'giving path to nested .git/ will remove contents' '
 +   rm -fr foo bar 
 +   mkdir foo bar 
 +   (
 +   cd foo 
 +   git init 
 +   hello.world
 +   git add . 
 +   git commit -a -m nested
 +   ) 
 +   git clean -f -d foo/.git/ 
 +   test 0 = $(ls -A foo/.git | wc -l) 

 Although in the latest POSIX, -A may not be portable.

 Alternative: test_dir_is_empty()

 +   test -d foo/.git
 +'
 +
  test_expect_success 'force removal of nested git work tree' '
 rm -fr foo bar baz 
 mkdir -p foo bar baz/boo 
 --
 2.4.0.rc0.37.ga3b75b3
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] clean: improve performance when removing lots of directories

2015-04-06 Thread Erik Elfström
Before this change, clean used resolve_gitlink_ref to check for the
presence of nested git repositories. This had the drawback of creating
a ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list caused a massive
performance hit for large number of directories.

Teach clean.c:remove_dirs to use setup.c:is_git_directory
instead. is_git_directory will actually open HEAD and parse the HEAD
ref but this implies a nested git repository and should be rare when
cleaning.

Using is_git_directory should give a more standardized check for what
is and what isn't a git repository but also gives a slight behavioral
change. We will now detect and respect bare and empty nested git
repositories (only init run). Update t7300 to reflect this.

The time to clean an untracked directory containing 10 sub
directories went from 61s to 1.7s after this change.

Signed-off-by: Erik Elfström erik.elfst...@gmail.com
Helped-by: Jeff King p...@peff.net
---
 builtin/clean.c  | 23 +++
 t/t7300-clean.sh |  4 ++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..e951bd9 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include cache.h
 #include dir.h
 #include parse-options.h
-#include refs.h
 #include string-list.h
 #include quote.h
 #include column.h
@@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char 
*arg, int unset)
return 0;
 }
 
+static int is_git_repository(struct strbuf *path)
+{
+   int ret = 0;
+   if (is_git_directory(path-buf))
+   ret = 1;
+   else {
+   int orig_path_len = path-len;
+   if (path-buf[orig_path_len - 1] != '/')
+   strbuf_addch(path, '/');
+   strbuf_addstr(path, .git);
+   if (is_git_directory(path-buf))
+   ret = 1;
+   strbuf_setlen(path, orig_path_len);
+   }
+
+   return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +172,11 @@ static int remove_dirs(struct strbuf *path, const char 
*prefix, int force_flag,
struct strbuf quoted = STRBUF_INIT;
struct dirent *e;
int res = 0, ret = 0, gone = 1, original_len = path-len, len;
-   unsigned char submodule_head[20];
struct string_list dels = STRING_LIST_INIT_DUP;
 
*dir_gone = 1;
 
-   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
-   !resolve_gitlink_ref(path-buf, HEAD, 
submodule_head)) {
+   if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT)  
is_git_repository(path)) {
if (!quiet) {
quote_path_relative(path-buf, prefix, quoted);
printf(dry_run ?  _(msg_would_skip_git_dir) : 
_(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index cfdf6d4..ada569d 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
-test_expect_failure 'nested git (only init) should be kept' '
+test_expect_success 'nested git (only init) should be kept' '
rm -fr foo bar 
mkdir foo bar 
(
@@ -471,7 +471,7 @@ test_expect_failure 'nested git (only init) should be kept' 
'
! test -d bar
 '
 
-test_expect_failure 'nested git (bare) should be kept' '
+test_expect_success 'nested git (bare) should be kept' '
rm -fr foo bar 
mkdir foo bar 
(
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] t7300: add tests to document behavior of clean and nested git

2015-04-06 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---

These tests were added so that I could understand the corner case
behaviors of clean and nested repositories and document the changes in
behavior that this series will cause. The ones marked as expect
failure will be changed to expect success later in the series.

 t/t7300-clean.sh | 82 
 1 file changed, 82 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..cfdf6d4 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
! test -d bar
 '
 
+test_expect_failure 'nested git (only init) should be kept' '
+   rm -fr foo bar 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init
+   ) 
+   (
+   cd bar 
+   goodbye.people
+   ) 
+   git clean -f -d 
+   test -f foo/.git/HEAD 
+   ! test -d bar
+'
+
+test_expect_failure 'nested git (bare) should be kept' '
+   rm -fr foo bar 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init --bare
+   ) 
+   (
+   cd bar 
+   goodbye.people
+   ) 
+   git clean -f -d 
+   test -f foo/HEAD 
+   ! test -d bar
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+   rm -fr foo 
+   mkdir foo 
+   (
+   cd foo 
+   git init 
+   mkdir -p bar/baz 
+   cd bar/baz 
+   hello.world
+   git add . 
+   git commit -a -m nested
+   ) 
+   git clean -f -d foo/bar/baz 
+   test -f foo/.git/HEAD 
+   test -d foo/bar/ 
+   ! test -d foo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+   rm -fr foo 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init 
+   hello.world
+   git add . 
+   git commit -a -m nested
+   ) 
+   git clean -f -d foo/.git 
+   test -f foo/.git/HEAD 
+   test -d foo/.git/refs 
+   test -d foo/.git/objects 
+   test -d bar/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+   rm -fr foo bar 
+   mkdir foo bar 
+   (
+   cd foo 
+   git init 
+   hello.world
+   git add . 
+   git commit -a -m nested
+   ) 
+   git clean -f -d foo/.git/ 
+   test 0 = $(ls -A foo/.git | wc -l) 
+   test -d foo/.git
+'
+
 test_expect_success 'force removal of nested git work tree' '
rm -fr foo bar baz 
mkdir -p foo bar baz/boo 
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] p7300: added performance tests for clean

2015-04-06 Thread Erik Elfström
Signed-off-by: Erik Elfström erik.elfst...@gmail.com
---
 t/perf/p7300-clean.sh | 37 +
 1 file changed, 37 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 000..3f56fb2
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+test_description=Test git-clean performance
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+   rm -rf 500_sub_dirs 5_sub_dirs clean_test_dir 
+   mkdir 500_sub_dirs 5_sub_dirs clean_test_dir 
+   for i in $(seq 1 500)
+   do
+   mkdir 500_sub_dirs/dir$i
+   done 
+   for i in $(seq 1 100)
+   do
+   cp -r 500_sub_dirs 5_sub_dirs/dir$i
+   done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+   rm -rf clean_test_dir/5_sub_dirs_cpy 
+   cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
+   git clean -q -f -d  clean_test_dir/ 
+   test 0 = $(ls -A clean_test_dir/ | wc -l)
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+   rm -rf clean_test_dir/5_sub_dirs_cpy 
+   cp -r 5_sub_dirs clean_test_dir/5_sub_dirs_cpy 
+   git clean -q -f -f -d  clean_test_dir/ 
+   test 0 = $(ls -A clean_test_dir/ | wc -l)
+'
+
+test_done
-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Improving performance of git clean

2015-04-06 Thread Erik Elfström
This series addresses a performance issue of git clean previously
discussed here:
http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265560
and here:
http://thread.gmane.org/gmane.comp.version-control.git/266777/focus=266777

The issue manifests when trying to clean a large number of untracked
directories. In my case this scenario triggered by a test suite
running in the repository creating a directory for each test resulting
in a build directory with ~10 sub directories that needs to be
cleaned. For some extreme cases, clean times of more than 1h have been
observed.

With this series, the time to clean an untracked directory containing
10 sub directories goes from 61s to 1.7s.

The main change is to switch the repository check in
clean.c:remove_dirs from using refs.c:resolve_gitlink_ref to
setup.c:is_git_directory.

One potential issue that is_git_directory contains the following check:

if (getenv(DB_ENVIRONMENT)) {
if (access(getenv(DB_ENVIRONMENT), X_OK))
return 0;
}

I'm not sure how this will affect this usecase (checking for some
other nested git repo). Please give some thought to this when
reviewing.

Jeff King also expressed concerns that we may have similar performance
issues in other commands and that it could be good to unify these is
this a repo?-checks. This series only attempts to solve the git-clean
case.

Erik Elfström (3):
  t7300: add tests to document behavior of clean and nested git
  p7300: added performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c   | 23 ---
 t/perf/p7300-clean.sh | 37 +++
 t/t7300-clean.sh  | 82 +++
 3 files changed, 138 insertions(+), 4 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.rc0.37.ga3b75b3

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


git clean performance issues

2015-04-04 Thread erik elfström
Hi,

I'm having a performance issue with git clean -qxfd (note, not using
-ff).

The performance issue shows up when trying to clean untracked
directories that themselves contain many sub directories. The
performance is highly non linear with the number of sub
directories. Some test numbers:

DirsTime
1   0m0.754s
5   0m16.606s
10  1m31.418s

When running git clean -qxffd (note, using -ff) the performance is
fast and linear:

DirsTime
1   0m0.158s
5   0m0.918s
10  0m1.639s

After checking the source of git-clean my understanding of the problem
is as follows:

When clean.c:cmd_clean finds a directory and the -d flag is given it
will call clean.c:remove_dirs to potentially remove the directory and
all sub directories.

Unless -ff is given remove_dirs tries to be nice and not remove
directories containing other git repositories. To do this it does the
following check:

...
if ((force_flag  REMOVE_DIR_KEEP_NESTED_GIT) 
!resolve_gitlink_ref(path-buf, HEAD, submodule_head)) {
...

The problem is that refs.c:resolve_gitlink_ref will call
refs.c:get_ref_cache that will linearly search a linked list of cache
entries and create and insert a new ref_cache entry in the list for
each path it is given if it fails to find an existing entry:

for (refs = submodule_ref_caches; refs; refs = refs-next)
if (!strcmp(submodule, refs-name))
return refs;

refs = create_ref_cache(submodule);
refs-next = submodule_ref_caches;
submodule_ref_caches = refs;
return refs;

In my scenario get_ref_cache will be called 1+ times, each time
with a new path. The final few calls will need to search through and
compare 1+ entries before realizing that there is no existing
entry. This quickly ads up to 100 million+ calls to strcmp().

From what I can understand, the calls to get_ref_cache in this
scenario will never do any useful work. Is this correct? If so, would
it be possible to bypass it, maybe by calling
resolve_gitlink_ref_recursive directly or by using some other way of
checking for the presence of a git repo in clean.c:remove_dirs?

/Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git clean performance issues

2015-04-04 Thread erik elfström
That looks like the same issue. The use is_git_directory approach
sounds good to me, is that the direction you would prefer? I can try
to cobble something together although I must warn you I have zero
previous experience with this code base so a few iterations will
probably be needed.

/Erik

On Sat, Apr 4, 2015 at 9:55 PM, Jeff King p...@peff.net wrote:
 On Sat, Apr 04, 2015 at 08:32:45PM +0200, erik elfström wrote:

 In my scenario get_ref_cache will be called 1+ times, each time
 with a new path. The final few calls will need to search through and
 compare 1+ entries before realizing that there is no existing
 entry. This quickly ads up to 100 million+ calls to strcmp().

 From what I can understand, the calls to get_ref_cache in this
 scenario will never do any useful work. Is this correct? If so, would
 it be possible to bypass it, maybe by calling
 resolve_gitlink_ref_recursive directly or by using some other way of
 checking for the presence of a git repo in clean.c:remove_dirs?

 I think this is the same issue that was discussed here:

   http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265585

 There is some discussion of a possible fix in that thread. I was hoping
 that Andreas was going to look further and produce a patch, but I
 imagine he got busy with other things. Do you want to try picking it up?

 -Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html