Re: Git in Outreachy Dec-Mar?

2018-09-02 Thread Christian Couder
On Sat, Sep 1, 2018 at 10:43 AM, Jeff King  wrote:
> On Fri, Aug 31, 2018 at 10:16:49AM +0200, Christian Couder wrote:
>
>> >   2. To get our landing page and list of projects in order (and also
>> >  micro-projects for applicants). This can probably build on the
>> >  previous round at:
>> >
>> >https://git.github.io/Outreachy-15/
>> >
>> >  and on the project/microprojects lists for GSoC (which will need
>> >  some updating and culling).
>>
>> Ok to take a look at that.
>
> Thanks. I think sooner is better for this (for you or anybody else who's
> interested in mentoring). The application period opens on September
> 10th, but I think the (still growing) list of projects is already being
> looked at by potential candidates.

So here is a landing page for the next Outreachy round:

https://git.github.io/Outreachy-17/

about the microprojects I am not sure which page I should create or improve.


Re: What's cooking in git.git (Aug 2018, #06; Wed, 29)

2018-09-02 Thread Stephen & Linda Smith
On Wednesday, August 29, 2018 3:35:57 PM MST Junio C Hamano wrote:
> 
> * mk/use-size-t-in-zlib (2017-08-10) 1 commit
>  . zlib.c: use size_t for size
> 
>  The wrapper to call into zlib followed our long tradition to use
>  "unsigned long" for sizes of regions in memory, which have been
>  updated to use "size_t".
> 
>  Needs resurrecting by making sure the fix is good and still applies
>  (or adjusted to today's codebase).

This appears to be part of a large patch set. [1]  Does the entire patch set 
need revisiting or just the one for zlib.c?

sps

[1] 
https://public-inbox.org/git/1502914591-26215-1-git-send-email-martin@mail.zuhause/




Re: Feature request: hooks directory

2018-09-02 Thread Christian Couder
Hi Wesley,

On Sun, Sep 2, 2018 at 11:38 PM, Wesley Schwengle  wrote:
> Hi all,
>
> I've made some progress with the hook.d implementation. It isn't
> finished, as it is my first C project I'm still somewhat rocky with
> how pointers and such work, but I'm getting somewhere. I haven't
> broken any tests \o/.

Great! Welcome to the Git community!

>  You have a nice testsuite btw. Feel free to comment on the code.
>
> I've moved some of the hooks-code found in run-command.c to hooks.c
>
> You can see the progress on gitlab: https://gitlab.com/waterkip/git
> or on github: https://github.com/waterkip/git
> The output of format-patch is down below.

It would be nicer if you could give links that let us see more
directly the commits you made, for example:
https://gitlab.com/waterkip/git/commits/multi-hooks

> I have some questions regarding the following two functions in run-command.c:
> * run_hook_le
> * run_hook_ve
>
> What do the postfixes le and ve mean?

It's about the arguments the function accepts, in a similar way to
exec*() functions, see `man execve` and `man execle`.
In short 'l' means list, 'v' means array of pointer to strings and 'e'
means environment.

> format-patch:
>
> From 129d8aff8257b22210beadc155cdbcae99b0fc4b Mon Sep 17 00:00:00 2001
> From: Wesley Schwengle 
> Date: Sun, 2 Sep 2018 02:40:04 +0200
> Subject: [PATCH] WIP: Add hook.d support in git

This is not the best way to embed a patch in an email. There is
Documentation/SubmittingPatches in the code base, that should explain
better ways to send patches to the mailing list.

> Add a generic mechanism to find and execute one or multiple hooks found
> in $GIT_DIR/hooks/ and/or $GIT_DIR/hooks/.d/*
>
> The API is as follows:
>
> #include "hooks.h"
>
> array hooks   = find_hooks('pre-receive');
> int hooks_ran = run_hooks(hooks);
>
> The implemented behaviour is:
>
> * If we find a hooks/.d directory and the hooks.multiHook flag isn't
>   set we make use of that directory.
>
> * If we find a hooks/.d and we also have hooks/ and the
>   hooks.multiHook isn't set or set to false we don't use the hook.d
>   directory. If the hook isn't set we issue a warning to the user
>   telling him/her that we support multiple hooks via the .d directory
>   structure
>
> * If the hooks.multiHook is set to true we use the hooks/ and all
>   the entries found in hooks/.d
>
> * All the scripts are executed and fail on the first error

Maybe the above documentation should be fully embedded as comments in
"hooks.h" (or perhaps added to a new file in
"Documentation/technical/", though it looks like we prefer to embed
doc in header files these days).

> diff --git a/hooks.h b/hooks.h
> new file mode 100644
> index 0..278d63344
> --- /dev/null
> +++ b/hooks.h
> @@ -0,0 +1,35 @@
> +#ifndef HOOKS_H
> +#define HOOKS_H
> +
> +#ifndef NO_PTHREADS
> +#include 
> +#endif
> +/*
> + * Returns all the hooks found in either
> + * $GIT_DIR/hooks/$hook and/or $GIT_DIR/hooks/$hook.d/
> + *
> + * Note that this points to static storage that will be
> + * overwritten by further calls to find_hooks and run_hook_*.
> + */
> +//extern const struct string_list *find_hooks(const char *name);

The above comment is using "//" which we forbid and should probably be
removed anyway.

> +extern const char *find_hooks(const char *name);
> +
> +/* Unsure what this does/is/etc */
> +//LAST_ARG_MUST_BE_NULL

This is to make it easier for tools like compilers to check that a
function is used correctly. You should not remove such macros.

> +/*
> + * Run all the runnable hooks found in
> + * $GIT_DIR/hooks/$hook and/or $GIT_DIR/hooks/$hook.d/
> + *
> + */
> +//extern int run_hooks_le(const char *const *env, const char *name, ...);
> +//extern int run_hooks_ve(const char *const *env, const char *name,
> va_list args);

Strange that these functions are commented out.

> +#endif
> +
> +/* Workings of hooks
> + *
> + * array_of_hooks  = find_hooks(name);
> + * number_of_hooks_ran = run_hooks(array_of_hooks);
> + *
> + */

This kind of documentation should probably be at the beginning of the
file, see strbuf.h for example.

Thanks,
Christian.


Re: Feature request: hooks directory

2018-09-02 Thread Wesley Schwengle
Hi all,

I've made some progress with the hook.d implementation. It isn't
finished, as it is my first C project I'm still somewhat rocky with
how pointers and such work, but I'm getting somewhere. I haven't
broken any tests \o/.
 You have a nice testsuite btw. Feel free to comment on the code.

I've moved some of the hooks-code found in run-command.c to hooks.c

You can see the progress on gitlab: https://gitlab.com/waterkip/git
or on github: https://github.com/waterkip/git
The output of format-patch is down below.

I have some questions regarding the following two functions in run-command.c:
* run_hook_le
* run_hook_ve

What do the postfixes le and ve mean?

Cheers,
Wesley

format-patch:

>From 129d8aff8257b22210beadc155cdbcae99b0fc4b Mon Sep 17 00:00:00 2001
From: Wesley Schwengle 
Date: Sun, 2 Sep 2018 02:40:04 +0200
Subject: [PATCH] WIP: Add hook.d support in git

Add a generic mechanism to find and execute one or multiple hooks found
in $GIT_DIR/hooks/ and/or $GIT_DIR/hooks/.d/*

The API is as follows:

#include "hooks.h"

array hooks   = find_hooks('pre-receive');
int hooks_ran = run_hooks(hooks);

The implemented behaviour is:

* If we find a hooks/.d directory and the hooks.multiHook flag isn't
  set we make use of that directory.

* If we find a hooks/.d and we also have hooks/ and the
  hooks.multiHook isn't set or set to false we don't use the hook.d
  directory. If the hook isn't set we issue a warning to the user
  telling him/her that we support multiple hooks via the .d directory
  structure

* If the hooks.multiHook is set to true we use the hooks/ and all
  the entries found in hooks/.d

* All the scripts are executed and fail on the first error
---
 Makefile   |   1 +
 TODO-hooks.md  |  38 
 builtin/am.c   |   4 +-
 builtin/commit.c   |   4 +-
 builtin/receive-pack.c |  10 +--
 builtin/worktree.c |   3 +-
 cache.h|   1 +
 config.c   |   5 ++
 environment.c  |   1 +
 hooks.c| 134 +
 hooks.h|  35 +++
 run-command.c  |  36 +--
 run-command.h  |   6 --
 sequencer.c|   7 ++-
 transport.c|   3 +-
 15 files changed, 237 insertions(+), 51 deletions(-)
 create mode 100644 TODO-hooks.md
 create mode 100644 hooks.c
 create mode 100644 hooks.h

diff --git a/Makefile b/Makefile
index 5a969f583..f5eddf1d7 100644
--- a/Makefile
+++ b/Makefile
@@ -810,6 +810,7 @@ LIB_H = $(shell $(FIND) . \
  -name Documentation -prune -o \
  -name '*.h' -print)

+LIB_OBJS += hooks.o
 LIB_OBJS += abspath.o
 LIB_OBJS += advice.o
 LIB_OBJS += alias.o
diff --git a/TODO-hooks.md b/TODO-hooks.md
new file mode 100644
index 0..13b15bffb
--- /dev/null
+++ b/TODO-hooks.md
@@ -0,0 +1,38 @@
+# All hooks
+# See Documentation/githooks.txt for more information about each and every hook
+# that git knows about
+commit-msg
+fsmoninor-watchman
+p4-pre-submit
+post-applypatch
+post-checkout
+post-commit
+post-merge
+post-receive
+post-rewrite
+post-update
+pre-applypatch
+pre-auto-gc
+pre-commit
+pre-push
+pre-rebase
+pre-receive
+prepare-commit-msg
+push-to-checkout
+sendemail-validate
+update
+
+# builtin/receive-pack.c
+feed_recieve_hook
+find_hook
+find_receive_hook
+push_to_checkout_hook
+receive_hook_feed_state
+run_and_feed_hook
+run_hook_le
+run_receive_hook
+run_update_hook
+
+
+# run-command.c
+find_hook
diff --git a/builtin/am.c b/builtin/am.c
index 9f7ecf6ec..d1276dd47 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -34,6 +34,8 @@
 #include "packfile.h"
 #include "repository.h"

+#include "hooks.h"
+
 /**
  * Returns 1 if the file is empty or does not exist, 0 otherwise.
  */
@@ -509,7 +511,7 @@ static int run_applypatch_msg_hook(struct am_state *state)
 static int run_post_rewrite_hook(const struct am_state *state)
 {
  struct child_process cp = CHILD_PROCESS_INIT;
- const char *hook = find_hook("post-rewrite");
+ const char *hook = find_hooks("post-rewrite");
  int ret;

  if (!hook)
diff --git a/builtin/commit.c b/builtin/commit.c
index 0d9828e29..389224d66 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -34,6 +34,8 @@
 #include "mailmap.h"
 #include "help.h"

+#include "hooks.h"
+
 static const char * const builtin_commit_usage[] = {
  N_("git commit [] [--] ..."),
  NULL
@@ -932,7 +934,7 @@ static int prepare_to_commit(const char
*index_file, const char *prefix,
  return 0;
  }

- if (!no_verify && find_hook("pre-commit")) {
+ if (!no_verify && find_hooks("pre-commit")) {
  /*
  * Re-read the index as pre-commit hook could have updated it,
  * and write it out as a tree.  We must do this before we invoke
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index c17ce94e1..dd10ef4af 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -28,6 +28,8 @@
 #include "object-store.h"
 #include "protocol.h"

+#include "hooks.h"
+
 static const char * const receive_pack_usage[] = {
  

Re: [ANNOUNCE] Git v2.19.0-rc0

2018-09-02 Thread Kaartic Sivaraam
On Thu, 2018-08-23 at 06:26 -0400, Derrick Stolee wrote:
> 
> Around the time that my proposed approaches were getting vetoed for 
> alignment issues, I figured I was out of my depth here. I reached out to 
> Daniel Lemire (of EWAH bitmap fame) on Twitter [1]. His blog is full of 
> posts of word-based approaches to different problems, so I thought he 
> might know something off the top of his head that would be applicable. 
> His conclusion (after looking only a short time) was to take a 'hasheq' 
> approach [2] like Peff suggested [3]. Since that requires auditing all 
> callers of hashcmp to see if hasheq is appropriate, it is not a good 
> solution for 2.19 but (in my opinion) should be evaluated as part of the 
> 2.20 cycle.
> 

That was an interesting blog post, indeed. It had an interesting
comments section. One comment especially caught my eyes was [a]:

"So the way gcc (and maybe clang) handles this is specifically by
recognizing memcmp and checking whether a only a 2-way result is needed
and then essentially replacing it with a memcmp_eq call.

..."

I find this to be an interesting note. It seems GCC does optimize when
we clearly indicate that we use the result of the memcmp as a boolean.
So would that help in anyway? Maybe it would help in writing a `hasheq`
method easily? I'm not sure.

> [1] https://twitter.com/stolee/status/1032312965754748930
> 
> [2] 
> https://lemire.me/blog/2018/08/22/avoid-lexicographical-comparisons-when-testing-for-string-equality/
> 
> [3] 
> https://public-inbox.org/git/20180822030344.ga14...@sigill.intra.peff.net/
> 
> [4] 
> https://public-inbox.org/git/7ea416cf-b043-1274-e161-85a8780b8...@gmail.com/


[a]: 
https://lemire.me/blog/2018/08/22/avoid-lexicographical-comparisons-when-testing-for-string-equality/#comment-344073

--
Sivaraam



[PATCH v2 1/1] read-cache.c: optimize reading index format v4

2018-09-02 Thread Nguyễn Thái Ngọc Duy
Index format v4 requires some more computation to assemble a path
based on a previous one. The current code is not very efficient
because

 - it doubles memory copy, we assemble the final path in a temporary
   first before putting it back to a cache_entry

 - strbuf_remove() in expand_name_field() is not exactly a good fit
   for stripping a part at the end, _setlen() would do the same job
   and is much cheaper.

 - the open-coded loop to find the end of the string in
   expand_name_field() can't beat an optimized strlen()

This patch avoids the temporary buffer and writes directly to the new
cache_entry, which addresses the first two points. The last point
could also be avoided if the total string length fits in the first 12
bits of ce_flags, if not we fall back to strlen().

Running "test-tool read-cache 100" on webkit.git (275k files), reading
v2 only takes 4.226 seconds, while v4 takes 5.711 seconds, 35% more
time. The patch reduces read time on v4 to 4.319 seconds.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 read-cache.c | 128 ---
 1 file changed, 60 insertions(+), 68 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 7b1354d759..8628d0f3a8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1713,63 +1713,24 @@ int read_index(struct index_state *istate)
return read_index_from(istate, get_index_file(), get_git_dir());
 }
 
-static struct cache_entry *cache_entry_from_ondisk(struct mem_pool *mem_pool,
-  struct ondisk_cache_entry 
*ondisk,
-  unsigned int flags,
-  const char *name,
-  size_t len)
-{
-   struct cache_entry *ce = mem_pool__ce_alloc(mem_pool, len);
-
-   ce->ce_stat_data.sd_ctime.sec = get_be32(>ctime.sec);
-   ce->ce_stat_data.sd_mtime.sec = get_be32(>mtime.sec);
-   ce->ce_stat_data.sd_ctime.nsec = get_be32(>ctime.nsec);
-   ce->ce_stat_data.sd_mtime.nsec = get_be32(>mtime.nsec);
-   ce->ce_stat_data.sd_dev   = get_be32(>dev);
-   ce->ce_stat_data.sd_ino   = get_be32(>ino);
-   ce->ce_mode  = get_be32(>mode);
-   ce->ce_stat_data.sd_uid   = get_be32(>uid);
-   ce->ce_stat_data.sd_gid   = get_be32(>gid);
-   ce->ce_stat_data.sd_size  = get_be32(>size);
-   ce->ce_flags = flags & ~CE_NAMEMASK;
-   ce->ce_namelen = len;
-   ce->index = 0;
-   hashcpy(ce->oid.hash, ondisk->sha1);
-   memcpy(ce->name, name, len);
-   ce->name[len] = '\0';
-   return ce;
-}
-
-/*
- * Adjacent cache entries tend to share the leading paths, so it makes
- * sense to only store the differences in later entries.  In the v4
- * on-disk format of the index, each on-disk cache entry stores the
- * number of bytes to be stripped from the end of the previous name,
- * and the bytes to append to the result, to come up with its name.
- */
-static unsigned long expand_name_field(struct strbuf *name, const char *cp_)
-{
-   const unsigned char *ep, *cp = (const unsigned char *)cp_;
-   size_t len = decode_varint();
-
-   if (name->len < len)
-   die("malformed name field in the index");
-   strbuf_remove(name, name->len - len, len);
-   for (ep = cp; *ep; ep++)
-   ; /* find the end */
-   strbuf_add(name, cp, ep - cp);
-   return (const char *)ep + 1 - cp_;
-}
-
-static struct cache_entry *create_from_disk(struct mem_pool *mem_pool,
+static struct cache_entry *create_from_disk(struct index_state *istate,
struct ondisk_cache_entry *ondisk,
unsigned long *ent_size,
-   struct strbuf *previous_name)
+   const struct cache_entry 
*previous_ce)
 {
struct cache_entry *ce;
size_t len;
const char *name;
unsigned int flags;
+   size_t copy_len;
+   /*
+* Adjacent cache entries tend to share the leading paths, so it makes
+* sense to only store the differences in later entries.  In the v4
+* on-disk format of the index, each on-disk cache entry stores the
+* number of bytes to be stripped from the end of the previous name,
+* and the bytes to append to the result, to come up with its name.
+*/
+   int expand_name_field = istate->version == 4;
 
/* On-disk flags are just 16 bits */
flags = get_be16(>flags);
@@ -1789,21 +1750,54 @@ static struct cache_entry *create_from_disk(struct 
mem_pool *mem_pool,
else
name = ondisk->name;
 
-   if (!previous_name) {
-   /* v3 and earlier */
-   if (len == CE_NAMEMASK)
-   len = strlen(name);
-   ce = cache_entry_from_ondisk(mem_pool, ondisk, flags, name, 
len);
+  

[PATCH v2 0/1] optimize reading index format v4

2018-09-02 Thread Nguyễn Thái Ngọc Duy
v2 removes unrelated changes and the dummy_entry. strip_len is also
replaced with copy_len to reduce repeated subtraction calculation.
Diff: 

diff --git a/read-cache.c b/read-cache.c
index 5c04c8f200..8628d0f3a8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1713,7 +1713,7 @@ int read_index(struct index_state *istate)
return read_index_from(istate, get_index_file(), get_git_dir());
 }
 
-static struct cache_entry *create_from_disk(struct mem_pool *mem_pool,
+static struct cache_entry *create_from_disk(struct index_state *istate,
struct ondisk_cache_entry *ondisk,
unsigned long *ent_size,
const struct cache_entry 
*previous_ce)
@@ -1722,7 +1722,15 @@ static struct cache_entry *create_from_disk(struct 
mem_pool *mem_pool,
size_t len;
const char *name;
unsigned int flags;
-   size_t strip_len;
+   size_t copy_len;
+   /*
+* Adjacent cache entries tend to share the leading paths, so it makes
+* sense to only store the differences in later entries.  In the v4
+* on-disk format of the index, each on-disk cache entry stores the
+* number of bytes to be stripped from the end of the previous name,
+* and the bytes to append to the result, to come up with its name.
+*/
+   int expand_name_field = istate->version == 4;
 
/* On-disk flags are just 16 bits */
flags = get_be16(>flags);
@@ -1735,37 +1743,37 @@ static struct cache_entry *create_from_disk(struct 
mem_pool *mem_pool,
extended_flags = get_be16(>flags2) << 16;
/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
if (extended_flags & ~CE_EXTENDED_FLAGS)
-   die(_("unknown index entry format %08x"), 
extended_flags);
+   die("Unknown index entry format %08x", extended_flags);
flags |= extended_flags;
name = ondisk2->name;
}
else
name = ondisk->name;
 
-   /*
-* Adjacent cache entries tend to share the leading paths, so it makes
-* sense to only store the differences in later entries.  In the v4
-* on-disk format of the index, each on-disk cache entry stores the
-* number of bytes to be stripped from the end of the previous name,
-* and the bytes to append to the result, to come up with its name.
-*/
-   if (previous_ce) {
+   if (expand_name_field) {
const unsigned char *cp = (const unsigned char *)name;
+   size_t strip_len, previous_len;
 
+   previous_len = previous_ce ? previous_ce->ce_namelen : 0;
strip_len = decode_varint();
-   if (previous_ce->ce_namelen < strip_len)
-   die(_("malformed name field in the index, path '%s'"),
-   previous_ce->name);
+   if (previous_len < strip_len) {
+   if (previous_ce)
+   die(_("malformed name field in the index, near 
path '%s'"),
+   previous_ce->name);
+   else
+   die(_("malformed name field in the index in the 
first path"));
+   }
+   copy_len = previous_len - strip_len;
name = (const char *)cp;
}
 
if (len == CE_NAMEMASK) {
len = strlen(name);
-   if (previous_ce)
-   len += previous_ce->ce_namelen - strip_len;
+   if (expand_name_field)
+   len += copy_len;
}
 
-   ce = mem_pool__ce_alloc(mem_pool, len);
+   ce = mem_pool__ce_alloc(istate->ce_mem_pool, len);
 
ce->ce_stat_data.sd_ctime.sec = get_be32(>ctime.sec);
ce->ce_stat_data.sd_mtime.sec = get_be32(>mtime.sec);
@@ -1782,9 +1790,9 @@ static struct cache_entry *create_from_disk(struct 
mem_pool *mem_pool,
ce->index = 0;
hashcpy(ce->oid.hash, ondisk->sha1);
 
-   if (previous_ce) {
-   size_t copy_len = previous_ce->ce_namelen - strip_len;
-   memcpy(ce->name, previous_ce->name, copy_len);
+   if (expand_name_field) {
+   if (copy_len)
+   memcpy(ce->name, previous_ce->name, copy_len);
memcpy(ce->name + copy_len, name, len + 1 - copy_len);
*ent_size = (name - ((char *)ondisk)) + len + 1 - copy_len;
} else {
@@ -1885,7 +1893,6 @@ int do_read_index(struct index_state *istate, const char 
*path, int must_exist)
void *mmap;
size_t mmap_size;
const struct cache_entry *previous_ce = NULL;
-   struct cache_entry *dummy_entry = NULL;
 
if (istate->initialized)
return istate->cache_nr;
@@ -1923,7 +1930,6 @@ int 

Re: non-smooth progress indication for git fsck and git gc

2018-09-02 Thread Jeff King
On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote:

> I still think the more interesting long-term thing here is to reuse the
> pack verification from index-pack, which actually hashes as it does the
> per-object countup.
> 
> That code isn't lib-ified enough to be run in process, but I think the
> patch below should give similar behavior to what fsck currently does.
> We'd need to tell index-pack to use our fsck.* config for its checks, I
> imagine. The progress here is still per-pack, but I think we could pass
> in sufficient information to have it do one continuous meter across all
> of the packs (see the in-code comment).
> 
> And it makes the result multi-threaded, and lets us drop a bunch of
> duplicate code.

Here's a little polish on that to pass enough progress data to
index-pack to let it have one nice continuous meter. I'm not sure if
it's worth all the hackery or not. The dual-meter that index-pack
generally uses is actually more informative (since it meters the initial
pass through the pack and the delta reconstruction separately).

And there are definitely a few nasty bits (like the way the progress is
ended). I'm not planning on taking this further for now, but maybe
you or somebody can find it interesting or useful.

---
 builtin/fsck.c   |  59 +--
 builtin/index-pack.c |  43 ++-
 pack-check.c | 142 ---
 pack.h   |   1 -
 4 files changed, 75 insertions(+), 170 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 250f5af118..2d774ea2e5 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, 
unsigned long size)
return err;
 }
 
-static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
-  unsigned long size, void *buffer, int *eaten)
-{
-   /*
-* Note, buffer may be NULL if type is OBJ_BLOB. See
-* verify_packfile(), data_valid variable for details.
-*/
-   struct object *obj;
-   obj = parse_object_buffer(the_repository, oid, type, size, buffer,
- eaten);
-   if (!obj) {
-   errors_found |= ERROR_OBJECT;
-   return error("%s: object corrupt or missing", oid_to_hex(oid));
-   }
-   obj->flags &= ~(REACHABLE | SEEN);
-   obj->flags |= HAS_OBJ;
-   return fsck_obj(obj, buffer, size);
-}
-
 static int default_refs;
 
 static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
@@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct 
object_id *oid,
return 0;
 }
 
+static int verify_pack(struct packed_git *p,
+  unsigned long count, unsigned long total)
+{
+   struct child_process index_pack = CHILD_PROCESS_INIT;
+
+   if (is_pack_valid(p) < 0)
+   return -1;
+   for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
+
+   index_pack.git_cmd = 1;
+   argv_array_pushl(_pack.args,
+"index-pack",
+"--verify-fsck",
+NULL);
+   if (show_progress)
+   argv_array_pushf(_pack.args,
+"--fsck-progress=%lu,%lu,Checking pack %s",
+count, total, sha1_to_hex(p->sha1));
+   argv_array_push(_pack.args, p->pack_name);
+
+   if (run_command(_pack))
+   return -1;
+
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
if (check_full) {
struct packed_git *p;
uint32_t total = 0, count = 0;
-   struct progress *progress = NULL;
 
if (show_progress) {
for (p = get_packed_git(the_repository); p;
@@ -746,18 +752,21 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
continue;
total += p->num_objects;
}
-
-   progress = start_progress(_("Checking 
objects"), total);
}
for (p = get_packed_git(the_repository); p;
 p = p->next) {
/* verify gives error messages itself */
-   if (verify_pack(p, fsck_obj_buffer,
-   progress, count))
+   if (verify_pack(p, count, total))
errors_found |= ERROR_PACK;
count += p->num_objects;
}
-   stop_progress();
+   /*
+ 

Re: Git in Outreachy Dec-Mar?

2018-09-02 Thread Jeff King
On Sun, Sep 02, 2018 at 09:37:59AM +0200, Christian Couder wrote:

> > I also think it doesn't need to be the mentor's responsibility to find
> > the funding. That can be up to an "org admin", and I don't think it
> > should be too big a deal (I had no trouble getting funding from GitHub
> > last year, and I don't expect any this year; I just didn't want to start
> > that process until I knew we were serious about participating).
> 
> My experience so far with org admins who don't mentor is that they are
> likely to loose interest in the program over time and stop doing much
> (which is natural, I don't blame anyone). This is what happened with
> GSoC org admins (who don't mentor), so most of the admin work now
> falls back on mentors (org admins that mentor).
> 
> That's why I fear that in a few years the burden of finding funds for
> Outreachy might fall back on the mentors too.

Yeah, I agree that might eventually happen. I think if there are admins
willing to look for funds, though, we are better off saving our project
money for now.

-Peff


Re: [BUG] index corruption with git commit -p

2018-09-02 Thread Jeff King
On Sun, Sep 02, 2018 at 09:53:53AM +0200, Luc Van Oostenryck wrote:

> > At any rate, I think this perfectly explains the behavior we're seeing.
> 
> Yes, this certainly make sense.
> 
> For fun (and testing) I tried to take the problem in the other direction:
> * why hasn't this be detected earlier (I'm reasonably be sure I did the
>   same operation a few times already without seeing the corruption)?
> * how easy it is to reproduce the problem?
> * Is it enough to do an interactive commit with nothing needing interactive?
> 
> So I tried the following and some variants:
>   > for i in $(seq -w 0 100); do echo $i > f$i; done
>   > git add f* && git commit -m add f* && git rm -q f* && git commit -m rm -p
> 
> but I didn't succeed to recreate the problem. So I'm still wondering
> what is special in my repository & tree that trigger the corruption.

I think the partial deletion is important, because it means that the
resulting index is going to be smaller. And the truncation problem only
matters when we go from a larger file to a smaller one (since otherwise
overwrite the old content completely).

And it doesn't seem to trigger without the interactive "-p". I think
that's not directly related, but it somehow triggers the case where we
actually need to update the cache tree in this particular order.

That's pretty hand-wavy, but I think it gives a sense of why most people
didn't run into this. I do wish I understood better what it would take
to create a minimal example for the test suite.

-Peff


Re: [PATCH] bisect.c: make show_list() build again

2018-09-02 Thread Christian Couder
On Sun, Sep 2, 2018 at 9:42 AM, Nguyễn Thái Ngọc Duy  wrote:

> In order to stop this from happening again, the function is now
> compiled unconditionally but exits early unless DEBUG_BISECT is
> non-zero.

Thanks for going the extra mile and doing this!

I wonder if we should also try to make the show_list() function part
of the trace_*() functions to make it even more regular. This can be a
separate patch or topic though.


Re: non-smooth progress indication for git fsck and git gc

2018-09-02 Thread Jeff King
On Sun, Sep 02, 2018 at 03:46:57AM -0400, Jeff King wrote:

> Something like this, which chunks it there, uses a per-packfile meter
> (though still does not give any clue how many packfiles there are), and
> shows a throughput meter.

Actually, in typical cases it would not matter how many packfiles there
are, because there's generally one big one, and then none of the small
ones would take long enough to trigger the delayed meter. So you'd only
see one such meter usually, and it would cover the majority of the time.
In theory, anyway.

I still think the more interesting long-term thing here is to reuse the
pack verification from index-pack, which actually hashes as it does the
per-object countup.

That code isn't lib-ified enough to be run in process, but I think the
patch below should give similar behavior to what fsck currently does.
We'd need to tell index-pack to use our fsck.* config for its checks, I
imagine. The progress here is still per-pack, but I think we could pass
in sufficient information to have it do one continuous meter across all
of the packs (see the in-code comment).

And it makes the result multi-threaded, and lets us drop a bunch of
duplicate code.

---
 builtin/fsck.c |  53 +++--
 pack-check.c   | 142 ---
 pack.h |   1 -
 3 files changed, 32 insertions(+), 164 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 250f5af118..643e16125b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, 
unsigned long size)
return err;
 }
 
-static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
-  unsigned long size, void *buffer, int *eaten)
-{
-   /*
-* Note, buffer may be NULL if type is OBJ_BLOB. See
-* verify_packfile(), data_valid variable for details.
-*/
-   struct object *obj;
-   obj = parse_object_buffer(the_repository, oid, type, size, buffer,
- eaten);
-   if (!obj) {
-   errors_found |= ERROR_OBJECT;
-   return error("%s: object corrupt or missing", oid_to_hex(oid));
-   }
-   obj->flags &= ~(REACHABLE | SEEN);
-   obj->flags |= HAS_OBJ;
-   return fsck_obj(obj, buffer, size);
-}
-
 static int default_refs;
 
 static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
@@ -662,6 +643,37 @@ static int mark_packed_for_connectivity(const struct 
object_id *oid,
return 0;
 }
 
+static int verify_pack(struct packed_git *p)
+{
+   struct child_process index_pack = CHILD_PROCESS_INIT;
+
+   for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
+
+   /*
+* This should probably be replaced with a command-line option
+* that teaches "index-pack --verify" to show a more compact and
+* fsck-oriented progress (see also the "-v" below).
+*/
+   if (show_progress)
+   fprintf(stderr, "Checking packfile %s...\n",
+   p->pack_name);
+
+   index_pack.git_cmd = 1;
+   argv_array_pushl(_pack.args,
+"index-pack",
+"--verify",
+"--strict",
+NULL);
+   if (show_progress)
+   argv_array_push(_pack.args, "-v");
+   argv_array_push(_pack.args, p->pack_name);
+
+   if (run_command(_pack))
+   return -1;
+
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -752,8 +764,7 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
for (p = get_packed_git(the_repository); p;
 p = p->next) {
/* verify gives error messages itself */
-   if (verify_pack(p, fsck_obj_buffer,
-   progress, count))
+   if (verify_pack(p))
errors_found |= ERROR_PACK;
count += p->num_objects;
}
diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..ea1457ce53 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -15,17 +15,6 @@ struct idx_entry {
unsigned int nr;
 };
 
-static int compare_entries(const void *e1, const void *e2)
-{
-   const struct idx_entry *entry1 = e1;
-   const struct idx_entry *entry2 = e2;
-   if (entry1->offset < entry2->offset)
-   return -1;
-   if (entry1->offset > entry2->offset)
-   return 1;
-   return 0;
-}
-
 int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
   off_t offset, off_t len, unsigned int nr)
 {
@@ -48,121 +37,6 @@ int check_pack_crc(struct packed_git *p, struct pack_window 
**w_curs,
return data_crc != 

Re: [BUG] index corruption with git commit -p

2018-09-02 Thread Luc Van Oostenryck
On Sun, Sep 02, 2018 at 03:24:09AM -0400, Jeff King wrote:
> On Sun, Sep 02, 2018 at 09:12:04AM +0200, Duy Nguyen wrote:
> > diff --git a/builtin/commit.c b/builtin/commit.c
> > index 2be7bdb331..60f30b3780 100644
> > --- a/builtin/commit.c
> > +++ b/builtin/commit.c
> > @@ -432,6 +432,7 @@ static const char *prepare_index(int argc, const char 
> > **argv, const char *prefix
> > if (update_main_cache_tree(WRITE_TREE_SILENT) == 0) {
> > if (reopen_lock_file(_lock) < 0)
> > die(_("unable to write index file"));
> > +   ftruncate(index_lock.tempfile->fd, 0);
> > if (write_locked_index(_index, _lock, 0))
> > die(_("unable to update temporary index"));
> > } else
> 
> Doh, of course. I even thought about this issue and dug all the way into
> reopen_lock_file(), but for some reason temporarily forgot that O_WRONLY
> does not imply O_TRUNC.
> 
> Arguably this should be the default for reopen_lockfile(), as getting a
> write pointer into an existing file is not ever going to be useful for
> the way Git uses lockfiles. Opening with O_APPEND could conceivably be
> useful, but it's pretty unlikely (and certainly not helpful here, and
> this is the only caller). Alternatively, the function should just take
> open(2) flags.
> 
> At any rate, I think this perfectly explains the behavior we're seeing.

Yes, this certainly make sense.

For fun (and testing) I tried to take the problem in the other direction:
* why hasn't this be detected earlier (I'm reasonably be sure I did the
  same operation a few times already without seeing the corruption)?
* how easy it is to reproduce the problem?
* Is it enough to do an interactive commit with nothing needing interactive?

So I tried the following and some variants:
  > for i in $(seq -w 0 100); do echo $i > f$i; done
  > git add f* && git commit -m add f* && git rm -q f* && git commit -m rm -p

but I didn't succeed to recreate the problem. So I'm still wondering
what is special in my repository & tree that trigger the corruption.

Anyway, thanks to al of you for looking at this so quickly.

-- Luc


Re: non-smooth progress indication for git fsck and git gc

2018-09-02 Thread Jeff King
On Sat, Sep 01, 2018 at 02:53:28PM +0200, Ævar Arnfjörð Bjarmason wrote:

> With this we'll get output like:
> 
> $ ~/g/git/git  -C ~/g/2015-04-03-1M-git/  --exec-path=$PWD fsck
> Checking object directories: 100% (256/256), done.
> Hashing: 100% (452634108/452634108), done.
> Hashing: 100% (1073741824/1073741824), done.
> Hashing: 100% (1073741824/1073741824), done.
> Hashing: 100% (1008001572/1008001572), done.
> Checking objects:   2% (262144/13064614)
> ^C
> 
> All tests pass with this. Isn't it awesome? Except it's of course a
> massive hack, we wouldn't want to just hook into SHA1DC like this.

I still consider that output so-so; the byte counts are big and there's
no indication how many "hashing" lines we're going to see. It's also
broken up in a weird way (it's not one per file; it's one per giant
chunk we fed to sha1).

> The problem comes down to us needing to call git_hash_sha1_update() with
> some really huge input, that function is going to take a *long* time,
> and the only way we're getting incremental progress is:
> 
>  1) If we ourselves split the input into N chunks
>  2) If we hack into the SHA1 library itself
> 
> This patch does #2, but for this to be acceptable we'd need to do
> something like #1.

I think we could just do the chunking in verify_packfile(), couldn't we?
(And the .idx hash, if we really want to cover that case, but IMHO
that's way less interesting).

Something like this, which chunks it there, uses a per-packfile meter
(though still does not give any clue how many packfiles there are), and
shows a throughput meter.

diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..c94223664f 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -62,10 +62,25 @@ static int verify_packfile(struct packed_git *p,
uint32_t nr_objects, i;
int err = 0;
struct idx_entry *entries;
+   struct progress *hashing_progress;
+   char *title;
+   off_t total_hashed = 0;
 
if (!is_pack_valid(p))
return error("packfile %s cannot be accessed", p->pack_name);
 
+   if (progress) {
+   /* Probably too long... */
+   title = xstrfmt("Hashing %s", p->pack_name);
+
+   /*
+* I don't think it actually works to have two progresses going
+* at the same time, because when the first one ends, we'll
+* cancel the alarm. But hey, this is a hacky proof of concept.
+*/
+   hashing_progress = start_progress(title, 0);
+   }
+
the_hash_algo->init_fn();
do {
unsigned long remaining;
@@ -75,9 +90,25 @@ static int verify_packfile(struct packed_git *p,
pack_sig_ofs = p->pack_size - the_hash_algo->rawsz;
if (offset > pack_sig_ofs)
remaining -= (unsigned int)(offset - pack_sig_ofs);
-   the_hash_algo->update_fn(, in, remaining);
+   while (remaining) {
+   int chunk = remaining < 4096 ? remaining : 4096;
+   the_hash_algo->update_fn(, in, chunk);
+   in += chunk;
+   remaining -= chunk;
+   total_hashed += chunk;
+   /*
+* The progress code needs tweaking to show throughputs
+* better for open-ended meters.
+*/
+   display_throughput(hashing_progress, total_hashed);
+   display_progress(hashing_progress, 0);
+   }
} while (offset < pack_sig_ofs);
+
the_hash_algo->final_fn(hash, );
+   stop_progress(_progress);
+   free(title);
+
pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
if (hashcmp(hash, pack_sig))
err = error("%s pack checksum mismatch",


[PATCH] bisect.c: make show_list() build again

2018-09-02 Thread Nguyễn Thái Ngọc Duy
This function only compiles when DEBUG_BISECT is 1, which is often not
the case. As a result there are two commits [1] [2] that break it but
the breakages went unnoticed because the code did not compile by
default. Update the function and include the new header file to make this
function build again.

In order to stop this from happening again, the function is now
compiled unconditionally but exits early unless DEBUG_BISECT is
non-zero. A smart compiler generates no extra code (not even a
function call). But even if it does not, this function does not seem
to be in a hot path that the extra cost becomes a big problem.

[1] bb408ac95d (bisect.c: use commit-slab for commit weight instead of
commit->util - 2018-05-19)

[2] cbd53a2193 (object-store: move object access functions to
object-store.h - 2018-05-15)

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 bisect.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/bisect.c b/bisect.c
index e1275ba79e..e65f6821b8 100644
--- a/bisect.c
+++ b/bisect.c
@@ -13,6 +13,7 @@
 #include "sha1-array.h"
 #include "argv-array.h"
 #include "commit-slab.h"
+#include "object-store.h"
 
 static struct oid_array good_revs;
 static struct oid_array skipped_revs;
@@ -120,14 +121,14 @@ static inline int halfway(struct commit_list *p, int nr)
}
 }
 
-#if !DEBUG_BISECT
-#define show_list(a,b,c,d) do { ; } while (0)
-#else
 static void show_list(const char *debug, int counted, int nr,
  struct commit_list *list)
 {
struct commit_list *p;
 
+   if (!DEBUG_BISECT)
+   return;
+
fprintf(stderr, "%s (%d/%d)\n", debug, counted, nr);
 
for (p = list; p; p = p->next) {
@@ -145,7 +146,7 @@ static void show_list(const char *debug, int counted, int 
nr,
(flags & TREESAME) ? ' ' : 'T',
(flags & UNINTERESTING) ? 'U' : ' ',
(flags & COUNTED) ? 'C' : ' ');
-   if (commit->util)
+   if (*commit_weight_at(_weight, p->item))
fprintf(stderr, "%3d", weight(p));
else
fprintf(stderr, "---");
@@ -160,7 +161,6 @@ static void show_list(const char *debug, int counted, int 
nr,
fprintf(stderr, "\n");
}
 }
-#endif /* DEBUG_BISECT */
 
 static struct commit_list *best_bisection(struct commit_list *list, int nr)
 {
-- 
2.19.0.rc0.337.ge906d732e7



Re: Git in Outreachy Dec-Mar?

2018-09-02 Thread Christian Couder
On Sat, Sep 1, 2018 at 10:43 AM, Jeff King  wrote:
> On Fri, Aug 31, 2018 at 10:16:49AM +0200, Christian Couder wrote:

>> I can also look at getting outside funds.
>>
>> My opinion though is that it is probably better if the Git project can
>> use its own fund for this, as it makes it easier for possible mentors
>> if they don't need to look at getting outside funds.
>
> I disagree. An internship costs more than we generally take in over the
> course of a year. So we would eventually run out of money doing this.

I think we would have time to figure out a way to get more funds
before that happens.

> I also think it doesn't need to be the mentor's responsibility to find
> the funding. That can be up to an "org admin", and I don't think it
> should be too big a deal (I had no trouble getting funding from GitHub
> last year, and I don't expect any this year; I just didn't want to start
> that process until I knew we were serious about participating).

My experience so far with org admins who don't mentor is that they are
likely to loose interest in the program over time and stop doing much
(which is natural, I don't blame anyone). This is what happened with
GSoC org admins (who don't mentor), so most of the admin work now
falls back on mentors (org admins that mentor).

That's why I fear that in a few years the burden of finding funds for
Outreachy might fall back on the mentors too.

> So if you (or anybody else) wants to mentor, please focus on the project
> list and application materials.

Ok, I will do that. Thanks for taking care of the funding.


Re: [PATCH v6] Implement --first-parent for git rev-list --bisect

2018-09-02 Thread Duy Nguyen
On Tue, Aug 28, 2018 at 6:45 PM Junio C Hamano  wrote:
> > @@ -146,10 +147,14 @@ static void show_list(const char *debug, int counted, 
> > int nr,
>
> An unrelated tangent, but I think I just spotted a bug in the
> existing code on the line immediately before this hunk, which reads
>
> if (commit->util)
> fprintf(stderr, "%3d", weight(p));
>
> I think this was a bug introduced at bb408ac9 ("bisect.c: use
> commit-slab for commit weight instead of commit->util", 2018-05-19)
> where the internal implementation of weight() was changed not to
> touch commit->util but instead to use a separate commit-slab storage
>
> Looking at the code before that conversion, it seems that we were
> using ->util to store a pointer to an integer, so we had the ability
> to differenciate non-negative weight (i.e. weight already computed
> for the commit), negative weight (i.e. not computed yet, but will
> be), and commits to which the concept of weight is not applicable.
> When we went to the commit-slab with the change, we lost the ability
> to represent the third case.  I am offhand not sure what the best
> remedy would be.  Perhaps stuff a so-far unused value like -3 to the
> weight() and use weight(p) == -3 instead of the old !commit->util or
> something like that?

Hmm.. no? the commit-slab stores the pointer to the weight, not the
weight itself, so we still have the ability to check the third case, I
think.

> (Duy CC'ed to help checking my sanity on this point).
>
> In any case, this is an existing bug in a debug helper, and the
> focus of this patch is not about fixing that bug, so you can and
> should leave it as-is, until this patch successfully adds the
> "bisection following only the first parent" feature.

Yes. I'll post a patch soon to fix this "commit->util" leftover.
-- 
Duy


Re: [BUG] index corruption with git commit -p

2018-09-02 Thread Jeff King
On Sun, Sep 02, 2018 at 09:12:04AM +0200, Duy Nguyen wrote:

> > diff --git a/builtin/commit.c b/builtin/commit.c
> > index 0d9828e29e..779c5e2cb5 100644
> > --- a/builtin/commit.c
> > +++ b/builtin/commit.c
> > @@ -359,13 +359,6 @@ static const char *prepare_index(int argc, const char 
> > **argv, const char *prefix
> >  
> > discard_cache();
> > read_cache_from(get_lock_file_path(_lock));
> > -   if (update_main_cache_tree(WRITE_TREE_SILENT) == 0) {
> > -   if (reopen_lock_file(_lock) < 0)
> > -   die(_("unable to write index file"));
> > -   if (write_locked_index(_index, _lock, 0))
> > -   die(_("unable to update temporary index"));
> > -   } else
> > -   warning(_("Failed to update main cache tree"));
> >
> 
> Narrowing down to this does help. This patch seems to fix it to me. I
> guess we have some leftover from the interactive add that should not
> be there after we have written the new index.
> 
> diff --git a/builtin/commit.c b/builtin/commit.c
> index 2be7bdb331..60f30b3780 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -432,6 +432,7 @@ static const char *prepare_index(int argc, const char 
> **argv, const char *prefix
>   if (update_main_cache_tree(WRITE_TREE_SILENT) == 0) {
>   if (reopen_lock_file(_lock) < 0)
>   die(_("unable to write index file"));
> + ftruncate(index_lock.tempfile->fd, 0);
>   if (write_locked_index(_index, _lock, 0))
>   die(_("unable to update temporary index"));
>   } else

Doh, of course. I even thought about this issue and dug all the way into
reopen_lock_file(), but for some reason temporarily forgot that O_WRONLY
does not imply O_TRUNC.

Arguably this should be the default for reopen_lockfile(), as getting a
write pointer into an existing file is not ever going to be useful for
the way Git uses lockfiles. Opening with O_APPEND could conceivably be
useful, but it's pretty unlikely (and certainly not helpful here, and
this is the only caller). Alternatively, the function should just take
open(2) flags.

At any rate, I think this perfectly explains the behavior we're seeing.

-Peff


Re: [BUG] index corruption with git commit -p

2018-09-02 Thread Duy Nguyen
On Sun, Sep 02, 2018 at 01:08:03AM -0400, Jeff King wrote:
> On Sun, Sep 02, 2018 at 12:17:53AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> > > Here are the steps to reproduce it:
> > >   $ git clone git://github.com/lucvoo/sparse-dev.git 
> > >   $ cd 
> > >   $ git co index-corruption
> > >   $ git rm -r validation/ Documentation/
> > >   $ git commit -m  -p
> > >   $ git status
> > > error: index uses $?+? extension, which we do not understand
> > > fatal: index file corrupt
> > >
> > > The 'extension' pattern '$?+?', can vary a bit, sometimes
> > > it's just '', but always seems 4 chars.
> > > If the commit command doesn't use the '-p' flag, there is no
> > > problem. The repository itself is not corrupted, it's only
> > > the index. It happends with git 2.18.0 and 2.17.0
> > 
> > Yeah this is a bug, I didn't dig much but testing with this script down
> > to 2.8.0:
> > [...]
> > I found that the first bad commit was: 680ee550d7 ("commit: skip
> > discarding the index if there is no pre-commit hook", 2017-08-14)
> 
> I think it's much older than that. I set up my test repo like this:
> 
>   git clone git://github.com/lucvoo/sparse-dev.git
>   cd sparse-dev
>   git checkout --detach
> 
> and then bisected with this script:
> 
>   cd /path/to/sparse-dev
>   rm .git/index
>   git reset --hard index-corruption &&
>   git rm -q -r validation/ Documentation/ &&
>   git commit -qm foo -p &&
>   git status
> 
> Since a33fc72fe9 (read-cache: force_verify_index_checksum, 2017-04-14),
> that produces the corrupt extension error. But before that, I
> consistently get:
> 
>   error: bad index file sha1 signature
>   fatal: index file corrupt
> 
> from git-commit. And that bisects back to 9c4d6c0297 (cache-tree: Write
> updated cache-tree after commit, 2014-07-13).
> 
> If I revert that commit (which takes some untangling, see below), then
> the problem seems to go away. Here's the patch I tried on top of the
> current master, though I think it is actually the first hunk that is
> making the difference.
> 
> ---
> diff --git a/builtin/commit.c b/builtin/commit.c
> index 0d9828e29e..779c5e2cb5 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -359,13 +359,6 @@ static const char *prepare_index(int argc, const char 
> **argv, const char *prefix
>  
>   discard_cache();
>   read_cache_from(get_lock_file_path(_lock));
> - if (update_main_cache_tree(WRITE_TREE_SILENT) == 0) {
> - if (reopen_lock_file(_lock) < 0)
> - die(_("unable to write index file"));
> - if (write_locked_index(_index, _lock, 0))
> - die(_("unable to update temporary index"));
> - } else
> - warning(_("Failed to update main cache tree"));
>

Narrowing down to this does help. This patch seems to fix it to me. I
guess we have some leftover from the interactive add that should not
be there after we have written the new index.

diff --git a/builtin/commit.c b/builtin/commit.c
index 2be7bdb331..60f30b3780 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -432,6 +432,7 @@ static const char *prepare_index(int argc, const char 
**argv, const char *prefix
if (update_main_cache_tree(WRITE_TREE_SILENT) == 0) {
if (reopen_lock_file(_lock) < 0)
die(_("unable to write index file"));
+   ftruncate(index_lock.tempfile->fd, 0);
if (write_locked_index(_index, _lock, 0))
die(_("unable to update temporary index"));
} else


--
Duy