[PATCH 0/16] make prune mtime-checking more careful

2014-10-03 Thread Jeff King
At GitHub we've occasionally run across repos getting corrupted by trees
and blobs near the tip going missing. We do a lot of "test merges"
between branches and HEAD (this is what feeds the "OK to merge" button
on the web interface), and the objects are almost always related to
these merges. The objects are removed by prune, which doesn't realize
that they are part of an ongoing operation. Prune uses the filesystem
mtime to determine this, but we are not very thorough in making sure
that is kept up to date.

This series tries to fix that with two techniques:

  1. When we try to write an object to disk, we optimize out the write
 if we already have the object. Instead, we should still update the
 mtime of the object in this case.

  2. Treat objects reachable from "recent" objects as recent themselves.

 When we check that we have an object, we do not check whether we
 have all of the objects it can reach. If you have some new objects
 that refer to some old objects (e.g., you create and delete a tree
 on day 1, and then create a new tree referring to the blob on day
 2), then prune may delete the old object but not the new (in this
 case, we delete the blob but not the tree).

 Any subsequent use of the new object will check that we have it
 (e.g., commit-tree makes sure we have the tree we feed it), but not
 other objects it can reach. This can lead to referencing a
 half-formed part of the graph.

Note that this does not make prune race-free. For example, you could
check for and update the mtime of an object just as prune is deleting
it, and think that it is written when it is not. Fixing that would
require some atomic mechanism for prune to check the mtime and delete.
But I do think this series cuts us down to "real" race conditions, with
millisecond-ish timing. The problems we're fixing here are much worse
than that. The distance between an object being written and being
referred may operate on human timescales (e.g., writing a commit
message). Or the time distance between two objects that refer to each
other may be days or weeks; a prune where one falls in the "recent"
boundary and another does not can be disastrous.

There's quite a lot of patches here, but most of them are preparatory
cleanups. The meat is in patches 13, 15, and 16.

  [01/16]: foreach_alt_odb: propagate return value from callback
  [02/16]: isxdigit: cast input to unsigned char
  [03/16]: object_array: factor out slopbuf-freeing logic
  [04/16]: object_array: add a "clear" function
  [05/16]: clean up name allocation in prepare_revision_walk
  [06/16]: reachable: clear pending array after walking it
  [07/16]: t5304: use test_path_is_* instead of "test -f"
  [08/16]: t5304: use helper to report failure of "test foo = bar"
  [09/16]: prune: factor out loose-object directory traversal
  [10/16]: count-objects: do not use xsize_t when counting object size
  [11/16]: count-objects: use for_each_loose_file_in_objdir
  [12/16]: sha1_file: add for_each iterators for loose and packed objects
  [13/16]: prune: keep objects reachable from recent objects
  [14/16]: pack-objects: refactor unpack-unreachable expiration check
  [15/16]: pack-objects: match prune logic for discarding objects
  [16/16]: write_sha1_file: freshen existing objects

Note that these aren't yet running on GitHub servers. I know that they
fix real potential problems (see the new t6501 for examples), but I
don't know for sure if they will catch the problems we have seen. The
frequency of these issues is relatively rare, so even once deployed, we
won't know for sure until a few weeks or months have passed.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16] make prune mtime-checking more careful

2014-10-03 Thread Junio C Hamano
Jeff King  writes:

> ... The objects are removed by prune, which doesn't realize
> that they are part of an ongoing operation. Prune uses the filesystem
> mtime to determine this, but we are not very thorough in making sure
> that is kept up to date.

The whole series looked quite sensible.  Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16] make prune mtime-checking more careful

2014-10-04 Thread Junio C Hamano
Jeff King  writes:

> There's quite a lot of patches here, but most of them are preparatory
> cleanups. The meat is in patches 13, 15, and 16.
>
>   [01/16]: foreach_alt_odb: propagate return value from callback
>   [02/16]: isxdigit: cast input to unsigned char
>   [03/16]: object_array: factor out slopbuf-freeing logic
>   [04/16]: object_array: add a "clear" function
>   [05/16]: clean up name allocation in prepare_revision_walk
>   [06/16]: reachable: clear pending array after walking it
>   [07/16]: t5304: use test_path_is_* instead of "test -f"
>   [08/16]: t5304: use helper to report failure of "test foo = bar"
>   [09/16]: prune: factor out loose-object directory traversal
>   [10/16]: count-objects: do not use xsize_t when counting object size
>   [11/16]: count-objects: use for_each_loose_file_in_objdir
>   [12/16]: sha1_file: add for_each iterators for loose and packed objects
>   [13/16]: prune: keep objects reachable from recent objects
>   [14/16]: pack-objects: refactor unpack-unreachable expiration check
>   [15/16]: pack-objects: match prune logic for discarding objects
>   [16/16]: write_sha1_file: freshen existing objects

Somebody please help me out here.

This applied on top of 'maint' (which does have c40fdd01) makes the
test #9 (prune: do not prune detached HEAD with no reflog) fail.

If we merge 'dt/cache-tree-repair' (which in turn needs
'jc/reopen-lock-file') to 'maint' and then apply these on top, the
said test passes.  But I do not see an apparent reason why X-<.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16] make prune mtime-checking more careful

2014-10-05 Thread René Scharfe

Am 05.10.2014 um 00:22 schrieb Junio C Hamano:

Jeff King  writes:


There's quite a lot of patches here, but most of them are preparatory
cleanups. The meat is in patches 13, 15, and 16.

   [01/16]: foreach_alt_odb: propagate return value from callback
   [02/16]: isxdigit: cast input to unsigned char
   [03/16]: object_array: factor out slopbuf-freeing logic
   [04/16]: object_array: add a "clear" function
   [05/16]: clean up name allocation in prepare_revision_walk
   [06/16]: reachable: clear pending array after walking it
   [07/16]: t5304: use test_path_is_* instead of "test -f"
   [08/16]: t5304: use helper to report failure of "test foo = bar"
   [09/16]: prune: factor out loose-object directory traversal
   [10/16]: count-objects: do not use xsize_t when counting object size
   [11/16]: count-objects: use for_each_loose_file_in_objdir
   [12/16]: sha1_file: add for_each iterators for loose and packed objects
   [13/16]: prune: keep objects reachable from recent objects
   [14/16]: pack-objects: refactor unpack-unreachable expiration check
   [15/16]: pack-objects: match prune logic for discarding objects
   [16/16]: write_sha1_file: freshen existing objects


Somebody please help me out here.

This applied on top of 'maint' (which does have c40fdd01) makes the
test #9 (prune: do not prune detached HEAD with no reflog) fail.


The test passes if the return value of freshen_file() is negated in 
freshen_packed_object() (see my reply to patch 16).



If we merge 'dt/cache-tree-repair' (which in turn needs
'jc/reopen-lock-file') to 'maint' and then apply these on top, the
said test passes.  But I do not see an apparent reason why X-<.


Didn't check this interaction, but it sounds strange.

René

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16] make prune mtime-checking more careful

2014-10-05 Thread Jeff King
On Sat, Oct 04, 2014 at 03:22:10PM -0700, Junio C Hamano wrote:

> This applied on top of 'maint' (which does have c40fdd01) makes the
> test #9 (prune: do not prune detached HEAD with no reflog) fail.

I'll fix the bone-headed error returns that René noticed and double
check that they were the complete culprit in the test failure you saw
(and not just masking some other problem).

> If we merge 'dt/cache-tree-repair' (which in turn needs
> 'jc/reopen-lock-file') to 'maint' and then apply these on top, the
> said test passes.  But I do not see an apparent reason why X-<.

I suspect it's an unintended interaction that just happens to let my
bogus code code the right thing, but I'll double check on that, too.

Thanks both of you for spotting the problems.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16] make prune mtime-checking more careful

2014-10-08 Thread Jeff King
On Sun, Oct 05, 2014 at 09:42:49PM -0400, Jeff King wrote:

> On Sat, Oct 04, 2014 at 03:22:10PM -0700, Junio C Hamano wrote:
> 
> > This applied on top of 'maint' (which does have c40fdd01) makes the
> > test #9 (prune: do not prune detached HEAD with no reflog) fail.
> 
> I'll fix the bone-headed error returns that René noticed and double
> check that they were the complete culprit in the test failure you saw
> (and not just masking some other problem).

OK, I figured out what is going on. The short of it is that yes, it's
the return-value bug René noticed. Read on only if you are really
interested.  :)

The test runs "git prune" and intends to check that the detached HEAD
commit is not in the list. It does so by checking the whole output of
"git prune -n", which it expects to be empty (and it generally is,
because we've gc'd in previous tests and all objects were either packed
or pruned).

However, when the test fails, there is a pruned object. But it is not
the HEAD commit that we just wrote, but rather the tree that it points
to. When we run "git commit", it tries to write out the tree with
write_sha1_file. This in turn calls freshen_packed_object to check
whether we have the object packed. We do, but the return-value bug makes
us think we do not. As a result, we write out an extra copy of the tree
as a loose object, and that is what gets pruned (and not by prune's
normal is-it-old-and-unreachable code, but by its call to prune_packed).

So that explains that bug (as a side note, you might think that if we
are flipping return values, lots of things would fail when they ask "do
we have this packed object" and it erroneously says "yes". But that does
not happen. The wrong return value comes from freshening the file, so we
only flip "yes" to "no", and never the opposite).

> > If we merge 'dt/cache-tree-repair' (which in turn needs
> > 'jc/reopen-lock-file') to 'maint' and then apply these on top, the
> > said test passes.  But I do not see an apparent reason why X-<.

When dt/cache-tree-repair is merged, we have a valid cache tree when we
run "git commit", and we realize that we do not need to write out the
tree object at all. Thus we never hit the buggy code, the object isn't
created, and the subsequent prune reports that there is nothing to
prune.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16] make prune mtime-checking more careful

2014-10-08 Thread Junio C Hamano
Jeff King  writes:

> So that explains that bug (as a side note, you might think that if we
> are flipping return values, lots of things would fail when they ask "do
> we have this packed object" and it erroneously says "yes". But that does
> not happen. The wrong return value comes from freshening the file, so we
> only flip "yes" to "no", and never the opposite).
> ...
> When dt/cache-tree-repair is merged, we have a valid cache tree when we
> run "git commit", and we realize that we do not need to write out the
> tree object at all. Thus we never hit the buggy code, the object isn't
> created, and the subsequent prune reports that there is nothing to
> prune.

Wow, that is a tricky "bug" (rather the series with the bug failed
to fail when applied to 'master', so it is an "unbug" that hides a
bug) to hunt down.  Thanks for digging.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html