Re: [PATCH 3/6] fsck: prepare dummy objects for --connectivity-check

2017-01-17 Thread Jeff King
On Tue, Jan 17, 2017 at 01:15:57PM -0800, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > +static void mark_object_for_connectivity(const unsigned char *sha1)
> > +{
> > +   struct object *obj = lookup_object(sha1);
> > +
> > +   /*
> > +* Setting the object type here isn't strictly necessary here for a
> > +* connectivity check.
> 
> Drop one of these "here"s?

Yeah.

> > The cmd_fsck() part of the diff is pretty nasty without
> > "-b".
> 
> True.  I also wonder if swapping the if/else arms make the end
> result and the patch easier to read. i.e.
> 
> + if (connectivity_only) {
> + mark loose for connectivity;
> + mark packed for connectivity;
> + } else {
>   ... existing code comes here reindented ...
>   }
> 
> But the patch makes sense.

Yeah. It doesn't help the diff, but the end result is more readable. And
the conditional lacks a negation, which is nice. Will change.

> > +test_expect_success 'fsck --connectivity-only with explicit head' '
> > +   (
> > +   cd connectivity-only &&
> > +   test_must_fail git fsck --connectivity-only $tree
> > +   )
> > +'
> 
> Most of the earlier "tree=..." assignments are done in subshells,
> and it is not clear which tree this refers to.  Is this the one that
> was written in 'rev-list --verify-objects with bad sha1' that has
> been removed in its when-finished handler?

Doh. It was meant to be the one from the earlier --connectivity-only
check. But that is doubly silly, even, because the tree variable in that
case is holding the filename, not the sha1. The right thing is to call
rev-parse again, but of course that doesn't work because the repo has
been corrupted. :-/

Here's a re-roll with the two changes above (and the notice thing you
mentioned elsewhere), plus a test that actually checks the right thing
(fails before this commit and passes after).

If everything else looks good, you can queue it along with the rest.
Otherwise if I need a re-roll, it will be in the next version.

-- >8 --
Subject: fsck: prepare dummy objects for --connectivity-check

Normally fsck makes a pass over all objects to check their
integrity, and then follows up with a reachability check to
make sure we have all of the referenced objects (and to know
which ones are dangling). The latter checks for the HAS_OBJ
flag in obj->flags to see if we found the object in the
first pass.

Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`,
2015-06-22) taught fsck to skip the initial pass, and to
fallback to has_sha1_file() instead of the HAS_OBJ check.

However, it converted only one HAS_OBJ check to use
has_sha1_file(). But there are many other places in
builtin/fsck.c that assume that the flag is set (or that
lookup_object() will return an object at all). This leads to
several bugs with --connectivity-only:

  1. mark_object() will not queue objects for examination,
 so recursively following links from commits to trees,
 etc, did nothing. I.e., we were checking the
 reachability of hardly anything at all.

  2. When a set of heads is given on the command-line, we
 use lookup_object() to see if they exist. But without
 the initial pass, we assume nothing exists.

  3. When loading reflog entries, we do a similar
 lookup_object() check, and complain that the reflog is
 broken if the object doesn't exist in our hash.

So in short, --connectivity-only is broken pretty badly, and
will claim that your repository is fine when it's not.
Presumably nobody noticed for a few reasons.

One is that the embedded test does not actually test the
recursive nature of the reachability check. All of the
missing objects are still in the index, and we directly
check items from the index. This patch modifies the test to
delete the index, which shows off breakage (1).

Another is that --connectivity-only just skips the initial
pass for loose objects. So on a real repository, the packed
objects were still checked correctly. But on the flipside,
it means that "git fsck --connectivity-only" still checks
the sha1 of all of the packed objects, nullifying its
original purpose of being a faster git-fsck.

And of course the final problem is that the bug only shows
up when there _is_ corruption, which is rare. So anybody
running "git fsck --connectivity-only" proactively would
assume it was being thorough, when it was not.

One possibility for fixing this is to find all of the spots
that rely on HAS_OBJ and tweak them for the connectivity-only
case. But besides the risk that we might miss a spot (and I
found three already, corresponding to the three bugs above),
there are other parts of fsck that _can't_ work without a
full list of objects. E.g., the list of dangling objects.

Instead, let's make the connectivity-only case look more
like the normal case. Rather than skip the initial pass
completely, we'll do an abbreviated one that sets up the
HAS_OBJ flag for each object, without 

Re: [PATCH 3/6] fsck: prepare dummy objects for --connectivity-check

2017-01-17 Thread Junio C Hamano
Jeff King  writes:

> +static void mark_object_for_connectivity(const unsigned char *sha1)
> +{
> + struct object *obj = lookup_object(sha1);
> +
> + /*
> +  * Setting the object type here isn't strictly necessary here for a
> +  * connectivity check.

Drop one of these "here"s?


> The cmd_fsck() part of the diff is pretty nasty without
> "-b".

True.  I also wonder if swapping the if/else arms make the end
result and the patch easier to read. i.e.

+   if (connectivity_only) {
+   mark loose for connectivity;
+   mark packed for connectivity;
+   } else {
... existing code comes here reindented ...
}

But the patch makes sense.

> diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
> index e88ec7747..c1b2dda33 100755
> --- a/t/t1450-fsck.sh
> +++ b/t/t1450-fsck.sh
> @@ -523,9 +523,21 @@ test_expect_success 'fsck --connectivity-only' '
>   touch empty &&
>   git add empty &&
>   test_commit empty &&
> +
> + # Drop the index now; we want to be sure that we
> + # recursively notice that we notice the broken objects
> + # because they are reachable from refs, not because
> + # they are in the index.
> + rm -f .git/index &&
> +
> + # corrupt the blob, but in a way that we can still identify
> + # its type. That lets us see that --connectivity-only is
> + # not actually looking at the contents, but leaves it
> + # free to examine the type if it chooses.
>   empty=.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 &&
> - rm -f $empty &&
> - echo invalid >$empty &&
> + blob=$(echo unrelated | git hash-object -w --stdin) &&
> + mv $(sha1_file $blob) $empty &&
> +
>   test_must_fail git fsck --strict &&
>   git fsck --strict --connectivity-only &&
>   tree=$(git rev-parse HEAD:) &&
> @@ -537,6 +549,13 @@ test_expect_success 'fsck --connectivity-only' '
>   )
>  '
>  
> +test_expect_success 'fsck --connectivity-only with explicit head' '
> + (
> + cd connectivity-only &&
> + test_must_fail git fsck --connectivity-only $tree
> + )
> +'

Most of the earlier "tree=..." assignments are done in subshells,
and it is not clear which tree this refers to.  Is this the one that
was written in 'rev-list --verify-objects with bad sha1' that has
been removed in its when-finished handler?

>  remove_loose_object () {
>   sha1="$(git rev-parse "$1")" &&
>   remainder=${sha1#??} &&


Re: [PATCH 3/6] fsck: prepare dummy objects for --connectivity-check

2017-01-17 Thread Junio C Hamano
Jeff King  writes:

> + # Drop the index now; we want to be sure that we
> + # recursively notice that we notice the broken objects
> + # because they are reachable from refs, not because
> + # they are in the index.

Rephrase to reduce redundunt "notice"s?