On Tue, Feb 02, 2016 at 11:25:44AM -0800, Junio C Hamano wrote:

> Having said that, I do not necessarily think "git checkout" should
> revalidate the object name.  The repository that you use for your
> daily work would have the same error/corruption rate as your working
> tree files, and I do not think you would constantly "validate" what
> is in your working tree by comparing their contents with what you
> think ought to be there.

No, but there is a question of how much it costs to do so.

In the past, I've spent a lot of time trying to speed up object access,
but it was usually about replacing `parse_object` with
`lookup_object` or similar to avoid loading from disk in the first
place, as most of the time goes to zlib. If we are accessing the object
already (and obviously we have to for something like checkout), I'm not
sure what the marginal cost is of computing the sha1 on the data as it
passes through.

It might be significant, but I don't have numbers.

If it's not, then I think it's a nice feature that we would notice
problems earlier rather than later.
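
For what it's worth, here is a back-of-the-envelope sketch of the two
costs in question (plain Python, nothing to do with git's actual code;
the buffer size is arbitrary):

```python
import hashlib
import time
import zlib

# Rough comparison (not git code): how does SHA-1 over a buffer compare
# to inflating that same buffer with zlib? This is the "marginal cost"
# question: if we already pay for the inflate, what does hashing add?
data = b"x" * (8 * 1024 * 1024)          # 8 MB of sample data
deflated = zlib.compress(data)

t0 = time.perf_counter()
inflated = zlib.decompress(deflated)     # the cost we already pay
t_zlib = time.perf_counter() - t0

t0 = time.perf_counter()
digest = hashlib.sha1(inflated).hexdigest()  # the extra cost in question
t_sha1 = time.perf_counter() - t0

print("zlib inflate: %.4fs  sha1: %.4fs" % (t_zlib, t_sha1))
```

The ratio will vary with the data and the SHA-1 implementation, so treat
any single run as a data point, not an answer.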

> If you are working on extremely poor quality disks and SSDs, it
> might make sense to constantly revalidate the object data to catch
> corruption early, as that is something we can do (as opposed to the
> working tree files, where you probably do not have anything to
> catch bit-flipping).

Hopefully if your working tree files bit-flip, then you notice via `git
diff`. I guess you wouldn't for stat-clean ones. But if a bit flips in
the forest, and it's not stat-dirty enough for anyone to hear it, does
it make a sound?

> http://article.gmane.org/gmane.comp.version-control.git/283380 (not
> necessarily the entire thread, but that exact article) is a
> reasonable summary that illustrates the way how we view the object
> integrity.
> 
>     So "index-pack" is the enforcement point, and the rest of the
>     git commands generally assume that we can trust what is on disk
>     (as it has either been generated by us, or checked by
>     index-pack).  The rest of the commands do not spend time
>     checking that the on-disk contents are sane (though you can run
>     git-fsck if you want to do that).

I think that's me you're quoting. I was specifically talking about
malicious tampering there. That isn't to say I disagree with the
world-view you're proposing, but for random errors I think it's a
little more complicated. We certainly should be checking incoming data
(and we do).

For local repository operations, most of them are about reading data.
And there I generally favor performance over extra validation, with the
caveat that we should always be aware of the tradeoff. An extra
comparison to make sure we are not going out-of-bounds on a pack .idx
pointer is cheap. Loading a blob just to make sure its sha1 is valid
before we mention it in `diff --raw` output is stupid. Checking the sha1
on objects we are otherwise accessing is somewhere in between. :)
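
To make the cheap end of that spectrum concrete, this is the kind of
comparison I mean (sketched in Python with made-up names; the real
thing is C in the packfile reader):

```python
# Hypothetical sketch of the cheap check: an offset we read out of a
# pack .idx must land inside the mapped pack file before we follow it.
def checked_offset(offset, pack_size):
    if not 0 <= offset < pack_size:
        raise ValueError("pack offset %d out of bounds" % offset)
    return offset

assert checked_offset(1234, 4096) == 1234
```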

For local write operations, like repacking, we should err on the careful
side. And I think we do a good job of balancing performance and
validation there (e.g., we reuse deltas without reconstructing the
object, but _with_ a crc check on the delta data itself).
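
The delta-reuse check amounts to something like this (a Python
stand-in, not pack-objects itself; pack .idx v2 records a CRC-32 per
object precisely so that reuse can be verified this way):

```python
import zlib

# Sketch of checked delta reuse: before copying a stored delta verbatim
# into a new pack, verify it against the CRC-32 recorded in the .idx.
# On mismatch, the real code falls back to regenerating the object.
def reuse_delta(stored_bytes, recorded_crc):
    if zlib.crc32(stored_bytes) != recorded_crc:
        raise ValueError("pack entry CRC mismatch; recompute the delta")
    return stored_bytes  # safe to copy into the new pack as-is

delta = b"\x91\x05\x10example delta payload"
assert reuse_delta(delta, zlib.crc32(delta)) == delta
```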

> In fact, we do this, which is quite suboptimal:
> 
>         static int fsck_sha1(const unsigned char *sha1)
>         {
>                 struct object *obj = parse_object(sha1);
>                 if (!obj) {
>                         errors_found |= ERROR_OBJECT;
>                         return error("%s: object corrupt or missing",
>                                      sha1_to_hex(sha1));
>                 }
>                 obj->flags |= HAS_OBJ;
>                 return fsck_obj(obj);
>         }
> 
> This function is called for each loose object file we find in
> fsck_object_dir(), and there are a few problems:
> 
>  * The function parse_object() called from here would issue an error
>    message and returns NULL; then you get another "corrupt or
>    missing" error message, because this code cannot tell from the
>    NULL which is the case.
> 
>  * The intent of the callchain to fsck_sha1() is to iterate over
>    loose object files xx/x{38} and validate what is contained in
>    them, but that behaviour is not guaranteed because it calls
>    parse_object(), which may get the object data from a packfile
>    if the loose object is also in the packfile.
> 
> This function should instead take "path" (the only caller of this
> function fsck_loose() has it), read the data in the file, hash the
> data to validate that it matches "sha1" and then create the object
> out of that data it read by calling parse_object_buffer().

Yeah, I agree. I think as it is written, we also end up loading the
loose objects twice (once in parse_object, and then later again in
fsck_object to do the real work). Your solution would fix that, too.
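
In outline, what you're proposing looks like this (sketched in Python
rather than the C, with the loose-object format spelled out):

```python
import hashlib
import os
import tempfile
import zlib

# Sketch of the proposed fsck_sha1: read the loose object file directly
# at `path`, inflate it, hash the result, and check it against the name
# implied by the xx/x{38} layout, instead of going through parse_object
# (which might silently read the object from a packfile instead).
def check_loose_object(path):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())      # "<type> <size>\0<body>"
    head, tail = os.path.split(path)
    expected = os.path.basename(head) + tail  # reassemble xx + x{38}
    if hashlib.sha1(raw).hexdigest() != expected:
        raise ValueError("%s: corrupt loose object" % path)
    return raw  # the caller would hand this to parse_object_buffer()

# Demo: write a valid loose blob into a throwaway object directory.
body = b"hello\n"
raw = b"blob " + str(len(body)).encode() + b"\x00" + body
sha = hashlib.sha1(raw).hexdigest()
fan = os.path.join(tempfile.mkdtemp(), sha[:2])
os.makedirs(fan)
path = os.path.join(fan, sha[2:])
with open(path, "wb") as f:
    f.write(zlib.compress(raw))
assert check_loose_object(path) == raw
```

One read, one inflate, one hash, and the buffer is still in hand for
the real fsck work afterwards.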

It looks like we _don't_ load loose commits twice, though. We never turn
off save_commit_buffer, so we happily cache the buffer for every single
commit, even though we won't need it again after fsck_object returns.

I guess nobody has really noticed either issue, because repositories
large enough for it to make a difference will usually be packed.

-Peff