On Thu, Mar 2, 2017 at 10:37 AM, Jeff Hostetler <g...@jeffhostetler.com> wrote:
>>
>> Now, if your _file_ index is 300-400MB (and I do think we check the
>> SHA fingerprint on that even on just reading it - verify_hdr() in
>> do_read_index()), then that's going to be a somewhat noticeable hit on
>> every normal "git diff" etc.
>
> Yes, the .git/index is 450MB with ~3.1M entries.  verify_hdr() is called
> each time we read it into memory.

Ok. So that's really just a purely historical artifact.

The file index is actually the first part of git to have ever been
written. You can't even see it in the history, because the initial
revision from Apr 7, 2005, obviously depended on the actual object
hashing.

But the file index actually came first. You can _kind_ of see that in
the layout of the original git tree, and how the main header file is
still called "cache.h", and how the original ".git" directory was
actually called ".dircache".

And the two biggest files (by a fairly big margin) are "read-cache.c"
and "update-cache.c".

So that file index cache was in many ways _the_ central part of the
original git model. The sha1 file indexing and object database was
just the backing store for the file index.

But part of that history is then how much I worried about corruption
of that index (and, let's face it, general corruption resistance _was_
one of the primary design goals - performance was high up there too,
but safety in the face of filesystem corruption was and is a primary
issue).

But realistically, I don't think we've *ever* hit anything serious on
the index file, and it's obviously not a security issue. It also isn't
even a compatibility issue, so it would be trivial to just bump the
version header and saying that the signature changes the meaning of
the checksum.

That said:

> We have been testing a patch in GfW to run the verification in a separate 
> thread
> while the main thread parses (and mallocs) the cache_entries.  This does help
> offset the time.

Yeah, that seems an even better solution, honestly.

The patch would be cleaner without the NO_PTHREADS things.

I wonder how meaningful that thing even is today. Looking at what
seems to select NO_PTHREADS, I suspect that's all entirely historical.
For example, you'll see it for QNX etc, which seems wrong - QNX
definitely has pthreads according to their docs, for example.

                     Linus

Reply via email to