On Tue, 11 Dec 2007, Jon Smirl wrote:
> This makes sense. Those runs that blew up to 4.5GB were a combination
> of this effect and fragmentation in the gcc allocator.
I disagree. This is insane.
> Google allocator appears to be much better at controlling fragmentation.
Indeed. And if fragmentation really is wasting half of Git's memory
usage, then we'll have to come up with a custom memory allocator.
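
(Purely for illustration: one fragmentation-resistant approach for this
kind of workload is an arena/"bump" allocator that hands out allocations
with the same lifetime from large contiguous blocks and frees them all at
once. The code below is a hypothetical sketch, not anything in git.git.)

	/* Hypothetical arena allocator sketch.  Allocations that live and
	 * die together come out of big chunks, so the C library allocator
	 * never sees the small interleaved requests that cause
	 * fragmentation; everything is released in one go. */
	#include <stdlib.h>

	struct arena_chunk {
		struct arena_chunk *next;
		size_t used, size;
		char data[];		/* C99 flexible array member */
	};

	struct arena {
		struct arena_chunk *head;
		size_t chunk_size;	/* e.g. 1 MiB */
	};

	static void *arena_alloc(struct arena *a, size_t len)
	{
		struct arena_chunk *c = a->head;

		len = (len + 7) & ~(size_t)7;	/* keep 8-byte alignment */
		if (!c || c->used + len > c->size) {
			size_t size = len > a->chunk_size ? len : a->chunk_size;
			c = malloc(sizeof(*c) + size);
			if (!c)
				return NULL;
			c->next = a->head;
			c->used = 0;
			c->size = size;
			a->head = c;
		}
		c->used += len;
		return c->data + c->used - len;
	}

	static void arena_free_all(struct arena *a)
	{
		struct arena_chunk *c, *next;

		for (c = a->head; c; c = next) {
			next = c->next;
			free(c);
		}
		a->head = NULL;
	}

The point is simply to group allocations by lifetime so the general
purpose allocator never has to deal with them one by one.
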
> Is there a reasonable scheme to force the chains to only be loaded
> once and then shared between worker threads? The memory blow up
> appears to be directly correlated with chain length.
No. Deltas can only be computed against fully expanded base objects, so
keeping whole chains loaded and shared would be the equivalent of holding
each revision of all files uncompressed in memory all at once.
> > That said, I suspect there are a few things fighting you:
> >
> > - threading is hard. I haven't looked a lot at the changes Nico did to do
> > a threaded object packer, but what I've seen does not convince me it is
> > correct. The "trg_entry" accesses are *mostly* protected with
> > "cache_lock", but nothing else really seems to be, so quite frankly, I
> > wouldn't trust the threaded version very much. It's off by default, and
> > for a good reason, I think.
> >
> > For example: the packing code does this:
> >
> > 	if (!src->data) {
> > 		read_lock();
> > 		src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz);
> > 		read_unlock();
> > 		...
> >
> > and that's racy. If two threads come in at roughly the same time and
> > see a NULL src->data, they'll both get the lock, and they'll both
> > (serially) try to fill it in. It will all *work*, but one of them will
> > have done unnecessary work, and one of them will have their result
> > thrown away and leaked.
>
> That may account for the threaded version needing an extra 20 minutes
> of CPU time. An extra 12% of CPU seems like too much overhead for
> threading alone; just letting a couple of those long chain compressions
> be done twice could account for that.
No, it may not. This theory is wrong, as explained before.
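
That said, the duplicated work and the leak Linus points out are easy to
close by re-checking src->data once the lock is held. A rough sketch only,
keeping the same names as the snippet above and assuming read_lock() /
read_unlock() still serialize all object database access:

	if (!src->data) {
		read_lock();
		/* Re-check under the lock: another thread may have
		 * loaded the object while we were waiting. */
		if (!src->data)
			src->data = read_sha1_file(src_entry->idx.sha1,
						   &type, &sz);
		read_unlock();
		...

Only one thread then calls read_sha1_file() for a given entry, so nothing
is read twice and no result gets overwritten and leaked.
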
> >
> > Are you hitting issues like this? I dunno. The object sorting means
> > that different threads normally shouldn't look at the same objects (not
> > even the sources), so probably not, but basically, I wouldn't trust the
> > threading 100%. It needs work, and it needs to stay off by default.
> >
> > - you're working on a problem that isn't really even worth optimizing
> > that much. The *normal* case is to re-use old deltas, which makes all
> > of the issues you are fighting basically go away (because you only have
> > a few _incremental_ objects that need deltaing).
>
> I agree, this problem only occurs when people import giant
> repositories. But every time someone hits these problems, they declare
> git to be screwed up and proceed to trash it in their blogs.
It's not only repack, either. Someone just reported git-blame being
unusable as well because of insane memory usage, which I suspect comes
down to the same issue.
Nicolas