On Fri, Feb 28, 2014 at 10:09:08AM -0700, Nasser Grainawi wrote:

> > Exactly. The two features (bitmaps and .keep) are not compatible with
> > each other, so you have to prioritize one. If you are using static .keep
> > files, you might want them to continue being respected at the expense of
> > using bitmaps for that repo. So I think you want a separate option from
> > --write-bitmap-index to allow the appropriate flexibility.
> 
> Has anyone thought about how to make them compatible?

Yes, but it's complicated and not likely to happen soon.

Having .keep files means that you are not including some objects in the
newly created pack. Each bit in a commit's bitmap corresponds to one
object in the pack, and whether it is reachable from that commit. The
bitmap is only useful if we can calculate the full reachability from it,
and it has no way to specify objects outside of the pack.

To fix this, you would need to change the on-disk format of the bitmaps
to somehow reference objects outside of the pack. Either by having the
bitmaps index a repo-global set of objects, or by permitting a list of
"edge" objects that are referenced from the pack, but not included (and
then when assembling the full reachable list, you would have to recurse
across "edge" objects to find their reachable list in another pack,
etc).

So it's possible, but it would complicate the scheme quite a bit, and
would not be backwards compatible with either JGit or C Git.

> We're using Martin Fick's git-exproll script which makes heavy use of
> keeps to reduce pack file churn. In addition to the on-disk benefits
> we get there, the driving factor behind creating exproll was to
> prevent Gerrit from having two large (30GB+) mostly duplicated pack
> files open in memory at the same time. Repacking in JGit would help in
> a single-master environment, but we'd be back to having this problem
> once we go to a multi-master setup.
> 
> Perhaps the solution here is actually something in JGit where it could
> aggressively try to close references to pack files

In C git we don't worry about this too much, because our programs tend
to be short-lived, and references to the old pack will go away quickly.
Plus it is all mmap'd, so as we simply stop accessing the pages of the
old pack, they should eventually be dropped if there is memory pressure.

I seem to recall that JGit does not mmap its packfiles. Does it pread?
In that case, I'd expect unused bits from the duplicated packfile to get
dropped from the disk cache over time. If it loads whole packfiles into
memory, then yes, it should probably close more aggressively.

> , but that still
> doesn't help the disk churn problem. As Peff says below, we would want
> to repack often to get up-to-date bitmaps, but ideally we could do
> that without writing hundreds of GBs to disk (which is obviously worse
> when "disk" is a NFS mount).

Ultimately I think the solution to the churn problem is a packfile-like
storage that allows true appending of deltas. You can come up with a
scheme to allow deltas between on-disk packs (i.e., "thin" packs on
disk). The trick there is handling the dependencies and cycles. I think
you could get by with a strict ordering of packs and a few rules:

  1. An object in a pack with weight A cannot have as a base an object
     in a pack with weight <= A.

  2. A pack with weight A cannot be deleted if there exists a pack with
     weight > A.

But you'd want to also add in a single update-able index over all the
packfiles, and even then you'd still want to pack occasionally (because
you'd end up with deltas on bases going back in time, but you really
prefer your bases to be near the tip of history).

So I am not volunteering to work on it. :)

At GitHub we mostly deal with the churn by throwing more server
resources at it. But we have the advantage of having a very large number
of small-to-medium repos, which is relatively easy to scale up. A small
number of huge repos is trickier.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to