On Wed, May 16, 2018 at 04:02:53PM -0400, Konstantin Ryabitsev wrote:

> On 05/16/18 15:37, Jeff King wrote:
> > Yes, that's pretty close to what we do at GitHub. Before doing any
> > repacking in the mother repo, we actually do the equivalent of:
> > 
> >   git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
> >   git repack -Adl
> > 
> > from each child to pick up any new objects to de-duplicate (our "mother"
> > repos are not real repos at all, but just big shared-object stores).
> 
> Yes, I keep thinking of doing the same, too -- instead of using
> torvalds/linux.git for alternates, have an internal repo where objects
> from all forks are stored. This conversation may finally give me the
> shove I've been needing to poke at this. :)
> 
> Is your delta-islands patch heading into upstream, or is that something
> that's going to remain external?

I have vague plans to submit it upstream, but I'm still not convinced
it's quite optimal. The resulting packs tend to be a fair bit larger
than they could be when packed by themselves, because we miss many delta
opportunities (and it's important to "repack -f --window=250" once in a
while, since we're throwing away so many delta candidates).

There's an alternative way of doing it, too, which I think git.or.cz
uses: it "layers" forks in a hierarchy. So if I fork torvalds/linux.git,
then I get my own repo that uses torvalds/linux as an alternate. And if
somebody forks my repo, then I'm their alternate, and they recursively
depend on torvalds/linux. So each fork basically layers a slice of its
own pack on top of the parent.
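
A tiny sketch of that layering, with made-up names (root.git standing
in for torvalds/linux.git): each fork just lists its parent's object
directory in objects/info/alternates, and since alternates are
resolved recursively, a leaf can read objects that physically live
only at the root:

```shell
# Hypothetical three-level chain: root.git <- fork.git <- subfork.git.
set -e
tmp=$(mktemp -d) && cd "$tmp"

git init -q --bare root.git      # stands in for torvalds/linux.git
git init -q --bare fork.git      # my fork
git init -q --bare subfork.git   # somebody's fork of my fork

echo "$tmp/root.git/objects" > fork.git/objects/info/alternates
echo "$tmp/fork.git/objects" > subfork.git/objects/info/alternates

# Put a commit into root.git only:
git init -q work && cd work
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'upstream history'
git push -q ../root.git HEAD:refs/heads/master
cd ..

# subfork.git never fetched anything, yet it can see root's objects
# through the recursive alternates chain:
git --git-dir=subfork.git cat-file -e \
    "$(git --git-dir=root.git rev-parse refs/heads/master)"
```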

This is all from recollections of past discussions (which were sadly not
on the list -- I don't know if they've written up their scheme anywhere
public), so I may have some details wrong. But I think that their
repacking is done hierarchically, too: any objects which the root fork
might drop get migrated up to the children instead, and so forth, until
the leaf nodes can actually throw away objects.

The big problem with this is that Git tends to behave better when
objects are in the same pack:

  1. We don't bother looking for new deltas within the same pack,
     whereas a clone of a fork may actually try to find new deltas
     between the layers.

  2. Reachability bitmaps can't cross pack boundaries (due to the way
     they're implemented, but also the current on-disk format). So you
     can only bitmap the root repo, not any of the other layers.
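
As a quick illustration of point 2 (in a throwaway repo, nothing
GitHub-specific): repack writes the .bitmap alongside a single
all-in-one pack, which is why a layered fork can't be bitmapped:

```shell
# Minimal demo: a reachability bitmap is written next to exactly one
# pack and can only describe the objects inside that pack.
set -e
tmp=$(mktemp -d) && cd "$tmp"

git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'some history'

# -a: everything into one pack; -d: drop old packs; -b: write bitmap
git repack -a -d -b

ls .git/objects/pack/*.bitmap    # one bitmap, tied to the single pack
```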

> I feel like a whitepaper on "how we deal with bajillions of forks at
> GitHub" would be nice. :) I was previously told that it's unlikely such
> paper could be written due to so many custom-built things at GH, but I
> would be very happy if that turned out not to be the case.

We have a few engineering blog posts on the subject, like:

  https://githubengineering.com/counting-objects/
  https://githubengineering.com/introducing-dgit/
  https://githubengineering.com/building-resilience-in-spokes/

but we haven't done a very good job of keeping that up. I think a
summary whitepaper would be interesting. Maybe one day... :)

-Peff
