On Wed, Feb 01, 2017 at 10:06:15AM -0800, Junio C Hamano wrote:
> > If you _can_ do that latter part, and you take "I only care about
> > resumability" to the simplest extreme, you'd probably end up with a
> > protocol more like:
> >
> > Client: I need a packfile with this want/have
> > Server: OK, here it is; its opaque id is XYZ.
> > ... connection interrupted ...
> > Client: It's me again. I have up to byte N of pack XYZ
> > Server: OK, resuming
> > [or: I don't have XYZ anymore; start from scratch]
> >
> > Then generating XYZ and generating that bundle are basically the same
> > task.
>
> The above allows a simple and naive implementation of generating a
> packstream and "tee"ing it to a spool file to be kept while sending
> to the first client that asks XYZ.
>
> The story I heard from folks who run git servers at work for Android
> and other projects, however, is that they rarely see two requests
> with want/have that result in an identical XYZ, unless "have" is an
> empty set (aka "clone"). In a busy repository, between two clone
> requests relatively close together, somebody would be pushing, so
> you'd need many XYZs in your spool even if you want to support only
> the "clone" case.

Yeah, I agree a tag "XYZ" does not cover all cases, especially for
fetches.

We do caching at GitHub based on the sha1(want+have+options) tag, and it
does catch quite a lot of duplicated requests, though not all of them.
It catches most clones, and many fetches done by "thundering herds" of
similar clients.
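
That tag is really just a hash over the request inputs. A minimal sketch
of the idea (not our exact scheme; the function name and input handling
here are made up):

    import hashlib

    def pack_cache_key(wants, haves, options):
        """Derive a stable id for a pack request from its inputs.

        Requests with the same sorted want/have sets and the same
        capabilities map to the same key, so the generated pack can
        be served from a spool/cache instead of re-running
        pack-objects.
        """
        h = hashlib.sha1()
        for oid in sorted(wants):
            h.update(b"want " + oid.encode() + b"\n")
        for oid in sorted(haves):
            h.update(b"have " + oid.encode() + b"\n")
        for opt in sorted(options):
            h.update(b"opt " + opt.encode() + b"\n")
        return h.hexdigest()

Two clients cloning the same repository at roughly the same moment get
the same key, which is why the clone case hits so often; any push in
between changes the "want" tips and produces a new key.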

One thing you could do with such a pure "resume XYZ" tag is to represent
the generated pack _without_ replicating the actual object bytes, and
instead take shortcuts by basing particular pieces on the on-disk
packfiles: just enough to serve a deterministic packfile for the same
want/have request. For instance, if the server knew that XYZ meant

- send bytes m through n of packfile p, then...
- send the object at position i of packfile p, as a delta against the
  object at position j of packfile q
- ...and so on
Then you could store very small "instruction sheets" for each XYZ that
rely on the data in the packfiles. If those packfiles go away (e.g., due
to a repack), that invalidates all of your current XYZ tags. That's OK
as long as this is an optimization, not a correctness requirement.

I haven't actually built anything like this, though, so I don't have a
complete language for the instruction sheets, nor numbers on how big
they would be for average cases.
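
Just to give a flavor of what I mean, though, a toy sheet and replayer
might look something like this (entirely hypothetical: the entry format,
packfile names, and offsets are made up, and the "delta" case is exactly
where real pack machinery would have to come in):

    # One hypothetical instruction sheet for a single XYZ.  "copy" replays
    # raw bytes out of an existing on-disk packfile; "delta" asks us to
    # regenerate one object as a delta against an object in another pack.
    sheet = [
        ("copy",  "pack-1234.pack", 12, 904872),
        ("delta", "pack-1234.pack", 5, "pack-5678.pack", 17),
    ]

    def replay(sheet, out):
        # Pack header and trailing checksum handling omitted for brevity.
        for insn in sheet:
            if insn[0] == "copy":
                _, pack, start, end = insn
                with open(pack, "rb") as f:
                    f.seek(start)
                    out.write(f.read(end - start))
            elif insn[0] == "delta":
                # Placeholder: re-deltify object i of one pack against
                # object j of another and emit the result.
                raise NotImplementedError("needs real pack machinery")

The hope is that the "copy" entries cover the bulk of the bytes, so the
sheet stays tiny relative to the pack it describes.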

> So in real life, I think that the exchange needs to be more
> like this:
>
> C: I need a packfile with this want/have
> ... C/S negotiate what "have"s are common ...
> S: Sorry, but our negotiation indicates that you are way too
>    behind. I'll send you a packfile that brings you up to a
>    slightly older set of "want", so pretend that you asked for
>    these slightly older "want"s instead. The opaque id of that
>    packfile is XYZ. After getting XYZ, come back to me with
>    your original set of "want"s. You would give me more recent
>    "have" in that request.
> ... connection interrupted ...
> C: It's me again. I have up to byte N of pack XYZ
> S: OK, resuming (or: I do not have it anymore, start from scratch)
> ... after 0 or more iterations C fully receives and digests XYZ ...
>
> and then the above will iterate until the server does not have to
> say "Sorry but you are way too behind" and returns a packfile
> without having to tweak the "want".

Yes, I think that is a reasonable variant. The client knows about
seeding, but the XYZ conversation continues to happen inside the git
protocol. So it loses flexibility versus a true CDN redirection, but it
would "just work" when the server and client both understand the
feature, without the server admin having to set up a separate
bundle-over-http infrastructure.
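
To make the shape of that loop concrete, the client side might be
something like this sketch (hypothetical: send_request() stands in for
one round of the protocol, apply_pack() for index-pack plus ref updates,
and the "have" negotiation is hidden inside send_request()):

    def fetch_with_seeding(send_request, apply_pack, wants):
        """Iterate until the server can satisfy our real "want"s."""
        while True:
            xyz, offset = None, 0
            while True:  # resumable download of one pack
                reply = send_request(wants, resume=xyz, offset=offset)
                if reply.pack_id != xyz:
                    # First round, or the server no longer has XYZ;
                    # start this pack over from scratch.
                    xyz, offset = reply.pack_id, 0
                try:
                    with open("pack.part", "ab" if offset else "wb") as f:
                        for chunk in reply.stream:
                            f.write(chunk)
                            offset += len(chunk)
                    break  # received the whole pack
                except ConnectionError:
                    continue  # interrupted; come back with the new offset
            apply_pack("pack.part")  # this makes our "have"s more recent
            if reply.final:
                return  # the server satisfied the original "want"s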

-Peff