Al Viro <v...@zeniv.linux.org.uk> writes:

> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
> just do the normal pull with one addition: start with sending the list
> of sha1 of objects you are about to send and let the recipient reply
> with "I already have <set of sha1>, don't bother with those".  And exclude
> those from the transfer.
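
Concretely, the exchange might look something like this (the verbs
and framing below are made up purely for illustration; this is not
an existing protocol extension):

    S: about-to-send <sha1-1> <sha1-2> ... <sha1-N>
    C: already-have <sha1-3> <sha1-7> ...
    S: <pack stream with the acknowledged objects left out>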

I did a quick-and-dirty unscientific experiment.

I had a clone of Linus's repository that was about a week old, whose
tip was at 4de8ebef (Merge tag 'trace-fixes-v4.5-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace,
2016-02-22).  To bring it up to date (i.e. pulling about a week's
worth of progress) to f691b77b (Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs, 2016-03-01):

    $ git rev-list --objects 4de8ebef..f691b77b1fc | wc -l
    1396
    $ git rev-parse 4de8ebef..f691b77b1fc |
      git pack-objects --revs --delta-base-offset --stdout |
      wc -c
    2444127

So in order to salvage some transfer out of 2.4MB, the hypothetical
Al protocol would first have the upload-pack give 1396 object names
(20*1396 = ~28kB in binary form) to fetch-pack; no matter how
fetch-pack encodes its preference, its answer would be smaller than
that.  We would likely design this part of the new protocol in line
with the existing part and use textual object names, so let's round
the total up to 100kB.
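
Just as a back-of-the-envelope check (the extra 5 bytes per textual
name is a guess to account for framing overhead):

    $ echo $((1396 * 20))     # binary SHA-1s, 20 bytes each
    27920
    $ echo $((1396 * 45))     # 40 hex digits plus ~5 bytes of framing each
    62820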

That is quite small.  Even if you are on a crappy connection and
need to retry 5 times, the additional overhead of negotiating the
list of objects would be only 0.5MB (roughly 20% of the real
transfer).

That is quite interesting [*1*].

For the approach to be practical, you would have to write a program
that salvages objects from a half-transferred packfile by reading
the truncated packfile and writing out a new packfile, excising
deltas that lack their bases; it is unclear, however, how involved
that code would get.

It is probably OK for a tiny pack that has only ~1400 objects -- we
could just pass the early part through unpack-objects and let it die
when it hits EOF -- but for a "resumable clone", I do not think you
can afford to unpack the 4.6M objects of the kernel repository into
loose objects.
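
For that tiny case the salvage could be as crude as something like
this (untested; "partial.pack" stands for whatever bytes made it
over the wire before the connection dropped):

    $ git unpack-objects -r <partial.pack

Each object that can be fully resolved from the bytes received ends
up as a loose object; the -r flag just tells unpack-objects to keep
going and recover as much as it can instead of dying at the first
sign of corruption.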

The approach of course requires the server end to spend 5 times as
many cycles as usual in order to help a client that retries 5 times.

On the other hand, the resumable "clone" we were discussing -- have
the server respond with a slightly older bundle or pack and then ask
the client to fill in the latest bits with a follow-up fetch -- aims
to reduce the load on the server side (the "slightly older" part can
be offloaded to a CDN).  It is a happy side effect that material
offloaded to a CDN can be obtained more easily over HTTPS, which is
trivially resumable ;-)
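
The client side of that could be as simple as something along these
lines (the CDN URL is made up; curl's -C option does the actual
resuming):

    $ curl -C - -O https://cdn.example.org/torvalds/linux.bundle
    $ git clone linux.bundle linux
    $ cd linux
    $ git remote set-url origin \
        git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ git fetch origin
    $ git merge --ff-only origin/master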

I think your "I've got these already" extension may be worth trying,
and it is definitely better than "let's make sure the server end
creates a byte-for-byte identical pack stream, and discard the early
part without sending it to the network", and it may help resume a
small incremental fetch.  But I do not think it is advisable to use
it for a full clone, given that it is very likely that we would be
adding the "offload 'clone' to CDN" kind.  Even though I can foresee
the two kinds co-existing, I do not think it is practical to offer
this one for resuming a multi-hour clone of the kernel repository
(or worse, the Android repositories) over a trans-Pacific link, for
example.


[Footnote]

*1* To update v4.5-rc1 to today's HEAD involves 10809 objects, and
    the pack data takes 14955728 bytes.  That translates to ~440kB
    needed to advertise the list of textual object names in order
    to salvage a 15MB object transfer.
