Re: Multi-threaded 'git clone'

2015-02-17 Thread Junio C Hamano
On Tue, Feb 17, 2015 at 3:32 PM, Junio C Hamano  wrote:

A few typofixes and clarifications.

> *4* The scheme in *3* can be extended to bring the fetcher
> step-wise.  If the server's state was X when the fetcher last

"bring the fetcher up-to-date step-wise", or "update the fetcher step-wise".

> contacted it, and since then the server received multiple pushes
> and has two snapshots of states, Y and Z, then the exchange may
> go like this:
>
> fetcher: I am interested in refs/heads/* and refs/tags/* and I
>  have your state X.
>
> server:  Here is the incremental difference to the refs and the
>  end result should hash to Y.  Here comes the pack data
>  to bring you up to date.
>
> fetcher: (after receiving, unpacking and updating the
>  remote-tracking refs) Thanks.  Do you have more?
>
> server:  Yes, here is the incremental difference to the refs and the
>  end result should hash to Z.  Here comes the pack data
>  to bring you up to date.
>
> fetcher: (after receiving, unpacking and updating the
>  remote-tracking refs) Thanks.  Do you have more?
>
> server:  No, you are now fully up to date with me.  Bye.

The initial part of this exchange may go like this, if the state the
fetcher grabbed the last time from this server is even older than X:

  fetcher: I am interested in refs/heads/* and refs/tags/* and I have
your state W (or "I am so far behind that I do not even know
what you call that state").

  server: Sorry, I do not know what W is. Please be more specific.

  fetcher: Here are the refs and their objects: refs/heads/master points
at commit A, refs/heads/maint points at commit B, ...

  server: (goes and checks and finds out that fetcher is behind X).
OK, I'll compute a custom pack to bring you up to date with
one of my states. Here is the incremental difference to the refs,
and the end result should hash to X. Here comes the pack data.

  fetcher: (after receiving, unpacking and updating the
remote-tracking refs) Thanks. Do you have more?

After that, the server would update this client to state Y and then
state Z as above.

Needless to say, this would naturally extend to a case where you
follow only a single branch (you would say "I am interested in your
refs/heads/dev" with a wildcard that matches exactly that branch).
Of course, depending on the access pattern by common project
participants, the server side may choose what set of refs to prepare
such snapshots and uncommon requests may always be served by
the traditional object enumeration codepath.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multi-threaded 'git clone'

2015-02-17 Thread Junio C Hamano
Martin Fick  writes:

> Sorry for the long winded rant. I suspect that some variation of all
> my suggestions have already been suggested, but maybe they will
> rekindle some older, now useful thoughts, or inspire some new ones.
> And maybe some of these are better to pursue than more parallelism?

We avoid doing a grand design document without having some prototype
implementation, but I think the limitation of the current protocol
has become apparent enough that we should do something about it, and
we should do it in a way that different implementations of Git can
all implement.

I think "multi-threaded clone" is a wrong title for this discussion,
in that the user does not care if it is done by multi-threading the
current logic or in any other way.  The user just wants a faster
clone.

In addition, the current "fetch" protocol has the following problems
that limit us:

 - It is not easy to make it resumable, because we recompute every
   time.  This is especially problematic for the initial fetch aka
   "clone" as we will be talking about a large transfer [*1*].

 - The protocol extension has a fairly low length limit [*2*].

 - Because the protocol exchange starts by the server side
   advertising all its refs, even when the fetcher is interested in
   a single ref, the initial overhead is nontrivial, especially when
   you are doing a small incremental update.  The worst case is an
   auto-builder that polls every five minutes, even when there are no
   new commits to be fetched [*3*].

 - Because we recompute every time, taking into account what the
   fetcher has (in addition to what it obtained from us earlier) in
   order to reduce the transferred bytes, the payload for incremental
   updates becomes tailor-made for each fetch and cannot be easily
   reused [*4*].

I'd like to see a new protocol that lets us overcome the above
limitations (did I miss others? I am sure people can help here)
sometime this year.



[Footnotes]

*1* The "first fetch this bundle from elsewhere and then come back
here for incremental updates" raised earlier in this thread may
be a way to alleviate this, as the large bundle can be served
from a static file.

*2* An earlier "this symbolic ref points at that concrete ref"
attempt failed because of this and we only talk about HEAD.

*3* A new "fetch" protocol must avoid this "one side blindly gives a
large message as the first thing".  I have been toying with the
idea of making the fetcher talk first, by declaring "I am
interested in your refs that match refs/heads/* or refs/tags/*,
and I have a superset of objects that are reachable from the
set of refs' values X you gave me earlier", where X is a small
token generated by hashing the output from "git ls-remote $there
refs/heads/* refs/tags/*".  In the best case where the server
understands what X is and has a cached pack data, it can then
send:

- differences in the refs that match the wildcards (e.g. "Back
  then at X I did not have refs/heads/next but now I do and it
  points at this commit.  My refs/heads/master is now at that
  commit.  I no longer have refs/heads/pu.  Everything else in
  the refs/ hierarchy you are interested in is the same as state
  X").

- The new name of the state Y (again, the hashed value of the
  output from "git ls-remote $there refs/heads/* refs/tags/*")
  to make sure the above differences can be verified at the
  receiving end.

- the cached pack data that contains all necessary objects
  between X and Y.

Note that the above would work if and only if we accept that it
is OK to send objects between the remote tracking branches the
fetcher has (i.e. the objects it last fetched from the server)
and the current tips of branches the server has, without
optimizing by taking into account that some commits in that set
may have already been obtained by the fetcher from a
third-party.

If the server does not recognize state X (after all it is just a
SHA-1 hash value, so the server cannot recreate the set of refs
and their values from it unless it remembers), the exchange
would have to degenerate to the traditional transfer.

The server would want to recognize the result of hashing an
empty string, though.  The fetcher is saying "I have nothing"
in that case.
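For concreteness, the state token described above could be computed with
stock commands today; this is only a sketch of the idea (the exact hashing
scheme and the `state_token` helper are assumptions, not part of any
implemented protocol):

```shell
# Hash the ref advertisement into a single short "state" token; both
# ends can then name a snapshot of the refs by this one value.
state_token() {
  git ls-remote "$1" 'refs/heads/*' 'refs/tags/*' | sha1sum | cut -d' ' -f1
}

# A fetcher would open the conversation with something like
#   "I want refs/heads/* and refs/tags/*, and I have your state X"
# where X was computed the last time it talked to the server:
#   X=$(state_token "$there")
```

Hashing the empty advertisement gives the well-known "I have nothing"
token mentioned above.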


*4* The scheme in *3* can be extended to bring the fetcher
step-wise.  If the server's state was X when the fetcher last
contacted it, and since then the server received multiple pushes
and has two snapshots of states, Y and Z, then the exchange may
go like this:

fetcher: I am interested in refs/heads/* and refs/tags/* and I
 have your state X.

server:  Here is the incremental difference to the refs and the
 end result should hash to Y.  Here comes the pack data
 to bring you up to date.

fetcher: (after receiving, unpacking and updating the
 remote-tracking refs) Thanks.  Do you have more?

server:  Yes, here is the incremental difference to the refs and the
 end result should hash to Z.  Here comes the pack data
 to bring you up to date.

fetcher: (after receiving, unpacking and updating the
 remote-tracking refs) Thanks.  Do you have more?

server:  No, you are now fully up to date with me.  Bye.

Re: Multi-threaded 'git clone'

2015-02-16 Thread Martin Fick
There currently is a thread on the Gerrit list about how much faster cloning 
can be when using Gerrit/jgit GCed packs with bitmaps versus C git GCed packs 
with bitmaps.

Some differences outlined there: jgit seems to have more bitmaps; it creates 
one for every ref under refs/heads. Is C git doing that?  Another difference 
seems to be that jgit creates two packs, splitting everything not reachable 
from refs/heads into its own pack.  This makes a clone cost zero server-side 
CPU in the pristine case.  In the Gerrit use case, this second "unreachable" 
packfile can be sizeable; I wonder if there are other use cases where this 
might also be true (and thus slow down clones for C git GCed repos)?

If there is not a lot of parallelism left to squeeze out, perhaps a focus with 
better returns is trying to do whatever is possible to make all clones (and 
potentially any fetch use case deemed important on a particular server) have 
zero CPU?  Depending on what a server's primary mission is, I could envision 
certain admins willing to sacrifice significant amounts of disk space to speed 
up their fetches.  Perhaps some more extreme thinking (such as what must have 
led to bitmaps) is worth brainstorming about to improve server use cases?

What if an admin were willing to sacrifice a packfile for every use case he 
deemed important, could git be made to support that easily?  For example, maybe 
the admin considers a clone or a fetch from master to be important, could zero 
percent CPU be achieved regularly for those two use cases?  Cloning is possible 
if the repository were repacked in the jgit style after any push to a head.  Is 
it worth exploring ways of making GC efficient enough to make this feasible?  
Can bitmaps be leveraged to make repacking faster?  I believe that at least 
reachability checking could potentially be improved with bitmaps.  Are there 
any ways to get better delta reuse during repacking (not bitmap related), by 
somehow reversing or translating deltas onto newly received objects without 
actually recalculating them, while still getting most objects deltified 
against the newest objects (achieving the same packs as git GC would achieve 
today, but faster)?  What other pieces need to be improved to make repacking 
faster?

As for the single branch fetch case, could this somehow be improved by 
allocating one or more packfiles to this use case?  The simplest single branch 
fetch use case is likely someone doing a git init followed by a single branch 
fetch.  I think the android repo tool can be used in this way, so this may 
actually be a common use case?  With a packfile dedicated to this branch, git 
should be able to just stream it out without any CPU.  But I think git would 
need to know this packfile exists to be able to use it.  It would be nice if 
bitmaps could help here, but I believe bitmaps can so far only be used for one 
packfile.  I understand that making bitmaps span multiple packfiles would be 
very complicated, but maybe it would not be so hard to support bitmaps on 
multiple packfiles if each of these were "self contained"?  By self contained I 
mean that all objects referenced by objects in the packfile were contained in 
that packfile.

What other still unimplemented caching techniques could be used to improve 
clone/fetch use cases? 

- Shallow clones (dedicate a special packfile to this, what about another 
bitmap format that only maps objects in a single tree to help this)?

- Small fetches (simple branch FF updates), I suspect these are fast enough, 
but if not, maybe caching some thin packs (that could result in zero CPU 
requests for many clients) would be useful?  Maybe spread these out 
exponentially over time so that many will be available for recent updates and 
fewer for older updates?  I know git normally throws away thin packs after 
receiving them and resolving them, but if it kept them around (maybe in a 
special directory), it seems that they could be useful for updating other 
clients with zero CPU?  A thin pack cache might be something really easy to 
manage based on file timestamps, an admin may simply need to set a max cache 
size.  But how can git know what thin packs it has, and what they would be 
useful for, name them with their start and ending shas?
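The timestamp-based cache management suggested above could be as simple as
the sketch below; the cache layout, the `.pack` naming, and the size cap are
all assumptions (the naming question, start and end SHAs, is left open):

```shell
# Evict oldest thin packs first until the cache fits under max_kb.
trim_pack_cache() {
  cache=$1 max_kb=$2
  while [ "$(du -sk "$cache" | cut -f1)" -gt "$max_kb" ]; do
    oldest=$(ls -t "$cache" | tail -n 1)   # least recently written file
    [ -n "$oldest" ] || break
    rm -f "$cache/$oldest"
  done
}
```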

Sorry for the long winded rant. I suspect that some variation of all my 
suggestions have already been suggested, but maybe they will rekindle some 
older, now useful thoughts, or inspire some new ones.  And maybe some of these 
are better to pursue than more parallelism?

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project

On Feb 16, 2015 8:47 AM, Jeff King  wrote:
>
> On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote: 
>
> > >Then the server streams the data to the client. It might do some light 
> > >work transforming the data as it comes off the disk, but most of it is
> > >just blitted straight from disk, and the network is the bottleneck.

Re: Multi-threaded 'git clone'

2015-02-16 Thread Shawn Pearce
On Mon, Feb 16, 2015 at 10:43 AM, Junio C Hamano  wrote:
> Jeff King  writes:
>
>> ... And the whole output is checksummed by a single sha1
>> over the whole stream that comes at the end.
>>
>> I think the most feasible thing would be to quickly spool it to a
>> server on the LAN, and then use an existing fetch-in-parallel tool
>> to grab it from there over the WAN.
>
> One possibility would be for the server to prepare a static bundle
> file to bootstrap all the "clone" clients with and publish it on a
> CDN.  A protocol extension would tell the client where to download
> the bundle from, the client can then grab the bundle to clone from
> it to become "slightly stale but mostly up to date", and then do a
> usual incremental update with the server after that to be complete.
>
> The server would update the bundle used to bootstrap clone clients
> periodically in order to keep the incrementals to the minimum, and
> would make sure their bitmap is anchored at the tips of bundles to
> minimize the object counting load during the incremental phase.
>
> I think "repo" used by folks who do AOSP does something similar to
> that by scripting around "git clone".  I'd imagine that they would
> be happy to see if "git clone" did all that inside.

Yes, the "repo" tool used by Android uses curl to download a
previously cached $URL/clone.bundle using resumable HTTP. For Android
the file is only updated ~every 6 months at major releases and is
easily cached by CDNs and HTTP proxy servers.

This is spooled to a temporary file on disk then unpacked using `git
fetch $path/clone.bundle refs/heads/*:refs/remotes/origin/*`.
Afterwards a normal git fetch is run to bring the new clone current
with the server, picking up any delta that happened since the bundle
was created and cached.
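A minimal sketch of that flow with stock git and curl (the helper name and
directory layout are illustrative; the $URL/clone.bundle convention is as
described above):

```shell
# Bootstrap a clone from a CDN-cacheable bundle, then catch up normally.
bootstrap_clone() {
  url=$1 dir=$2
  git init "$dir" && cd "$dir" || return 1
  # Resumable download: -C - continues a partial transfer, -L follows
  # the server's 302 redirect to the CDN.
  curl -fL -C - -o clone.bundle "$url/clone.bundle"
  # A bundle file is a valid fetch source:
  git fetch clone.bundle 'refs/heads/*:refs/remotes/origin/*'
  # Pick up whatever was pushed since the bundle was created:
  git remote add origin "$url" && git fetch origin
}
```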

The Android Git servers at android.googlesource.com just recognize
*/clone.bundle GET requests and issue 302 redirects to the CDN farm
that actually stores and serves the precreated bundle files.

We really want to see this in stock git clone for HTTP transports, as
other projects like Chromium want to use it for their ~3 GiB
repository. Being able to build the bulk of the repo every few months
and serve it out using a CDN to bootstrap new clients would really
help developers on slower or flaky network connections.


Re: Multi-threaded 'git clone'

2015-02-16 Thread Jeff King
On Tue, Feb 17, 2015 at 06:16:39AM +0700, Duy Nguyen wrote:

> On Mon, Feb 16, 2015 at 10:47 PM, Jeff King  wrote:
> > Each clone generates the pack on the fly
> > based on what's on disk and streams it out. It should _usually_ be the
> > same, but there's nothing to guarantee byte-for-byte equality between
> > invocations.
> 
> It's usually _not_ the same. I tried when I wanted to produce stable
> packs. The first condition is single-threaded pack-objects. Otherwise
> thread scheduler could make object order unpredictable.

True. If you keep your server repositories fully packed, that eliminates
the delta search (and/or makes it feasible to turn pack.threads to 1 to
make it deterministic). But any change in the repository (e.g., somebody
else pushing, even to a ref you are not fetching) can cause unexpected
changes in the bytes.
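As a sketch, the two conditions named here (a fully packed repository and
single-threaded pack-objects) map onto stock commands like the following;
hedged, since nothing guarantees pack stability across git versions:

```shell
# Repack so that repeated invocations of pack-objects give the same bytes.
make_stable_pack() {
  git -C "$1" config pack.threads 1   # deterministic object/delta order
  git -C "$1" repack -a -d -f -q      # one full pack, deltas recomputed
}
```

Because pack files are named after their trailing checksum, running this
twice on an unchanged repository should yield an identically named, identical
pack.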

-Peff


Re: Multi-threaded 'git clone'

2015-02-16 Thread Duy Nguyen
On Mon, Feb 16, 2015 at 10:47 PM, Jeff King  wrote:
> Each clone generates the pack on the fly
> based on what's on disk and streams it out. It should _usually_ be the
> same, but there's nothing to guarantee byte-for-byte equality between
> invocations.

It's usually _not_ the same. I tried when I wanted to produce stable
packs. The first condition is single-threaded pack-objects. Otherwise
thread scheduler could make object order unpredictable.
-- 
Duy


Re: Multi-threaded 'git clone'

2015-02-16 Thread Junio C Hamano
Jeff King  writes:

> ... And the whole output is checksummed by a single sha1
> over the whole stream that comes at the end.
>
> I think the most feasible thing would be to quickly spool it to a
> server on the LAN, and then use an existing fetch-in-parallel tool
> to grab it from there over the WAN.

One possibility would be for the server to prepare a static bundle
file to bootstrap all the "clone" clients with and publish it on a
CDN.  A protocol extension would tell the client where to download
the bundle from, the client can then grab the bundle to clone from
it to become "slightly stale but mostly up to date", and then do a
usual incremental update with the server after that to be complete.

The server would update the bundle used to bootstrap clone clients
periodically in order to keep the incrementals to the minimum, and
would make sure their bitmap is anchored at the tips of bundles to
minimize the object counting load during the incremental phase.

I think "repo" used by folks who do AOSP does something similar to
that by scripting around "git clone".  I'd imagine that they would
be happy to see if "git clone" did all that inside.


Re: Multi-threaded 'git clone'

2015-02-16 Thread Jeff King
On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote:

> >Then the server streams the data to the client. It might do some light
> >work transforming the data as it comes off the disk, but most of it is
> >just blitted straight from disk, and the network is the bottleneck.
> 
> Depending on how close to full the WAN link is, it may be possible to
> improve this with multiple connections (again, referencing bbcp), but
> there's also the question of if it's worth trying to use the entire WAN for
> a single user. The vast majority of the time the server is doing more than
> one thing and would rather let any individual user wait a bit and service
> the other users.

Yeah, I have seen clients that make multiple TCP connections to each
request a chunk of a file in parallel. The short answer is that this is
going to be very hard with git. Each clone generates the pack on the fly
based on what's on disk and streams it out. It should _usually_ be the
same, but there's nothing to guarantee byte-for-byte equality between
invocations. So you'd have to multiplex all of the connections into the
same server process. And even then it's hard; that process knows it's
going to send you the bytes for object X, but it doesn't know at
exactly which offset until it gets there, which makes sending things out
of order tricky. And the whole output is checksummed by a single sha1
over the whole stream that comes at the end.

I think the most feasible thing would be to quickly spool it to a server
on the LAN, and then use an existing fetch-in-parallel tool to grab it
from there over the WAN.

-Peff


Re: Multi-threaded 'git clone'

2015-02-16 Thread David Lang

On Mon, 16 Feb 2015, Jeff King wrote:


On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:


I think it's an interesting question to look at, but before you start
looking at changing the architecture of the current code, I would suggest
doing a bit more analysis of the problem to see if the bottleneck is really
where you think it is.

First measure, then optimize :-)


Yes, very much so. Fortunately some people have already done some of
this work. :)


nice summary


Then the server streams the data to the client. It might do some light
work transforming the data as it comes off the disk, but most of it is
just blitted straight from disk, and the network is the bottleneck.


Depending on how close to full the WAN link is, it may be possible to improve 
this with multiple connections (again, referencing bbcp), but there's also the 
question of if it's worth trying to use the entire WAN for a single user. The 
vast majority of the time the server is doing more than one thing and would 
rather let any individual user wait a bit and service the other users.


David Lang


Re: Multi-threaded 'git clone'

2015-02-16 Thread Jeff King
On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:

> I think it's an interesting question to look at, but before you start
> looking at changing the architecture of the current code, I would suggest
> doing a bit more analysis of the problem to see if the bottleneck is really
> where you think it is.
> 
> First measure, then optimize :-)

Yes, very much so. Fortunately some people have already done some of
this work. :)

On the server side of a clone, the things that must be done before
sending any data are:

  1. Count up all of the objects that must be sent by traversing the
 object graph.

  2. Find any pairs for delta compression (this is the "Compressing
 objects" phase of the progress reporting).

Step (1) naively takes 30-45 seconds for a kernel repo. However, with
reachability bitmaps, it's instant-ish. I just did a clone from
kernel.org, and it looks like they've turned on bitmaps.

For step (2), git will reuse deltas that already exist in the on-disk
packfile, and will not consider new deltas between objects that are
already in the same pack (because we would already have considered them
when packing in the first place). So the key for servers is to keep
things pretty packed. My kernel.org clone shows that they could probably
stand to repack torvalds/linux.git, but it's not too terrible.

This part is multithreaded, so what work we do happens in parallel. But
note that some servers may turn pack.threads down to 1 (since their many
CPUs are kept busy by multiple requests, rather than trying to finish a
single one).
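The server-side setup this implies can be sketched as follows (the
`repack.writeBitmaps` config name is an assumption about a reasonably
current git 2.x):

```shell
# Keep a serving repository fully packed with a reachability bitmap so
# the object-counting phase of a clone is near-instant.
enable_bitmaps() {
  git -C "$1" config repack.writeBitmaps true
  git -C "$1" repack -a -d -q
}
```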

Then the server streams the data to the client. It might do some light
work transforming the data as it comes off the disk, but most of it is
just blitted straight from disk, and the network is the bottleneck.

On the client side, the incoming data streams into an index-pack
process. For each full object it sees, it hashes and records the name of
the object as it comes in. For deltas, it queues them for resolution
after the complete pack arrives.

Once the full pack arrives, then it resolves all of the deltas. This
part is also multithreaded. If you check out "top" during the "resolving
deltas" phase of the clone, you should see multiple cores in use.

So I don't think there is any room for "just multithread it" in this
process. The CPU intensive bits are already multithreaded. There may be
room for optimizing that, though (e.g., reducing lock contention or
similar).

It would also be possible to resolve deltas while the pack is streaming
in, rather than waiting until the whole thing arrives. That's not
possible in all cases (an object may be a delta against a base that
comes later in the pack), but in practice git puts bases before their
deltas. However, it's overall less efficient, because you may end up
walking through the same parts of the delta chain more than once. For
example, imagine you see a stream of objects A, B, C, D. You get B and
see that it's a delta against A. So you resolve it, hash the object, and
are good. Now you see C, which is a delta against B. To generate C, you
have to compute B again. Now you get to D, which is another delta
against B. So now we compute B again.

You can get around this somewhat with a cache of intermediate object
contents, but of course there may be hundreds or thousands of chains
like this in use at once, so you're going to end up with some cache
misses.

What index-pack does instead is to wait until it has all of the objects,
then finds A and says "what objects use A as a base?". Then it computes
B, hashes it, and says "what objects use B as a base?". And finds C and
D, after which it knows it can drop the intermediate result B.

So that's less work over all, though in some workloads it may finish
faster if you were to stream it (because your many processors are
sitting idle while we are blocked on network bandwidth). So that's a
potential area of exploration.

-Peff


Re: Multi-threaded 'git clone'

2015-02-16 Thread David Lang

On Mon, 16 Feb 2015, Koosha Khajehmoogahi wrote:


Cloning huge repositories like the Linux kernel takes a considerable
amount of time. Is it possible to add multi-threaded, simultaneous
connection support for cloning? To what extent would we need to change
the architecture of the current code, and how large would the scope of
the work be? It just seems like an interesting idea to me and I wanted
to share it with the community.


The key question is what it is that takes the time in cloning, and
whether that can be multi-threaded.


If it's the network traffic that takes the most time, where is the bottleneck?

Is it in the server software assembling what will be sent? Is it in the 
receiving software processing it? If so, multiple threads could help.


Is it in network bandwidth? If so doing multiple connections won't help much. 
TCP connections favour a few connections passing a lot of data rather than many 
connections passing a little. The one place where multiple connections can help 
is when you have non-congestion induced packet loss as a lost packet on a 
connection will cause the throughput of that connection to drop (if the drop is 
due to congestion, this is TCP working as designed, throttling back to match the 
available bandwidth). This can be a significant effect if you have a very high 
bandwidth, high latency connection (think multiple Gb on international 
connections), but for lower bandwidth connections it's much less of a factor. 
You can look at projects like bbcp.


I think it's an interesting question to look at, but before you start looking at 
changing the architecture of the current code, I would suggest doing a bit more 
analysis of the problem to see if the bottleneck is really where you think it 
is.


First measure, then optimize :-)

David Lang


Multi-threaded 'git clone'

2015-02-16 Thread Koosha Khajehmoogahi
Greetings,

Cloning huge repositories like the Linux kernel takes a considerable
amount of time. Is it possible to add multi-threaded, simultaneous
connection support for cloning? To what extent would we need to change
the architecture of the current code, and how large would the scope of
the work be? It just seems like an interesting idea to me and I wanted
to share it with the community.

Regards

-- 
Koosha