Re: Multi-threaded 'git clone'
On Tue, Feb 17, 2015 at 3:32 PM, Junio C Hamano wrote:

A few typofixes and clarifications.

> *4* The scheme in *3* can be extended to bring the fetcher
>     step-wise. If the server's state was X when the fetcher last

"bring the fetcher up-to-date step-wise", or "update the fetcher
step-wise".

>     contacted it, and since then the server received multiple pushes
>     and has two snapshots of states, Y and Z, then the exchange may
>     go like this:
>
>     fetcher: I am interested in refs/heads/* and refs/tags/* and I
>              have your state X.
>
>     server: Here is the incremental difference to the refs and the
>             end result should hash to Y. Here comes the pack data
>             to bring you up to date.
>
>     fetcher: (after receiving, unpacking and updating the
>              remote-tracking refs) Thanks. Do you have more?
>
>     server: Yes, here is the incremental difference to the refs and
>             the end result should hash to Z. Here comes the pack
>             data to bring you up to date.
>
>     fetcher: (after receiving, unpacking and updating the
>              remote-tracking refs) Thanks. Do you have more?
>
>     server: No, you are now fully up to date with me. Bye.

The initial part of this exchange may go like this, if the state the
fetcher grabbed the last time from this server is even older than X:

    fetcher: I am interested in refs/heads/* and refs/tags/* and I
             have your state W (or "I know I am so old that I do not
             know what you call that state").

    server: Sorry, I do not know what W is. Please be more specific.

    fetcher: Here are the refs and their objects: refs/heads/master
             points at commit A, refs/heads/maint points at commit
             B, ...

    server: (goes and checks and finds out that the fetcher is
            behind X) OK, I'll compute a custom pack to bring you up
            to date with one of my states. Here is the incremental
            difference to the refs, and the end result should hash
            to X. Here comes the pack data.

    fetcher: (after receiving, unpacking and updating the
             remote-tracking refs) Thanks. Do you have more?
After that, the server would update this client to state Y and then
state Z as above.

Needless to say, this would naturally extend to a case where you
follow only a single branch (you would say "I am interested in your
refs/heads/dev", with a wildcard that matches exactly that branch).

Of course, depending on the access pattern of common project
participants, the server side may choose the set of refs for which to
prepare such snapshots, and uncommon requests may always be served by
the traditional object-enumeration codepath.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
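[Editor's note: the step-wise exchange above can be sketched as a toy
client/server loop. This is a hypothetical simulation, not the actual
protocol; the snapshot names (X, Y, Z), ref names, and ref values are
invented, and a real server would ship cached pack data alongside each
ref diff.]

```python
# Toy simulation of the step-wise fetch exchange sketched above.
# Each snapshot is a (state-name, refs) pair; the server walks the
# fetcher forward one snapshot at a time.

SNAPSHOTS = [
    ("X", {"refs/heads/master": "a1", "refs/heads/maint": "b1"}),
    ("Y", {"refs/heads/master": "a2", "refs/heads/maint": "b1"}),
    ("Z", {"refs/heads/master": "a3", "refs/heads/maint": "b2"}),
]

def server_next(state):
    """Return (next_state, ref_diff) one step past `state`, or None
    when the fetcher is fully up to date ("No more. Bye.")."""
    names = [name for name, _ in SNAPSHOTS]
    i = names.index(state)
    if i + 1 == len(SNAPSHOTS):
        return None
    old, new = SNAPSHOTS[i][1], SNAPSHOTS[i + 1][1]
    # Only the incremental difference to the refs is sent.
    diff = {ref: val for ref, val in new.items() if old.get(ref) != val}
    return SNAPSHOTS[i + 1][0], diff

def fetch(client_state, client_refs):
    """The fetcher's loop: 'Thanks. Do you have more?' until done."""
    while True:
        step = server_next(client_state)
        if step is None:
            return client_state, client_refs
        client_state, diff = step
        client_refs.update(diff)   # unpack and update remote-tracking refs
```

A fetcher that last saw state X ends at state Z after two round trips,
with its remote-tracking refs matching the server's final snapshot.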
Re: Multi-threaded 'git clone'
Martin Fick writes:

> Sorry for the long-winded rant. I suspect that some variation of all
> my suggestions has already been suggested, but maybe they will
> rekindle some older, now useful thoughts, or inspire some new ones.
> And maybe some of these are better to pursue than more parallelism?

We avoid doing a grand design document without having some prototype
implementation, but I think the limitations of the current protocol
have become apparent enough that we should do something about it, and
we should do it in a way that different implementations of Git can all
implement.

I think "multi-threaded clone" is a wrong title for this discussion,
in that the user does not care whether it is done by multi-threading
the current logic or in any other way. The user just wants a faster
clone.

In addition, the current "fetch" protocol has the following problems
that limit us:

 - It is not easy to make it resumable, because we recompute every
   time. This is especially problematic for the initial fetch, aka
   "clone", as we will be talking about a large transfer [*1*].

 - The protocol extension has a fairly low length limit [*2*].

 - Because the protocol exchange starts with the server side
   advertising all its refs, even when the fetcher is interested in
   a single ref, the initial overhead is nontrivial, especially when
   you are doing a small incremental update. The worst case is an
   auto-builder that polls every five minutes, even when there are
   no new commits to be fetched [*3*].

 - Because we recompute every time, taking into account what the
   fetcher has, in addition to what the fetcher obtained earlier
   from us, in order to reduce the transferred bytes, the payload
   for incremental updates becomes tailor-made for each fetch and
   cannot be easily reused [*4*].

I'd like to see a new protocol that lets us overcome the above
limitations (did I miss others? I am sure people can help here)
sometime this year.
[Footnotes]

*1* The "first fetch this bundle from elsewhere and then come back
    here for incremental updates" idea raised earlier in this thread
    may be a way to alleviate this, as the large bundle can be served
    from a static file.

*2* An earlier "this symbolic ref points at that concrete ref"
    attempt failed because of this, and we only talk about HEAD.

*3* A new "fetch" protocol must avoid this "one side blindly gives a
    large message as the first thing". I have been toying with the
    idea of making the fetcher talk first, by declaring "I am
    interested in your refs that match refs/heads/* or refs/tags/*,
    and I have a superset of objects that are reachable from the set
    of refs' values X you gave me earlier", where X is a small token
    generated by hashing the output from "git ls-remote $there
    refs/heads/* refs/tags/*". In the best case, where the server
    understands what X is and has cached pack data, it can then send:

    - the differences in the refs that match the wildcards (e.g.
      "Back then at X I did not have refs/heads/next but now I do,
      and it points at this commit. My refs/heads/master is now at
      that commit. I no longer have refs/heads/pu. Everything else
      in the refs/ hierarchy you are interested in is the same as
      state X");

    - the new name of the state Y (again, the hashed value of the
      output from "git ls-remote $there refs/heads/* refs/tags/*"),
      to make sure the above differences can be verified at the
      receiving end; and

    - the cached pack data that contains all necessary objects
      between X and Y.

    Note that the above would work if and only if we accept that it
    is OK to send the objects between the remote-tracking branches
    the fetcher has (i.e. the objects it last fetched from the
    server) and the current tips of branches the server has, without
    optimizing by taking into account that some commits in that set
    may have already been obtained by the fetcher from a third party.
If the server does not recognize state X (after all, it is just a
SHA-1 hash value, so the server cannot recreate the set of refs and
their values from it unless it remembers), the exchange would have to
degenerate to the traditional transfer. The server would want to
recognize the result of hashing an empty string, though; the fetcher
is saying "I have nothing" in that case.

*4* The scheme in *3* can be extended to bring the fetcher up to date
    step-wise. If the server's state was X when the fetcher last
    contacted it, and since then the server received multiple pushes
    and has two snapshots of states, Y and Z, then the exchange may
    go like this:

    fetcher: I am interested in refs/heads/* and refs/tags/* and I
             have your state X.

    server: Here is the incremental difference to the refs and the
            end result should hash to Y. Here comes the pack data to
            bring you up to date.

    fetcher: (after receiving, unpacking and updating the
             remote-tracking refs) Thanks. Do you have more?
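[Editor's note: the state token X in footnote *3* is just a hash over
the ls-remote listing. A minimal sketch follows; the ref names and the
short stand-in values are invented, and a real implementation would
hash the exact bytes that "git ls-remote" prints.]

```python
import hashlib

def state_token(refs):
    """Hash a sorted "<sha1> TAB <refname>" listing (the line format
    `git ls-remote` prints) into one small opaque state token."""
    listing = "".join(
        f"{sha}\t{name}\n" for name, sha in sorted(refs.items())
    )
    return hashlib.sha1(listing.encode()).hexdigest()

# Hypothetical ref states; the values stand in for 40-hex object names.
state_x = {"refs/heads/master": "aaaa", "refs/heads/pu": "bbbb"}
state_y = {"refs/heads/master": "cccc", "refs/heads/next": "dddd"}

token_x = state_token(state_x)
token_y = state_token(state_y)
# A server could cache a pack keyed by (token_x, token_y); a fetcher
# presenting token_x is then served it with no object enumeration.
```

Because the listing is sorted before hashing, both sides compute the
same token regardless of the order in which refs are enumerated.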
Re: Multi-threaded 'git clone'
There currently is a thread on the Gerrit list about how much faster
cloning can be when using Gerrit/jgit-GCed packs with bitmaps versus
C git-GCed packs with bitmaps.

One difference outlined is that jgit seems to have more bitmaps; it
creates one for every ref under refs/heads. Is C git doing that?
Another difference seems to be that jgit creates two packs, splitting
stuff not reachable from refs/heads into its own pack. This makes a
clone take zero CPU server side in the pristine case. In the Gerrit
use case, this second "unreachable" packfile can be sizeable; I
wonder if there are other use cases where this might also be the case
(and this slowing down clones for C git-GCed repos)?

If there is not a lot of parallelism left to squeeze out, perhaps a
focus with better returns is trying to do whatever is possible to
make all clones (and potentially any fetch use case deemed important
on a particular server) take zero CPU. Depending on what a server's
primary mission is, I could envision certain admins willing to
sacrifice significant amounts of disk space to speed up their fetches.

Perhaps some more extreme thinking (such as what must have led to
bitmaps) is worth brainstorming about to improve server use cases.
What if an admin were willing to sacrifice a packfile for every use
case he deemed important; could git be made to support that easily?
For example, maybe the admin considers a clone or a fetch from master
to be important; could zero-percent CPU be achieved regularly for
those two use cases? Cloning is possible if the repository were
repacked in the jgit style after any push to a head. Is it worth
exploring ways of making GC efficient enough to make this feasible?
Can bitmaps be leveraged to make repacking faster? I believe that at
least reachability checking could potentially be improved with
bitmaps.
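[Editor's note: the idea that reachability questions become cheap with
bitmaps can be illustrated with a toy model. The tiny pack, the bit
layout, and the ref names are invented; real bitmaps are stored
compressed in a .bitmap file alongside the pack.]

```python
# Toy reachability bitmaps: each ref tip stores one bit per object in
# the pack, set when that object is reachable from the tip. "Which
# objects must I send?" becomes bitwise arithmetic, not a graph walk.

# Hypothetical pack with 8 objects at bit positions 0-7.
BITMAPS = {
    "refs/heads/master": 0b11110111,   # master reaches 7 objects
    "refs/heads/maint":  0b00110011,   # maint reaches 4 objects
}

def objects_to_send(want, have):
    """Bitmask of objects reachable from a `want` tip but no `have`
    tip, i.e. the objects a fetch of `want` still needs."""
    want_bits = 0
    for ref in want:
        want_bits |= BITMAPS[ref]
    have_bits = 0
    for ref in have:
        have_bits |= BITMAPS[ref]
    return want_bits & ~have_bits

missing = objects_to_send(["refs/heads/master"], ["refs/heads/maint"])
count = bin(missing).count("1")   # how many objects to pack and send
```

In this toy layout, a client that already has maint needs only three
more objects to get master, and the server learns that without ever
touching the commit graph.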
Are there potentially any ways to get better delta reuse during
repacking (not bitmap related), by somehow reversing or translating
deltas onto the new objects that were just received, without actually
recalculating them, yet still getting most objects deltified against
the newest objects (achieving the same packs git GC would achieve
today, but faster)? What other pieces need to be improved to make
repacking faster?

As for the single-branch fetch case, could this somehow be improved
by allocating one or more packfiles to this use case? The simplest
single-branch fetch use case is likely someone doing a git init
followed by a single-branch fetch. I think the android repo tool can
be used in this way, so this may actually be a common use case. With
a packfile dedicated to this branch, git should be able to just
stream it out without any CPU. But I think git would need to know
this packfile exists to be able to use it. It would be nice if
bitmaps could help here, but I believe bitmaps can so far only be
used for one packfile. I understand that making bitmaps span multiple
packfiles would be very complicated, but maybe it would not be so
hard to support bitmaps on multiple packfiles if each of these were
"self contained"? By self contained I mean that all objects
referenced by objects in the packfile are contained in that packfile.

What other still-unimplemented caching techniques could be used to
improve clone/fetch use cases?

- Shallow clones (dedicate a special packfile to this; what about
  another bitmap format that only maps objects in a single tree to
  help this)?

- Small fetches (simple branch FF updates)? I suspect these are fast
  enough, but if not, maybe caching some thin packs (which could
  result in zero-CPU requests for many clients) would be useful.
  Maybe spread these out exponentially over time so that many will
  be available for recent updates and fewer for older updates?
I know git normally throws away thin packs after receiving and
resolving them, but if it kept them around (maybe in a special
directory), it seems that they could be useful for updating other
clients with zero CPU. A thin-pack cache might be something really
easy to manage based on file timestamps; an admin may simply need to
set a max cache size. But how can git know what thin packs it has,
and what they would be useful for? Name them with their starting and
ending SHAs?

Sorry for the long-winded rant. I suspect that some variation of all
my suggestions has already been suggested, but maybe they will
rekindle some older, now useful thoughts, or inspire some new ones.
And maybe some of these are better to pursue than more parallelism?

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project
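[Editor's note: Martin's timestamp-managed thin-pack cache could look
something like the sketch below. The directory layout, the
<start_sha>-<end_sha>.pack naming, and the size cap are assumptions of
mine, not anything git implements.]

```python
import os

def prune_pack_cache(cache_dir, max_bytes):
    """Evict the oldest cached thin packs until the cache directory
    fits under the size cap. Packs are assumed to be named
    <start_sha>-<end_sha>.pack so a server can match a cached pack to
    a client's advertised state by filename alone."""
    packs = [
        os.path.join(cache_dir, name)
        for name in os.listdir(cache_dir)
        if name.endswith(".pack")
    ]
    packs.sort(key=os.path.getmtime)           # oldest first
    total = sum(os.path.getsize(p) for p in packs)
    for path in packs:
        if total <= max_bytes:
            break
        total -= os.path.getsize(path)
        os.unlink(path)                        # drop the oldest pack
```

An admin would only configure max_bytes; the exponential spread over
time that Martin suggests would just be a different sort key.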
Re: Multi-threaded 'git clone'
On Mon, Feb 16, 2015 at 10:43 AM, Junio C Hamano wrote:
> Jeff King writes:
>
>> ... And the whole output is checksummed by a single sha1
>> over the whole stream that comes at the end.
>>
>> I think the most feasible thing would be to quickly spool it to a
>> server on the LAN, and then use an existing fetch-in-parallel tool
>> to grab it from there over the WAN.
>
> One possibility would be for the server to prepare a static bundle
> file to bootstrap all the "clone" clients with and publish it on a
> CDN. A protocol extension would tell the client where to download
> the bundle from, the client can then grab the bundle to clone from
> it to become "slightly stale but mostly up to date", and then do a
> usual incremental update with the server after that to be complete.
>
> The server would update the bundle used to bootstrap clone clients
> periodically in order to keep the incrementals to the minimum, and
> would make sure their bitmap is anchored at the tips of bundles to
> minimize the object counting load during the incremental phase.
>
> I think "repo" used by folks who do AOSP does something similar to
> that by scripting around "git clone". I'd imagine that they would
> be happy to see if "git clone" did all that inside.

Yes, the "repo" tool used by Android uses curl to download a
previously cached $URL/clone.bundle using resumable HTTP. For Android
the file is only updated ~every 6 months at major releases and is
easily cached by CDNs and HTTP proxy servers.

This is spooled to a temporary file on disk, then unpacked using `git
fetch $path/clone.bundle refs/heads/*:refs/remotes/origin/*`.
Afterwards a normal git fetch is run to bring the new clone current
with the server, picking up any delta that happened since the bundle
was created and cached.

The Android Git servers at android.googlesource.com just recognize
*/clone.bundle GET requests and issue 302 redirects to the CDN farm
that actually stores and serves the precreated bundle files.
We really want to see this in stock git clone for HTTP transports, as
other projects like Chromium want to use it for their ~3 GiB
repository. Being able to build the bulk of the repo every few months
and serve it out using a CDN to bootstrap new clients would really
help developers on slower or flaky network connections.
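[Editor's note: the clone.bundle bootstrap flow described above can be
sketched roughly as follows. The helper names and URL handling are my
own, this is not the repo tool's actual code, and `bootstrap_clone`
assumes a remote named origin is already configured in workdir.]

```python
import os
import subprocess

def resume_range(partial_path):
    """Build the HTTP Range header value for resuming a partially
    downloaded bundle, or None if nothing has been downloaded yet."""
    try:
        done = os.path.getsize(partial_path)
    except OSError:
        return None
    return f"bytes={done}-" if done else None

def bootstrap_clone(url, workdir):
    """Fetch $URL/clone.bundle (resumably) and unpack it, then catch
    up with a normal fetch: a sketch of the repo-tool flow."""
    bundle = os.path.join(workdir, "clone.bundle")
    cmd = ["curl", "--fail", "--location", "--output", bundle]
    rng = resume_range(bundle)
    if rng:
        cmd += ["--header", f"Range: {rng}"]   # resume where we left off
    subprocess.run(cmd + [url + "/clone.bundle"], check=True)
    # Unpack the bundle into remote-tracking refs, as described above.
    subprocess.run(["git", "fetch", bundle,
                    "refs/heads/*:refs/remotes/origin/*"],
                   cwd=workdir, check=True)
    # Pick up whatever happened since the bundle was built and cached.
    subprocess.run(["git", "fetch", "origin"], cwd=workdir, check=True)
```

The `--location` flag lets the server answer the */clone.bundle GET
with a 302 redirect to a CDN, as android.googlesource.com does.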
Re: Multi-threaded 'git clone'
On Tue, Feb 17, 2015 at 06:16:39AM +0700, Duy Nguyen wrote:
> On Mon, Feb 16, 2015 at 10:47 PM, Jeff King wrote:
> > Each clone generates the pack on the fly
> > based on what's on disk and streams it out. It should _usually_
> > be the same, but there's nothing to guarantee byte-for-byte
> > equality between invocations.
>
> It's usually _not_ the same. I tried when I wanted to produce
> stable packs. The first condition is single-threaded pack-objects.
> Otherwise the thread scheduler could make the object order
> unpredictable.

True. If you keep your server repositories fully packed, that
eliminates the delta search (and/or makes it feasible to turn
pack.threads down to 1 to make it deterministic). But any change in
the repository (e.g., somebody else pushing, even to a ref you are
not fetching) can cause unexpected changes in the bytes.

-Peff
Re: Multi-threaded 'git clone'
On Mon, Feb 16, 2015 at 10:47 PM, Jeff King wrote:
> Each clone generates the pack on the fly
> based on what's on disk and streams it out. It should _usually_ be
> the same, but there's nothing to guarantee byte-for-byte equality
> between invocations.

It's usually _not_ the same. I tried when I wanted to produce stable
packs. The first condition is single-threaded pack-objects. Otherwise
the thread scheduler could make the object order unpredictable.
--
Duy
Re: Multi-threaded 'git clone'
Jeff King writes:

> ... And the whole output is checksummed by a single sha1
> over the whole stream that comes at the end.
>
> I think the most feasible thing would be to quickly spool it to a
> server on the LAN, and then use an existing fetch-in-parallel tool
> to grab it from there over the WAN.

One possibility would be for the server to prepare a static bundle
file to bootstrap all the "clone" clients with and publish it on a
CDN. A protocol extension would tell the client where to download the
bundle from; the client can then grab the bundle to clone from it to
become "slightly stale but mostly up to date", and then do a usual
incremental update with the server after that to be complete.

The server would update the bundle used to bootstrap clone clients
periodically in order to keep the incrementals to the minimum, and
would make sure their bitmap is anchored at the tips of bundles to
minimize the object counting load during the incremental phase.

I think "repo" used by folks who do AOSP does something similar to
that by scripting around "git clone". I'd imagine that they would be
happy to see if "git clone" did all that inside.
Re: Multi-threaded 'git clone'
On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote:

> > Then the server streams the data to the client. It might do some
> > light work transforming the data as it comes off the disk, but
> > most of it is just blitted straight from disk, and the network is
> > the bottleneck.
>
> Depending on how close to full the WAN link is, it may be possible
> to improve this with multiple connections (again, referencing
> bbcp), but there's also the question of whether it's worth trying
> to use the entire WAN for a single user. The vast majority of the
> time the server is doing more than one thing and would rather let
> any individual user wait a bit and service the other users.

Yeah, I have seen clients that make multiple TCP connections to each
request a chunk of a file in parallel. The short answer is that this
is going to be very hard with git. Each clone generates the pack on
the fly based on what's on disk and streams it out. It should
_usually_ be the same, but there's nothing to guarantee byte-for-byte
equality between invocations.

So you'd have to multiplex all of the connections into the same
server process. And even then it's hard; that process knows it's
going to send you the bytes for object X, but it doesn't know at
exactly which offset until it gets there, which makes sending things
out of order tricky. And the whole output is checksummed by a single
sha1 over the whole stream that comes at the end.

I think the most feasible thing would be to quickly spool it to a
server on the LAN, and then use an existing fetch-in-parallel tool to
grab it from there over the WAN.

-Peff
Re: Multi-threaded 'git clone'
On Mon, 16 Feb 2015, Jeff King wrote:

> On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:
>
> > I think it's an interesting question to look at, but before you
> > start looking at changing the architecture of the current code, I
> > would suggest doing a bit more analysis of the problem to see if
> > the bottleneck is really where you think it is.
> >
> > First measure, then optimize :-)
>
> Yes, very much so. Fortunately some people have already done some
> of this work. :)

nice summary

> Then the server streams the data to the client. It might do some
> light work transforming the data as it comes off the disk, but most
> of it is just blitted straight from disk, and the network is the
> bottleneck.

Depending on how close to full the WAN link is, it may be possible to
improve this with multiple connections (again, referencing bbcp), but
there's also the question of whether it's worth trying to use the
entire WAN for a single user. The vast majority of the time the
server is doing more than one thing and would rather let any
individual user wait a bit and service the other users.

David Lang
Re: Multi-threaded 'git clone'
On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:

> I think it's an interesting question to look at, but before you
> start looking at changing the architecture of the current code, I
> would suggest doing a bit more analysis of the problem to see if
> the bottleneck is really where you think it is.
>
> First measure, then optimize :-)

Yes, very much so. Fortunately some people have already done some of
this work. :)

On the server side of a clone, the things that must be done before
sending any data are:

  1. Count up all of the objects that must be sent by traversing the
     object graph.

  2. Find any pairs for delta compression (this is the "Compressing
     objects" phase of the progress reporting).

Step (1) naively takes 30-45 seconds for a kernel repo. However, with
reachability bitmaps, it's instant-ish. I just did a clone from
kernel.org, and it looks like they've turned on bitmaps.

For step (2), git will reuse deltas that already exist in the on-disk
packfile, and will not consider new deltas between objects that are
already in the same pack (because we would already have considered
them when packing in the first place). So the key for servers is to
keep things pretty packed. My kernel.org clone shows that they could
probably stand to repack torvalds/linux.git, but it's not too
terrible. This part is multithreaded, so what work we do happens in
parallel. But note that some servers may turn pack.threads down to 1
(since their many CPUs are kept busy by multiple requests, rather
than trying to finish a single one).

Then the server streams the data to the client. It might do some
light work transforming the data as it comes off the disk, but most
of it is just blitted straight from disk, and the network is the
bottleneck.

On the client side, the incoming data streams into an index-pack
process. For each full object it sees, it hashes and records the name
of the object as it comes in. For deltas, it queues them for
resolution after the complete pack arrives.
Once the full pack arrives, it resolves all of the deltas. This part
is also multithreaded. If you check out "top" during the "resolving
deltas" phase of the clone, you should see multiple cores in use.

So I don't think there is any room for "just multithread it" in this
process. The CPU-intensive bits are already multithreaded. There may
be room for optimizing that, though (e.g., reducing lock contention
or similar).

It would also be possible to resolve deltas while the pack is
streaming in, rather than waiting until the whole thing arrives.
That's not possible in all cases (an object may be a delta against a
base that comes later in the pack), but in practice git puts bases
before their deltas. However, it's overall less efficient, because
you may end up walking through the same parts of the delta chain more
than once.

For example, imagine you see a stream of objects A, B, C, D. You get
B and see that it's a delta against A. So you resolve it, hash the
object, and are good. Now you see C, which is a delta against B. To
generate C, you have to compute B again. Now you get to D, which is
another delta against B. So now we compute B again.

You can get around this somewhat with a cache of intermediate object
contents, but of course there may be hundreds or thousands of chains
like this in use at once, so you're going to end up with some cache
misses.

What index-pack does instead is to wait until it has all of the
objects, then find A and say "what objects use A as a base?". Then it
computes B, hashes it, and says "what objects use B as a base?". And
it finds C and D, after which it knows it can drop the intermediate
result B. So that's less work overall, though in some workloads it
may finish faster if you were to stream it (because your many
processors are sitting idle while we are blocked on network
bandwidth). So that's a potential area of exploration.
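[Editor's note: the two resolution strategies in the A, B, C, D
example can be compared with a toy model. "Applying a delta" is
modeled as needing the base's content; the chain shape mirrors the
text, and this is of course not index-pack's actual code.]

```python
# Toy comparison of streaming vs base-first delta resolution.
# Each pack entry is (name, base); base None means a full object.
PACK = [("A", None), ("B", "A"), ("C", "B"), ("D", "B")]

def resolve_streaming(pack):
    """Resolve each delta as it arrives, rebuilding bases on demand
    with no cache. Returns how many delta applications happened."""
    bases = {name: base for name, base in pack}
    applications = 0

    def compute(name):
        nonlocal applications
        if bases[name] is None:
            return                     # full object, nothing to apply
        compute(bases[name])           # rebuild the base from scratch
        applications += 1

    for name, base in pack:
        if base is not None:
            compute(name)
    return applications

def resolve_index_pack(pack):
    """Resolve base-first, as index-pack does: compute each object
    once, hand it to every delta that uses it as a base, then drop
    the intermediate result."""
    kids = {}
    for name, base in pack:
        kids.setdefault(base, []).append(name)
    applications = 0
    queue = list(kids.get(None, []))   # start from the full objects
    while queue:
        obj = queue.pop()
        for kid in kids.get(obj, []):
            applications += 1          # each delta applied exactly once
            queue.append(kid)
    return applications
```

On this four-object pack the streaming strategy applies five deltas
(B is rebuilt for both C and D) while base-first applies three, one
per delta object, matching the walkthrough above.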
-Peff
Re: Multi-threaded 'git clone'
On Mon, 16 Feb 2015, Koosha Khajehmoogahi wrote:

> Cloning huge repositories like the Linux kernel takes a
> considerable amount of time. Is it possible to incorporate
> multi-threaded simultaneous-connections functionality for cloning?
> To what extent would we need to change the architecture of the
> current code, and how large would be the scope of the work? That
> just seems an interesting idea to me and I would like to share it
> with the community.

The key question is what it is that takes the time in cloning, and
can that be multi-threaded?

If it's the network traffic that takes the most time, where is the
bottleneck? Is it in the server software assembling what will be
sent? Is it in the receiving software processing it? If so, multiple
threads could help.

Is it in network bandwidth? If so, doing multiple connections won't
help much. TCP connections favour a few connections passing a lot of
data rather than many connections passing a little. The one place
where multiple connections can help is when you have packet loss that
is not congestion induced, as a lost packet on a connection will
cause the throughput of that connection to drop (if the drop is due
to congestion, this is TCP working as designed, throttling back to
match the available bandwidth). This can be a significant effect if
you have a very high-bandwidth, high-latency connection (think
multiple Gb on international connections), but for lower-bandwidth
connections it's much less of a factor. You can look at projects like
bbcp.

I think it's an interesting question to look at, but before you start
looking at changing the architecture of the current code, I would
suggest doing a bit more analysis of the problem to see if the
bottleneck is really where you think it is.

First measure, then optimize :-)

David Lang
Multi-threaded 'git clone'
Greetings,

Cloning huge repositories like the Linux kernel takes a considerable
amount of time. Is it possible to incorporate multi-threaded
simultaneous-connections functionality for cloning? To what extent
would we need to change the architecture of the current code, and how
large would be the scope of the work? That just seems an interesting
idea to me and I would like to share it with the community.

Regards
--
Koosha