Re: Proposal for "fetch-any-blob Git protocol" and server design

Jonathan Tan Wed, 26 Apr 2017 15:52:13 -0700

On 04/21/2017 09:41 AM, Kevin David wrote:

Hi Jonathan,


Sorry for the delayed response - other work got in the way, unfortunately!


No worries!

I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a
pre-command or hook to identify needed blobs and pre-fetch them before
allowing the actual command to start"), so a Git command would typically
make a single request that contains all the blobs required, but my
proposal can also handle (1c) ('"fault" them in as necessary in
read_object() while the command is running and without any pre-fetch
(either synchronously or asynchronously and with/without a helper
process)').

Even if we decided to go with single-blob requests and responses, it is
still important to send them as packfiles, so that the server can serve
them directly from its compressed storage without first having to
uncompress them.

[1] 
https://public-inbox.org/git/1488999039-37631-1-git-send-email-...@jeffhostetler.com/


Ah, I missed this. If we think we can build meaningfully-sized requests via 
(1a) and (1b),
then I agree - packs are optimal.

However, if single-blob requests/responses dominate, I have a few concerns, 
mostly from
experience playing with something similar:
* Regardless of the size of the object or number returned, the client will need 
to
  `index-pack` the result and create a corresponding `.idx` file, which requires
  decompression to construct (right?)

Yes, although I think that loose objects would need to be decompressedtoo to verify the SHA-1.

* Unless the client's repository is repacked aggressively, we'll pollute the
  `.git\objects\pack` directory with little indexes (somewhere around 4KiB 
minimum)
  and packfiles rapidly, which would degrade performance. One rudimentary 
workaround
  would be to loosen these packs on the client if they were under a certain
  size/object count. I think fetch does this already?

In either case, shifting the pack decompression/loose object recompression 
problem
to the client instead of the server is probably a good principle, but in our 
case it
simply wasn't fast enough to serve single blobs interactively (e.g. opening a 
source
file you don't have locally). I'm hopeful that the proposed partial clone 
solutions you
referenced would reduce the frequency of this being required.

I wrote up a proposal for how Git should handle missing blobs, includinga hook that, when invoked, is allowed to write packed or loose objects[1]. I was thinking along similar lines too about the many-packfileissue, and that was part of the reason why [1] was written the way it is.

I think that, by default, the wire protocol should transmit packsbecause it causes less load on the server (as you describe) and itsupports batch requests. Also, this protocol with single-blob fetchingmight be usable as-is for projects that only exclude large blobs (sothere are relatively few of them), and will be more and more usable aswe migrate the more commonly used Git commands to provide batchrequests. However, the ability to customize the hook allows users to useanother download scheme more suited to their use cases (for example, inyours, where you can provision servers that are topologically close toyour clients).


[1] <20170426221346.25337-1-jonathanta...@google.com>

Being a bit more clever about packing objects (e.g. splitting blobs out from 
commits
and trees) improved this a bit, but we still hit a bottlenecks from what 
appeared to
be a large number of memory-mapping operations on a ~140GiB packfile of blobs.

Each `pack-objects` process would consume approximately one CPU core for the
duration of the request. It's possible that further splitting of these large 
blob packs
would have improved performance in some scenarios, but that would increase the
amount of pack-index lookups necessary to find a single object.


I'm not very experienced with mmap, but I thought that memory-mapping a
large file in itself does not incur much of a performance penalty (if
any) - it is the accesses that count. I experimented with 15,000 and
150,000 MiB files and mmap and they seem to be handled quite well. Also,
how many objects are "pack-objects" packing here?


Back when we took this approach, it was ~4000 blob objects at a time.
Perhaps we were being bitten by the Windows implementation of git_mmap[3]?.
When I profiled the 4k blob scenario, the majority of CPU and wall time was
spent in MapViewOfFileEx, which looks like it could mean accesses as well.

[3] https://github.com/git/git/blob/master/compat/win32mmap.c

Hmm...I don't have much experience with this, but this may be somethingto keep in mind.

This seems like a clever way to avoid the canonical 
`/info/refs?service=git-upload-pack`
capability negotiation on every call. However, using error handling to fallback 
seems
slightly wonky to me. Hopefully users are incentivized to upgrade their clients.


By "error handling to fallback", do you mean in my proposal or in a
possible future one (assuming my proposal is implemented)? I don't think
my proposal requires any error handling to fallback (since only new
clients can clone partially - old clients will just clone totally and
obliviously), but I acknowledge that this proposal does not mean that
any future proposal can be done without requiring error handling to
fallback.


Right, I was talking about the possible future one - more around the
concept of back-compat in the event of any protocol changes. I don't want
to spend too much time focusing on what we might want in the future, but a
thought I just had: what about versioning as a part of the URL? For example,
`/server-endpoint?version=1.0`. This could also enable breaking changes for
existing commands.

That could work, although we could put a version number in the commandname (e.g. "fetch-any-blob-2") too.

Ah, interesting. We solved the many-refs problem using a different approach -
basically, limiting what the server returns based on who the user is and 
preferences
they set via our web interface or an API. I've been promised by a few 
colleagues that
we'll have more to share here soon... With your proposal, how does the client 
choose
which refs they want to fetch?

The "want" lines can now take names (including glob characters). A listof refs (names and hashes) that correspond to those "want"s will bereturned by the server right before the packfile.

(So the client doesn't "choose" in the sense that it has a list ofoptions and it picks a subset of them - it chooses by sending one ormore strings that the server will match.)

I took a look at look at your patch series. I'm nowhere near qualified to give 
feedback
given my lack of experience with the core git implementation and C in general, 
but it
looks reasonable. Hoping we can come up with a better name than 
"server-endpoint" though :)


Thanks, and I'm open to suggestions for the name. :-)

Just to keep the discussion interesting, I'll throw an alternative out there 
that's
worked well for us. As I understand it, the HTTP-based dumb transfer protocol
supports returning objects in loose object format, but only if they already 
exist
in loose format.

Extending this to have the remote provide these objects via a "dumb" protocol
when they are packed as well - i.e. the server would "loosens" them upon 
request -
is basically what we do and it works quite well for low-latency clients. To 
further improve
performance at the cost of complexity, we've added caching at the memory and 
disk layer
for these loose objects in the same format we send to the client.

There's a clear tradeoff here - the servers must have adequate disk and/or 
memory to store
these loose objects in optimal format. In addition, the higher the latency is 
to the remote,
the worse this solution will perform. Fortunately, in our case, network 
topology allows us to
put these caching proxies close enough to clients for it not to matter.


This does make sense in the situation you describe, but (as you said) I
don't think we can guarantee this in the majority of situations. I think
some sort of batching (like the (1a) solution I talked about near the
start of this e-mail) and serving packed data from packed storage should
form the baseline, and any situation-specific optimizations (e.g.
serving unpacked data from topologically-close servers) can be
additional steps.


Agreed - our choice of solution was largely driven by our lack of ability to 
batch,
performance issues with `pack-objects` at scale, and ability to deploy the
previously-mentioned proxies very close to clients. If we can solve the first
problem (batching), optimizing for the hopefully-less-common single blob at a 
time
scenario becomes less important. If we assume that happens, I think your 
proposal
looks sound overall.


Thanks.

Re: Proposal for "fetch-any-blob Git protocol" and server design

Reply via email to