On Feb 19, 2015 5:42 PM, David Turner <dtur...@twopensource.com> wrote:
>
> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: 
> > >    * 'git push'? 
> > 
> > This one is not affected by how deep your repo's history is, or how 
> > wide your tree is, so it should be quick. 
> > 
> > Ah, the number of refs may affect both git-push and git-pull. I think 
> > Stefan knows better than I do in this area. 
>
> I can tell you that this is a bit of a problem for us at Twitter.  We 
> have over 100k refs, which adds ~20MiB of downstream traffic to every 
> push. 
>
> I added a hack to improve this locally inside Twitter: The client sends 
> a bloom filter of shas that it believes the server knows about; the 
> server sends only the sha of master and any refs that are not in the 
> bloom filter.  The client uses its local version of the server's refs 
> as if they had just been sent.  This means that some packs will be 
> suboptimal, due to false positives in the bloom filter leading some new 
> refs to not be sent.  Also, if there were a repack between the pull and 
> the push, some refs might have been deleted on the server; we repack 
> rarely enough and pull frequently enough that this is hopefully not an 
> issue. 
>
> We're still testing to see if this works.  But due to the number of 
> assumptions it makes, it's probably not that great an idea for general 
> use. 

Good to hear that others are starting to experiment with solutions to this 
problem!  I hope to hear more updates on this.
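
If I follow your approach correctly, it is roughly something like the sketch
below (plain Python, hypothetical names, made-up filter size and hash count,
and obviously not your actual patch):

import hashlib

class BloomFilter:
    """Toy bloom filter keyed by hex shas (sizes here are arbitrary)."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, sha):
        for i in range(self.num_hashes):
            h = hashlib.sha1(("%d:%s" % (i, sha)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, sha):
        for p in self._positions(sha):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, sha):
        # May return True for a sha that was never added (false positive).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(sha))

def server_advertisement(server_refs, client_filter):
    """Advertise master plus only the refs whose sha the client (probably)
    does not already know.  A false positive means a new ref is omitted,
    which is what can make the resulting pack suboptimal."""
    advertised = {"refs/heads/master": server_refs["refs/heads/master"]}
    for ref, sha in server_refs.items():
        if not client_filter.might_contain(sha):
            advertised[ref] = sha
    return advertised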

I have a prototype of a simpler, and I believe more robust, solution, though it 
is aimed at a smaller use case.  On connecting, the client computes a single 
"verification" sha over all of its refs/shas that match a refspec (the refs it 
believes the server likely has the same values for), and sends that sha along 
with the refspec to the server.  The server can then calculate the same sha 
over its own refs/shas which meet that refspec, and omit sending those refs if 
the "verification" sha matches, instead sending only a confirmation that they 
matched (along with any refs outside of the refspec).  On a match, the client 
can inject the local values of the refs which met the refspec and be guaranteed 
that they match the server's values.
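
To make that concrete, here is a rough sketch of the idea in Python
(hypothetical names, not the actual prototype code):

import hashlib
from fnmatch import fnmatch

def verification_sha(refs, refspec):
    """Hash the sorted ref -> sha pairs that match the refspec."""
    h = hashlib.sha1()
    for ref in sorted(refs):
        if fnmatch(ref, refspec):
            h.update(("%s %s\n" % (ref, refs[ref])).encode())
    return h.hexdigest()

# Client: send the refspec plus a sha over its local view of the server's refs.
# Server: if its own refs hash to the same value, skip advertising them.
def advertise(server_refs, refspec, client_sha):
    if verification_sha(server_refs, refspec) == client_sha:
        outside = {r: s for r, s in server_refs.items()
                   if not fnmatch(r, refspec)}
        return ("match", outside)           # client injects its local values
    return ("mismatch", dict(server_refs))  # fall back to a full advertisement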

This optimization is aimed at the worst-case scenario (and thus offers the 
potentially best-case "compression"): when the client and server match for all 
refs (a refs/* refspec).  This is something that happens often on Gerrit server 
startup, when it verifies that its mirrors are up to date.  One reason I chose 
this as a starting optimization is that I think it is one use case which will 
actually not benefit from "fixing" the git protocol to only send relevant refs, 
since all the refs are in fact relevant here!  So something like this will 
likely be needed in any future git protocol in order for it to be efficient for 
this use case.  And I believe this use case is likely to stick around.

With a minor tweak, this optimization should also work when replicating actual 
expected updates, by excluding the refs expected to be updated from the 
verification so that the server always sends their values, since they will 
likely not match and would otherwise wreck the optimization (see the small 
sketch below).  However, for this use case it is not clear whether it is even 
worth caring about the non-updating refs.  In theory, knowledge of the 
non-updating refs can potentially reduce the amount of data transmitted, but I 
suspect that as the ref count increases this has diminishing returns and mostly 
ends up chewing up CPU and memory in a vain attempt to reduce network traffic.
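
A minimal sketch of that tweak, using the same hypothetical shape as above: the
refs we expect to update are simply left out of the hash, so the server always
advertises their (probably different) values explicitly:

import hashlib
from fnmatch import fnmatch

def verification_sha_excluding(refs, refspec, expected_updates):
    """Same verification sha as before, but skipping refs we expect to change."""
    h = hashlib.sha1()
    for ref in sorted(refs):
        if fnmatch(ref, refspec) and ref not in expected_updates:
            h.update(("%s %s\n" % (ref, refs[ref])).encode())
    return h.hexdigest()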

Please do keep us up to date on your results,

-Martin


Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative 
Project