RE: [RFC] Add support for downloading blobs on demand

2017-02-23 Thread Ben Peart
I've completed the work of switching our read_object proposal to use a 
background process (refactored from the LFS code) and have extricated it
from the rest of our GVFS fork so that it can be examined/tested
separately.  It is currently based on a Git for Windows fork that I've
pushed to GitHub for anyone who is interested in viewing it at:

https://github.com/benpeart/git/tree/read-object-process
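
For anyone who wants to try it locally, cloning that branch directly
should work (branch name taken from the tree URL above):

  git clone -b read-object-process https://github.com/benpeart/git.git
  cd git && make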

After some additional conversations with Christian, we're working to
combine our RFC/patch series into a single solution that should meet the
requirements of both.

The combined solution needs an "info" function that requests information
about a single object, rather than a "have" function that must return
information on all the objects the ODB knows about, since the latter
doesn't scale when the number of objects is large.

This means the "info" call has to be fast, so spawning a new process on
every call won't work.  A background process with a versioned interface
that allows capability negotiation should solve this problem.
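
To make that concrete, the handshake I have in mind is modeled on the
long-running filter process protocol (pkt-line based, with a version and
capability exchange).  A rough sketch; the exact command and capability
names below are illustrative, not final:

  packet:          git> git-read-object-client
  packet:          git> version=1
  packet:          git> 0000
  packet:          git< git-read-object-server
  packet:          git< version=1
  packet:          git< 0000
  packet:          git> capability=get
  packet:          git> 0000
  packet:          git< capability=get
  packet:          git< 0000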

Ben

> -Original Message-
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Tuesday, February 7, 2017 1:21 PM
> To: 'Christian Couder' <christian.cou...@gmail.com>
> Cc: 'Jeff King' <p...@peff.net>; 'git' <git@vger.kernel.org>; 'Johannes
> Schindelin' <johannes.schinde...@gmx.de>; Ben Peart
> <benpe...@microsoft.com>
> Subject: RE: [RFC] Add support for downloading blobs on demand
> 
> No worries about a late response; I'm sure this is the start of a long
> conversation. :)
> 
> > -Original Message-
> > From: Christian Couder [mailto:christian.cou...@gmail.com]
> > Sent: Sunday, February 5, 2017 9:04 AM
> > To: Ben Peart <peart...@gmail.com>
> > Cc: Jeff King <p...@peff.net>; git <git@vger.kernel.org>; Johannes
> > Schindelin <johannes.schinde...@gmx.de>
> > Subject: Re: [RFC] Add support for downloading blobs on demand
> >
> > (Sorry for the late reply and thanks to Dscho for pointing me to this
> > thread.)
> >
> > On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart <peart...@gmail.com>
> wrote:
> > >> From: Jeff King [mailto:p...@peff.net] On Fri, Jan 13, 2017 at
> > >> 10:52:53AM -0500, Ben Peart wrote:
> > >>
> > >> > Clone and fetch will pass a "--lazy-clone" flag (open to a better
> > >> > name
> > >> > here) similar to "--depth" that instructs the server to only
> > >> > return commits and trees and to ignore blobs.
> > >> >
> > >> > Later during git operations like checkout, when a blob cannot be
> > >> > found after checking all the regular places (loose, pack,
> > >> > alternates, etc), git will download the missing object and place
> > >> > it into the local object store (currently as a loose object) then
> > >> > resume the
> > operation.
> > >>
> > >> Have you looked at the "external odb" patches I wrote a while ago,
> > >> and which Christian has been trying to resurrect?
> > >>
> > >>   http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> > >>
> > >> This is a similar approach, though I pushed the policy for "how do
> > >> you get the objects" out into an external script. One advantage
> > >> there is that large objects could easily be fetched from another
> > >> source entirely (e.g., S3 or equivalent) rather than the repo itself.
> > >>
> > >> The downside is that it makes things more complicated, because a
> > >> push or a fetch now involves three parties (server, client, and the
> > >> alternate object store). So questions like "do I have all the
> > >> objects I need" are hard to reason about.
> > >>
> > >> If you assume that there's going to be _some_ central Git repo
> > >> which has all of the objects, you might as well fetch from there
> > >> (and do it over normal git protocols). And that simplifies things a
> > >> bit, at the cost of
> > being less flexible.
> > >
> > > We looked quite a bit at the external odb patches, as well as lfs
> > > and even using alternates.

RE: [RFC] Add support for downloading blobs on demand

2017-02-07 Thread Ben Peart
Thanks, Jakub.

Just so you are aware, this isn't a separate effort; it is the same
effort as the GVFS work from Microsoft.  For pragmatic reasons, we
implemented the lazy clone support and on-demand object downloading in
our own codebase (GVFS) first and are now working to move it into git
natively so that it will be available everywhere git is available.  This
RFC is just one step in that process.

As we mentioned at Git Merge, we looked into Mercurial but settled on Git as 
our version control solution.  We are, however, in active communication with 
the team from Facebook to share ideas.

Ben

> -Original Message-
> From: Jakub Narębski [mailto:jna...@gmail.com]
> Sent: Tuesday, February 7, 2017 4:57 PM
> To: Ben Peart <peart...@gmail.com>; 'Christian Couder'
> <christian.cou...@gmail.com>
> Cc: 'Jeff King' <p...@peff.net>; 'git' <git@vger.kernel.org>; 'Johannes
> Schindelin' <johannes.schinde...@gmx.de>; Ben Peart
> <benpe...@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> I'd like to point to two (or rather one and a half) solutions that I
> became aware of while watching the streaming of "Git Merge 2017"[0].
> There should be people here who were there; hopefully video of those
> presentations and slides / notes will soon be available.
> 
> [0]: http://git-merge.com/
> 
> The first tool that I'd like to point to is Git Virtual File System, or
> GVFS for short (which unfortunately shares its abbreviation with the
> GNOME Virtual File System).
> 
> The presentation was "Scaling Git at Microsoft" by Saeed Noursalehi,
> Microsoft.  You can read about this solution in an Ars Technica article[1]
> and on the Microsoft blog[2].  The code (or an early version thereof) is
> also available[3] - I wonder why on GitHub and not CodePlex...
> 
> [1]: https://arstechnica.com/information-technology/2017/02/microsoft-
> hosts-the-windows-source-in-a-monstrous-300gb-git-repository/
> [2]:
> https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-
> gvfs-git-virtual-file-system/
> [3]: https://github.com/Microsoft/GVFS
> 
> 
> The second presentation that might be of some interest is "Scaling Mercurial
> at Facebook: Insights from the Other Side" by Durham Goode, Facebook.
> The code is supposedly available as open source, though I don't know how
> useful their 'blob storage' solution would be for your problem.
> 
> 
> HTH
> --
> Jakub Narębski




Re: [RFC] Add support for downloading blobs on demand

2017-02-07 Thread Jakub Narębski
I'd like to point to two (or rather one and a half) solutions that I
became aware of while watching the streaming of "Git Merge 2017"[0].
There should be people here who were there; hopefully video of those
presentations and slides / notes will soon be available.

[0]: http://git-merge.com/

The first tool that I'd like to point to is Git Virtual File System, or
GVFS for short (which unfortunately shares its abbreviation with the
GNOME Virtual File System).

The presentation was "Scaling Git at Microsoft" by Saeed Noursalehi, 
Microsoft.  You can read about this solution in an Ars Technica article[1]
and on the Microsoft blog[2].  The code (or an early version thereof) is
also available[3] - I wonder why on GitHub and not CodePlex...

[1]: 
https://arstechnica.com/information-technology/2017/02/microsoft-hosts-the-windows-source-in-a-monstrous-300gb-git-repository/
[2]: 
https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/
[3]: https://github.com/Microsoft/GVFS


The second presentation that might be of some interest is "Scaling
Mercurial at Facebook: Insights from the Other Side" by Durham Goode,
Facebook.  The code is supposedly available as open source, though
I don't know how useful their 'blob storage' solution would be for
your problem.


HTH
-- 
Jakub Narębski



RE: [RFC] Add support for downloading blobs on demand

2017-02-07 Thread Ben Peart
No worries about a late response; I'm sure this is the start of a long
conversation. :)

> -Original Message-
> From: Christian Couder [mailto:christian.cou...@gmail.com]
> Sent: Sunday, February 5, 2017 9:04 AM
> To: Ben Peart <peart...@gmail.com>
> Cc: Jeff King <p...@peff.net>; git <git@vger.kernel.org>; Johannes Schindelin
> <johannes.schinde...@gmx.de>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> (Sorry for the late reply and thanks to Dscho for pointing me to this thread.)
> 
> On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart <peart...@gmail.com> wrote:
> >> From: Jeff King [mailto:p...@peff.net] On Fri, Jan 13, 2017 at
> >> 10:52:53AM -0500, Ben Peart wrote:
> >>
> >> > Clone and fetch will pass a "--lazy-clone" flag (open to a better
> >> > name
> >> > here) similar to "--depth" that instructs the server to only return
> >> > commits and trees and to ignore blobs.
> >> >
> >> > Later during git operations like checkout, when a blob cannot be
> >> > found after checking all the regular places (loose, pack,
> >> > alternates, etc), git will download the missing object and place it
> >> > into the local object store (currently as a loose object) then resume the
> operation.
> >>
> >> Have you looked at the "external odb" patches I wrote a while ago,
> >> and which Christian has been trying to resurrect?
> >>
> >>
> >>   http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> >>
> >> This is a similar approach, though I pushed the policy for "how do
> >> you get the objects" out into an external script. One advantage there
> >> is that large objects could easily be fetched from another source
> >> entirely (e.g., S3 or equivalent) rather than the repo itself.
> >>
> >> The downside is that it makes things more complicated, because a push
> >> or a fetch now involves three parties (server, client, and the
> >> alternate object store). So questions like "do I have all the objects
> >> I need" are hard to reason about.
> >>
> >> If you assume that there's going to be _some_ central Git repo which
> >> has all of the objects, you might as well fetch from there (and do it
> >> over normal git protocols). And that simplifies things a bit, at the cost 
> >> of
> being less flexible.
> >
> > We looked quite a bit at the external odb patches, as well as lfs and
> > even using alternates.  They all share a common downside that you must
> > maintain a separate service that contains _some_ of the files.
> 
> Pushing the policy for "how do you get the objects" out into an external
> helper doesn't mean that the external helper cannot use the main service.
> The external helper is still free to do whatever it wants including calling 
> the
> main service if it thinks it's better.

That is a good point, and you're correct: it means you can avoid having
to build out multiple services.

> 
> > These
> > files must also be versioned, replicated, backed up and the service
> > itself scaled out to handle the load.  As you mentioned, having
> > multiple services involved increases flexibility but it also increases
> > the complexity and decreases the reliability of the overall version
> > control service.
> 
> About reliability, I think it depends a lot on the use case. If you want to 
> get
> very big files over an unreliable connection, it can be better if you send those
> big
> files over a restartable protocol and service like HTTP/S on a regular web
> server.
> 

My primary concern about reliability was the multiplicative effect of
making multiple requests across multiple servers to complete a single
operation.  Putting this all in a single service, as you suggested above,
brings us back to parity on complexity.

> > For operational simplicity, we opted to go with a design that uses a
> > single, central git repo which has _all_ the objects and to focus on
> > enhancing it to handle large numbers of files efficiently.  This
> > allows us to focus our efforts on a great git service and to avoid
> > having to build out these other services.
> 
>

Re: [RFC] Add support for downloading blobs on demand

2017-02-05 Thread Christian Couder
(Sorry for the late reply and thanks to Dscho for pointing me to this thread.)

On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart  wrote:
>> From: Jeff King [mailto:p...@peff.net]
>> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
>>
>> > Clone and fetch will pass a  --lazy-clone  flag (open to a better name
>> > here) similar to  --depth  that instructs the server to only return
>> > commits and trees and to ignore blobs.
>> >
>> > Later during git operations like checkout, when a blob cannot be found
>> > after checking all the regular places (loose, pack, alternates, etc),
>> > git will download the missing object and place it into the local
>> > object store (currently as a loose object) then resume the operation.
>>
>> Have you looked at the "external odb" patches I wrote a while ago, and
>> which Christian has been trying to resurrect?
>>
>>   http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
>>
>> This is a similar approach, though I pushed the policy for "how do you get 
>> the
>> objects" out into an external script. One advantage there is that large 
>> objects
>> could easily be fetched from another source entirely (e.g., S3 or equivalent)
>> rather than the repo itself.
>>
>> The downside is that it makes things more complicated, because a push or a
>> fetch now involves three parties (server, client, and the alternate object
>> store). So questions like "do I have all the objects I need" are hard to 
>> reason
>> about.
>>
>> If you assume that there's going to be _some_ central Git repo which has all
>> of the objects, you might as well fetch from there (and do it over normal git
>> protocols). And that simplifies things a bit, at the cost of being less 
>> flexible.
>
> We looked quite a bit at the external odb patches, as well as lfs and
> even using alternates.  They all share a common downside that you must
> maintain a separate service that contains _some_ of the files.

Pushing the policy for "how do you get the objects" out into an
external helper doesn't mean that the external helper cannot use the
main service.
The external helper is still free to do whatever it wants including
calling the main service if it thinks it's better.

> These
> files must also be versioned, replicated, backed up and the service
> itself scaled out to handle the load.  As you mentioned, having multiple
> services involved increases flexibility but it also increases the
> complexity and decreases the reliability of the overall version control
> service.

About reliability, I think it depends a lot on the use case. If you
want to get very big files over an unreliable connection, it can be
better if you send those big files over a restartable protocol and
service like HTTP/S on a regular web server.

> For operational simplicity, we opted to go with a design that uses a
> single, central git repo which has _all_ the objects and to focus on
> enhancing it to handle large numbers of files efficiently.  This allows
> us to focus our efforts on a great git service and to avoid having to
> build out these other services.

Ok, but I don't think it prevents you from using at least some of the
same mechanisms that the external odb series is using.
And reducing the number of mechanisms in Git itself is great for its
maintainability and simplicity.

>> > To prevent git from accidentally downloading all missing blobs, some
>> > git operations are updated to be aware of the potential for missing blobs.
>> > The most obvious being check_connected which will return success as if
>> > everything in the requested commits is available locally.
>>
>> Actually, Git is pretty good about trying not to access blobs when it doesn't
>> need to. The important thing is that you know enough about the blobs to
>> fulfill has_sha1_file() and sha1_object_info() requests without actually
>> fetching the data.
>>
>> So the client definitely needs to have some list of which objects exist, and
>> which it _could_ get if it needed to.

Yeah, and the external odb series handles that already, thanks to
Peff's initial work.

>> The one place you'd probably want to tweak things is in the diff code, as a
>> single "git log -Sfoo" would fault in all of the blobs.
>
> It is an interesting idea to explore how we could be smarter about
> preventing blobs from faulting in if we had enough info to fulfill
> has_sha1_file() and sha1_object_info().  Given we also heavily prune the
> working directory using sparse-checkout, this hasn't been our top focus
> but it is certainly something worth looking into.

The external odb series doesn't handle preventing blobs from faulting
in yet, so this could be a common problem.

[...]

>> One big hurdle to this approach, no matter the protocol, is how you
>> are going to handle deltas.

RE: [RFC] Add support for downloading blobs on demand

2017-01-18 Thread Ben Peart
We actually spent quite a while trying to make submodules work and even
built tooling to work around some of the issues we ran into (not repo.py,
but along similar lines) before determining that we would be better
served by having a single repo and solving the scale issues.  I don't
want to rehash the arguments for/against a single repo - suffice it to
say, we have opted for a single large repo. :)

Thanks,

Ben
> -Original Message-
> From: Stefan Beller [mailto:sbel...@google.com]
> Sent: Tuesday, January 17, 2017 5:24 PM
> To: Martin Fick <mf...@codeaurora.org>
> Cc: Ben Peart <peart...@gmail.com>; Shawn Pearce
> <spea...@spearce.org>; git <git@vger.kernel.org>;
> benpe...@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick <mf...@codeaurora.org>
> wrote:
> > On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> >> While large files can be a real problem, our biggest issue today is
> >> having a lot (millions!) of source files when any individual
> >> developer only needs a small percentage of them.  Git with 3+ million
> >> local files just doesn't perform well.
> >
> > Honestly, this sounds like a problem better dealt with by using git
> > subtree or git submodules, have you considered that?
> >
> > -Martin
> >
> 
> I cannot speak for subtrees as I have very little knowledge on them.
> But there you also have the problem that *someone* has to have a whole
> tree? (e.g. the build bot)
> 
> Submodules, however, come with a couple of things attached, both
> positive and negative points:
> 
> * it offers ACLs along the way. ($user may not be allowed to
>   clone all submodules, but only those related to the work)
> * The conceptual understanding of git just got a lot harder.
>   ("Yo dawg, I heard you like git, so I put git repos inside
>   other git repos"), it is not easy to come up with reasonable
>   defaults for all usecases, so the everyday user still has to
>   have some understanding of submodules.
> * typical cheap in-tree operations may become very expensive:
>   -> moving a file from one location to another (in another
>  submodule) adds overhead, no rename detection.
> * We are actively working on submodules, so there is
>   some momentum going already.
> * our experiments with Android show that e.g. fetching (even
>   if you have all of Android) becomes a lot faster for everyday
>   usage, as only a few repositories change each day. This
>   comparison was against the repo tool that we currently
>   use for Android. I do not know how it would compare against
>   single repo Git, as having such a large repository seemed
>   complicated.
> * the support for submodules in Git is already there, though
>   not polished. The positive side is that there is already a good
>   base; the negative side is that it has to support current use cases.
> 
> Stefan



Re: [RFC] Add support for downloading blobs on demand

2017-01-17 Thread Stefan Beller
On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick  wrote:
> On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
>> While large files can be a real problem, our biggest issue
>> today is having a lot (millions!) of source files when
>> any individual developer only needs a small percentage of
>> them.  Git with 3+ million local files just doesn't
>> perform well.
>
> Honestly, this sounds like a problem better dealt with by
> using git subtree or git submodules, have you considered
> that?
>
> -Martin
>

I cannot speak for subtrees as I have very little knowledge on them.
But there you also have the problem that *someone* has to have a
whole tree? (e.g. the build bot)

Submodules, however, come with a couple of things attached, both
positive and negative points:

* it offers ACLs along the way. ($user may not be allowed to
  clone all submodules, but only those related to the work)
* The conceptual understanding of git just got a lot harder.
  ("Yo dawg, I heard you like git, so I put git repos inside
  other git repos"), it is not easy to come up with reasonable
  defaults for all usecases, so the everyday user still has to
  have some understanding of submodules.
* typical cheap in-tree operations may become very expensive:
  -> moving a file from one location to another (in another
 submodule) adds overhead, no rename detection.
* We are actively working on submodules, so there is
  some momentum going already.
* our experiments with Android show that e.g. fetching (even
  if you have all of Android) becomes a lot faster for everyday
  usage, as only a few repositories change each day. This
  comparison was against the repo tool that we currently
  use for Android. I do not know how it would compare against
  single repo Git, as having such a large repository seemed
  complicated.
* the support for submodules in Git is already there, though
  not polished. The positive side is that there is already a good
  base; the negative side is that it has to support current use cases.

Stefan


Re: [RFC] Add support for downloading blobs on demand

2017-01-17 Thread Martin Fick
On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> While large files can be a real problem, our biggest issue
> today is having a lot (millions!) of source files when
> any individual developer only needs a small percentage of
> them.  Git with 3+ million local files just doesn't
> perform well.

Honestly, this sounds like a problem better dealt with by 
using git subtree or git submodules, have you considered 
that?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation



RE: [RFC] Add support for downloading blobs on demand

2017-01-17 Thread Ben Peart
Thanks for the encouragement, support, and good ideas to look into.

Ben

> -Original Message-
> From: Shawn Pearce [mailto:spea...@spearce.org]
> Sent: Friday, January 13, 2017 4:07 PM
> To: Ben Peart <peart...@gmail.com>
> Cc: git <git@vger.kernel.org>; benpe...@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart <peart...@gmail.com> wrote:
> >
> > Goal
> > 
> >
> > To be able to better handle repos with many files that any individual
> > developer doesn’t need it would be nice if clone/fetch only brought
> > down those files that were actually needed.
> >
> > To enable that, we are proposing adding a flag to clone/fetch that
> > will instruct the server to limit the objects it sends to commits and
> > trees and to not send any blobs.
> >
> > When git performs an operation that requires a blob that isn’t
> > currently available locally, it will download the missing blob and add
> > it to the local object store.
> 
> Interesting. This is also an area I want to work on with my team at $DAY_JOB.
> Repositories are growing along multiple dimensions, and developers or
> editors don't always need all blobs for all time available locally to 
> successfully
> perform their work.
> 
> > Design
> > ~~
> >
> > Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> > here) similar to “--depth” that instructs the server to only return
> > commits and trees and to ignore blobs.
> 
> My group at $DAY_JOB hasn't talked about it yet, but I want to add a
> protocol capability that lets clone/fetch ask only for blobs smaller than a
> specified byte count. This could be set to a reasonable text file size (e.g. 
> <= 5
> MiB) to predominately download only source files and text documentation,
> omitting larger binaries.
> 
> If the limit was set to 0, it's the same as your idea to ignore all blobs.
> 

This is an interesting idea that may be an easier way to help mitigate 
the cost of very large files.  While our primary issue today is the 
sheer number of files, I'm sure at some point we'll run into issues with 
file size as well.  

> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Right. I'd like to have this object retrieval be inside the native Git wire
> protocol, reusing the remote configuration and authentication setup. That
> requires expanding the server side of the protocol implementation slightly
> allowing any reachable object to be retrieved by SHA-1 alone. Bitmap indexes
> can significantly reduce the computational complexity for the server.
> 

Agree.  

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> This ... sounds risky for the developer, as the repository may be corrupt due
> to a missing object, and the user cannot determine it.
> 
> Would it be reasonable for the server to return a list of SHA-1s it knows
> should exist, but has omitted due to the blob threshold (above), and the
> local repository store this in a binary searchable file?
> During connectivity checking its assumed OK if an object is not present in the
> object store, but is listed in this omitted objects file.
> 

Corrupt repos due to missing blobs must be pretty rare, as I've never
seen anyone report that error, but for this and other reasons (see Peff's
suggestion on how to minimize downloading unnecessary blobs) having this
data could be valuable.  I'll add it to the list of things to look into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint “objects/” can be used to retrieve the individual
> > missing blobs when needed.
> 
> I'd prefer this to be in the native wire protocol, where the objects are in 
> pack
> format (which unfortunately differs from loose format). I assume servers
> would combine many objects into pack files, potentially isolating large
> uncompressable binaries into their own packs, stored separately from
> commits/trees/small-text-blobs.
> 
> I get the value of this being in HTTP, where HTTP caching inside proxies can
> be leveraged to reduce master server load. I wonder if the native wire
> protocol could be taught to 

RE: [RFC] Add support for downloading blobs on demand

2017-01-17 Thread Ben Peart
Thanks for the thoughtful response.  No need to apologize for the
length; it's a tough problem to solve, so I don't expect it to be handled
with a single, short email. :)

> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart <peart...@gmail.com>
> Cc: git@vger.kernel.org; Ben Peart <ben.pe...@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> This is an issue I've thought a lot about. So apologies in advance that this
> response turned out a bit long. :)
> 
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> 
> > Design
> > ~~
> >
> > Clone and fetch will pass a "--lazy-clone" flag (open to a better name
> > here) similar to "--depth" that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
> 
>   http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> 
> This is a similar approach, though I pushed the policy for "how do you get the
> objects" out into an external script. One advantage there is that large 
> objects
> could easily be fetched from another source entirely (e.g., S3 or equivalent)
> rather than the repo itself.
> 
> The downside is that it makes things more complicated, because a push or a
> fetch now involves three parties (server, client, and the alternate object
> store). So questions like "do I have all the objects I need" are hard to 
> reason
> about.
> 
> If you assume that there's going to be _some_ central Git repo which has all
> of the objects, you might as well fetch from there (and do it over normal git
> protocols). And that simplifies things a bit, at the cost of being less 
> flexible.
> 

We looked quite a bit at the external odb patches, as well as lfs and 
even using alternates.  They all share a common downside that you must 
maintain a separate service that contains _some_ of the files.  These 
files must also be versioned, replicated, backed up and the service 
itself scaled out to handle the load.  As you mentioned, having multiple 
services involved increases flexibility but it also increases the
complexity and decreases the reliability of the overall version control 
service.  

For operational simplicity, we opted to go with a design that uses a 
single, central git repo which has _all_ the objects and to focus on 
enhancing it to handle large numbers of files efficiently.  This allows 
us to focus our efforts on a great git service and to avoid having to 
build out these other services.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> Actually, Git is pretty good about trying not to access blobs when it doesn't
> need to. The important thing is that you know enough about the blobs to
> fulfill has_sha1_file() and sha1_object_info() requests without actually
> fetching the data.
> 
> So the client definitely needs to have some list of which objects exist, and
> which it _could_ get if it needed to.
> 
> The one place you'd probably want to tweak things is in the diff code, as a
> single "git log -Sfoo" would fault in all of the blobs.
> 

It is an interesting idea to explore how we could be smarter about 
preventing blobs from faulting in if we had enough info to fulfill 
has_sha1_file() and sha1_object_info().  Given we also heavily prune the 
working directory using sparse-checkout, this hasn't been our top focus 
but it is certainly something worth looking into.
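
For anyone not familiar with it, the stock sparse-checkout mechanism is
enough to get the pruning effect described above; a minimal sketch (the
paths are only examples):

  git config core.sparseCheckout true
  printf '%s\n' '/src/my-component/' '/docs/' > .git/info/sparse-checkout
  git read-tree -mu HEAD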

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint "objects/" can be used to retrieve the individual
> > missing blobs when needed.
> 
> This is going to behave badly on well-packed repositories, because there isn't
> a good way to fetch a single object. The best case (which is not
> implemented at all in Git) is that you grab the pack .idx, then grab
> "slices" of the pack corresponding to specific objects, including
> hunting down delta bases.

Re: [RFC] Add support for downloading blobs on demand

2017-01-17 Thread Jeff King
This is an issue I've thought a lot about. So apologies in advance that
this response turned out a bit long. :)

On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:

> Design
> ~~
> 
> Clone and fetch will pass a "--lazy-clone" flag (open to a better name
> here) similar to "--depth" that instructs the server to only return
> commits and trees and to ignore blobs.
> 
> Later during git operations like checkout, when a blob cannot be found
> after checking all the regular places (loose, pack, alternates, etc), 
> git will download the missing object and place it into the local object 
> store (currently as a loose object) then resume the operation.

Have you looked at the "external odb" patches I wrote a while ago, and
which Christian has been trying to resurrect?

  http://public-inbox.org/git/20161130210420.15982-1-chrisc...@tuxfamily.org/

This is a similar approach, though I pushed the policy for "how do you
get the objects" out into an external script. One advantage there is
that large objects could easily be fetched from another source entirely
(e.g., S3 or equivalent) rather than the repo itself.

The downside is that it makes things more complicated, because a push or
a fetch now involves three parties (server, client, and the alternate
object store). So questions like "do I have all the objects I need" are
hard to reason about.

If you assume that there's going to be _some_ central Git repo which has
all of the objects, you might as well fetch from there (and do it over
normal git protocols). And that simplifies things a bit, at the cost of
being less flexible.

> To prevent git from accidentally downloading all missing blobs, some git
> operations are updated to be aware of the potential for missing blobs.  
> The most obvious being check_connected which will return success as if 
> everything in the requested commits is available locally.

Actually, Git is pretty good about trying not to access blobs when it
doesn't need to. The important thing is that you know enough about the
blobs to fulfill has_sha1_file() and sha1_object_info() requests without
actually fetching the data.

So the client definitely needs to have some list of which objects exist,
and which it _could_ get if it needed to.
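
In plumbing terms, the queries that need to be answerable without
fetching any blob contents are essentially what these commands expose
(just an illustration, not the proposed interface):

  git cat-file -e $sha1       # does the object exist?  (has_sha1_file)
  git cat-file -s $sha1       # how big is it?          (sha1_object_info)
  git cat-file --batch-check  # the same queries in bulk, one sha1 per line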

The one place you'd probably want to tweak things is in the diff code,
as a single "git log -Sfoo" would fault in all of the blobs.

> To minimize the impact on the server, the existing dumb HTTP protocol 
> endpoint "objects/" can be used to retrieve the individual missing
> blobs when needed.

This is going to behave badly on well-packed repositories, because there
isn't a good way to fetch a single object. The best case (which is not
implemented at all in Git) is that you grab the pack .idx, then grab
"slices" of the pack corresponding to specific objects, including
hunting down delta bases.

But then next time the server repacks, you have to throw away your .idx
file. And those can be big. The .idx for linux.git is 135MB. You really
wouldn't want to do an incremental fetch of 1MB worth of objects and
have to grab the whole .idx just to figure out which bytes you needed.

You can solve this by replacing the dumb-http server with a smart one
that actually serves up the individual objects as if they were truly
sitting on the filesystem. But then you haven't really minimized impact
on the server, and you might as well teach the smart protocols to do
blob fetches.
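
For reference, a single loose-object fetch over the dumb protocol is
just an HTTP GET of the zlib-deflated loose object (the URL below is
made up), and it only works if the server actually has the object loose:

  sha1=0123456789abcdef0123456789abcdef01234567
  mkdir -p ".git/objects/${sha1:0:2}"
  curl -o ".git/objects/${sha1:0:2}/${sha1:2}" \
       "https://git.example.com/repo.git/objects/${sha1:0:2}/${sha1:2}"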


One big hurdle to this approach, no matter the protocol, is how you are
going to handle deltas. Right now, a git client tells the server "I have
this commit, but I want this other one". And the server knows which
objects the client has from the first, and which it needs from the
second. Moreover, it knows that it can send objects in delta form
directly from disk if the other side has the delta base.

So what happens in this system? We know we don't need to send any blobs
in a regular fetch, because the whole idea is that we only send blobs on
demand. So we wait for the client to ask us for blob A. But then what do
we send? If we send the whole blob without deltas, we're going to waste
a lot of bandwidth.

The on-disk size of all of the blobs in linux.git is ~500MB. The actual
data size is ~48GB. Some of that is from zlib, which you get even for
non-deltas. But the rest of it is from the delta compression. I don't
think it's feasible to give that up, at least not for "normal" source
repos like linux.git (more on that in a minute).

So ideally you do want to send deltas. But how do you know which objects
the other side already has, which you can use as a delta base? Sending
the list of "here are the blobs I have" doesn't scale. Just the sha1s
start to add up, especially when you are doing incremental fetches.
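To put a rough number on it: with the ~3 million files mentioned earlier
in this thread, even the raw 20-byte SHA-1s come to roughly 60MB per
negotiation, before accounting for growth over time.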

I think this sort of thing performs a lot better when you just focus on
large objects. Because they don't tend to delta well anyway, and the
savings are much bigger by avoiding ones you don't want. So a directive
like "don't 

Re: [RFC] Add support for downloading blobs on demand

2017-01-13 Thread Shawn Pearce
On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart  wrote:
>
> Goal
> 
>
> To be able to better handle repos with many files that any individual
> developer doesn’t need it would be nice if clone/fetch only brought down
> those files that were actually needed.
>
> To enable that, we are proposing adding a flag to clone/fetch that will
> instruct the server to limit the objects it sends to commits and trees
> and to not send any blobs.
>
> When git performs an operation that requires a blob that isn’t currently
> available locally, it will download the missing blob and add it to the
> local object store.

Interesting. This is also an area I want to work on with my team at
$DAY_JOB. Repositories are growing along multiple dimensions, and
developers or editors don't always need all blobs for all time
available locally to successfully perform their work.

> Design
> ~~
>
> Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> here) similar to “--depth” that instructs the server to only return
> commits and trees and to ignore blobs.

My group at $DAY_JOB hasn't talked about it yet, but I want to add a
protocol capability that lets clone/fetch ask only for blobs smaller
than a specified byte count. This could be set to a reasonable text
file size (e.g. <= 5 MiB) to predominately download only source files
and text documentation, omitting larger binaries.

If the limit was set to 0, it's the same as your idea to ignore all blobs.
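
A quick way to see how much such a threshold would cut out of an
existing repository (5 MiB here is just an example cutoff):

  git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" && $3 > 5*1024*1024'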

> Later during git operations like checkout, when a blob cannot be found
> after checking all the regular places (loose, pack, alternates, etc),
> git will download the missing object and place it into the local object
> store (currently as a loose object) then resume the operation.

Right. I'd like to have this object retrieval be inside the native Git
wire protocol, reusing the remote configuration and authentication
setup. That requires expanding the server side of the protocol
implementation slightly allowing any reachable object to be retrieved
by SHA-1 alone. Bitmap indexes can significantly reduce the
computational complexity for the server.
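
For servers that do not have them yet, reachability bitmaps are only a
repack away, e.g.:

  git repack -a -d --write-bitmap-index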

> To prevent git from accidentally downloading all missing blobs, some git
> operations are updated to be aware of the potential for missing blobs.
> The most obvious being check_connected which will return success as if
> everything in the requested commits is available locally.

This ... sounds risky for the developer, as the repository may be
corrupt due to a missing object, and the user cannot determine it.

Would it be reasonable for the server to return a list of SHA-1s it
knows should exist, but has omitted due to the blob threshold (above),
and the local repository store this in a binary searchable file?
During connectivity checking its assumed OK if an object is not
present in the object store, but is listed in this omitted objects
file.
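
A minimal sketch of the client side of that idea (the file name and the
use of look(1) are purely illustrative): keep the omitted SHA-1s sorted
in a single file and binary-search it before treating a missing object
as corruption:

  sort omitted-sha1s.txt > .git/omitted-objects
  # during connectivity checking, a missing object is OK if it is listed:
  look "$sha1" .git/omitted-objects && echo "omitted by server, not corrupt"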

> To minimize the impact on the server, the existing dumb HTTP protocol
> endpoint “objects/” can be used to retrieve the individual missing
> blobs when needed.

I'd prefer this to be in the native wire protocol, where the objects
are in pack format (which unfortunately differs from loose format). I
assume servers would combine many objects into pack files, potentially
isolating large uncompressable binaries into their own packs, stored
separately from commits/trees/small-text-blobs.

I get the value of this being in HTTP, where HTTP caching inside
proxies can be leveraged to reduce master server load. I wonder if the
native wire protocol could be taught to use a variation of an HTTP GET
that includes the object SHA-1 in the URL line, to retrieve a
one-object pack file.

> Performance considerations
> ~~
>
> We found that downloading commits and trees on demand had a significant
> negative performance impact.  In addition, many git commands assume all
> commits and trees are available locally so they quickly got pulled down
> anyway.  Even in very large repos the commits and trees are relatively
> small so bringing them down with the initial commit and subsequent fetch
> commands was reasonable.
>
> After cloning, the developer can use sparse-checkout to limit the set of
> files to the subset they need (typically only 1-10% in these large
> repos).  This allows the initial checkout to only download the set of
> files actually needed to complete their task.  At any point, the
> sparse-checkout file can be updated to include additional files which
> will be fetched transparently on demand.
>
> Typical source files are relatively small so the overhead of connecting
> and authenticating to the server for a single file at a time is
> substantial.  As a result, having a long running process that is started
> with the first request and can cache connection information between
> requests is a significant performance win.

Junio and I talked years ago (offline, sorry no mailing list archive)
about "narrow checkout", which is the idea of the