RE: [RFC] Add support for downloading blobs on demand
I've completed the work of switching our read_object proposal to use a background process (refactored from the LFS code) and have extricated it from the rest of our GVFS fork so that it can be examined/tested separately. It is currently based on a Git for Windows fork that I've pushed to GitHub for anyone who is interested in viewing it:

https://github.com/benpeart/git/tree/read-object-process

After some additional conversations with Christian, we're working to combine our RFC/patch series into a single solution that should meet the requirements of both. The combined solution needs an "info" function that requests info about a single object, rather than a "have" function that must return information on all objects the ODB knows, as the latter doesn't scale when the number of objects is large. That means the "info" call has to be fast, so spawning a process on every call won't work. A background process with a versioned interface that allows you to negotiate capabilities should solve this problem.

Ben

> -----Original Message-----
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Tuesday, February 7, 2017 1:21 PM
> To: 'Christian Couder'
> Cc: 'Jeff King'; 'git'; 'Johannes Schindelin'; Ben Peart
> Subject: RE: [RFC] Add support for downloading blobs on demand
>
> No worries about a late response, I'm sure this is the start of a long
> conversation. :)
>
> > -----Original Message-----
> > From: Christian Couder [mailto:christian.cou...@gmail.com]
> > Sent: Sunday, February 5, 2017 9:04 AM
> > To: Ben Peart
> > Cc: Jeff King; git; Johannes Schindelin
> > Subject: Re: [RFC] Add support for downloading blobs on demand
> >
> > (Sorry for the late reply and thanks to Dscho for pointing me to this
> > thread.)
> >
> > On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart wrote:
> > >> From: Jeff King [mailto:p...@peff.net]
> > >> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> > >>
> > >> > Clone and fetch will pass a --lazy-clone flag (open to a better
> > >> > name here) similar to --depth that instructs the server to only
> > >> > return commits and trees and to ignore blobs.
> > >> >
> > >> > Later during git operations like checkout, when a blob cannot be
> > >> > found after checking all the regular places (loose, pack,
> > >> > alternates, etc), git will download the missing object and place
> > >> > it into the local object store (currently as a loose object) then
> > >> > resume the operation.
> > >>
> > >> Have you looked at the "external odb" patches I wrote a while ago,
> > >> and which Christian has been trying to resurrect?
> > >>
> > >> http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> > >>
> > >> This is a similar approach, though I pushed the policy for "how do
> > >> you get the objects" out into an external script. One advantage
> > >> there is that large objects could easily be fetched from another
> > >> source entirely (e.g., S3 or equivalent) rather than the repo itself.
> > >>
> > >> The downside is that it makes things more complicated, because a
> > >> push or a fetch now involves three parties (server, client, and the
> > >> alternate object store). So questions like "do I have all the
> > >> objects I need" are hard to reason about.
> > >>
> > >> If you assume that there's going to be _some_ central Git repo
> > >> which has all of the objects, you might as well fetch from there
> > >> (and do it over normal git protocols). And that simplifies things a
> > >> bit, at the cost of being less flexible.
> > >
> > > We looked quite a bit at the external odb patches, as well as lfs
> > > and even using alternates. They all share a common downside that
> > > you must maintain a separate service that contains
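The versioned, capability-negotiating interface described above closely resembles Git's pkt-line based long-running process protocol (the same shape the LFS filter process uses, which this work was refactored from). The sketch below models what such a handshake could look like; the "git-read-object" names, version numbers, and "get" capability here are illustrative assumptions, not the agreed interface.

```python
# Hypothetical sketch of the versioned handshake a long-running
# read-object helper might use, modeled on Git's pkt-line based
# long-running process protocol. All names and capabilities here
# are assumptions for illustration.

def pkt_line(payload: str) -> bytes:
    """Encode one pkt-line: 4 hex digits of total length, then payload."""
    data = payload.encode() + b"\n"
    return b"%04x" % (len(data) + 4) + data

FLUSH = b"0000"  # the pkt-line flush packet

def client_hello(supported_versions=(2, 1)):
    """Lines the client would write to the helper's stdin on startup."""
    lines = [pkt_line("git-read-object-client")]
    lines += [pkt_line("version=%d" % v) for v in supported_versions]
    return lines + [FLUSH]

def helper_response(client_lines, helper_versions={1}, caps=("get",)):
    """Pick the highest mutually supported version, then advertise caps."""
    offered = {int(l[4:].decode().split("=")[1])
               for l in client_lines if b"version=" in l}
    common = offered & helper_versions
    if not common:
        raise RuntimeError("no common protocol version")
    version = max(common)
    out = [pkt_line("git-read-object-server"),
           pkt_line("version=%d" % version)]
    out += [pkt_line("capability=%s" % c) for c in caps]
    return version, out + [FLUSH]
```

Because versions are negotiated once at startup, per-object "info" requests afterwards avoid any process-spawn cost, which is the point Ben makes above.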
RE: [RFC] Add support for downloading blobs on demand
Thanks Jakub. Just so you are aware, this isn't a separate effort; it actually is the same effort as the GVFS effort from Microsoft. For pragmatic reasons, we implemented the lazy clone support and on-demand object downloading in our own codebase (GVFS) first and are now working to move it into git natively so that it will be available everywhere git is available. This RFC is just one step in that process.

As we mentioned at Git Merge, we looked into Mercurial but settled on Git as our version control solution. We are, however, in active communication with the team from Facebook to share ideas.

Ben

> -----Original Message-----
> From: Jakub Narębski [mailto:jna...@gmail.com]
> Sent: Tuesday, February 7, 2017 4:57 PM
> To: Ben Peart; 'Christian Couder'
> Cc: 'Jeff King'; 'git'; 'Johannes Schindelin'; Ben Peart
> Subject: Re: [RFC] Add support for downloading blobs on demand
>
> I'd like to point to two (or rather one and a half) solutions that I
> became aware of when watching the streaming of "Git Merge 2017"[0].
> There should be people here who were there; hopefully video of those
> presentations and slides / notes will soon be available.
>
> [0]: http://git-merge.com/
>
> The first tool that I'd like to point to is Git Virtual File System, or
> GVFS in short (which unfortunately shares its abbreviation with GNOME
> Virtual File System).
>
> The presentation was "Scaling Git at Microsoft" by Saeed Noursalehi,
> Microsoft. You can read about this solution in an ArsTechnica article[1]
> and on the Microsoft blog[2]. The code (or an early version thereof) is
> also available[3] - I wonder why on GitHub and not Codeplex...
>
> [1]: https://arstechnica.com/information-technology/2017/02/microsoft-hosts-the-windows-source-in-a-monstrous-300gb-git-repository/
> [2]: https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/
> [3]: https://github.com/Microsoft/GVFS
>
> The second presentation that might be of some interest is "Scaling
> Mercurial at Facebook: Insights from the Other Side" by Durham Goode,
> Facebook. The code is supposedly available as open source, though I
> don't know how useful their 'blob storage' solution would be for your
> problem.
>
> HTH
> --
> Jakub Narębski
Re: [RFC] Add support for downloading blobs on demand
I'd like to point to two (or rather one and a half) solutions that I became aware of when watching the streaming of "Git Merge 2017"[0]. There should be people here who were there; hopefully video of those presentations and slides / notes will soon be available.

[0]: http://git-merge.com/

The first tool that I'd like to point to is Git Virtual File System, or GVFS in short (which unfortunately shares its abbreviation with GNOME Virtual File System).

The presentation was "Scaling Git at Microsoft" by Saeed Noursalehi, Microsoft. You can read about this solution in an ArsTechnica article[1] and on the Microsoft blog[2]. The code (or an early version thereof) is also available[3] - I wonder why on GitHub and not Codeplex...

[1]: https://arstechnica.com/information-technology/2017/02/microsoft-hosts-the-windows-source-in-a-monstrous-300gb-git-repository/
[2]: https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/
[3]: https://github.com/Microsoft/GVFS

The second presentation that might be of some interest is "Scaling Mercurial at Facebook: Insights from the Other Side" by Durham Goode, Facebook. The code is supposedly available as open source, though I don't know how useful their 'blob storage' solution would be for your problem.

HTH
--
Jakub Narębski
RE: [RFC] Add support for downloading blobs on demand
No worries about a late response, I'm sure this is the start of a long conversation. :)

> -----Original Message-----
> From: Christian Couder [mailto:christian.cou...@gmail.com]
> Sent: Sunday, February 5, 2017 9:04 AM
> To: Ben Peart
> Cc: Jeff King; git; Johannes Schindelin
> Subject: Re: [RFC] Add support for downloading blobs on demand
>
> (Sorry for the late reply and thanks to Dscho for pointing me to this
> thread.)
>
> On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart wrote:
> >> From: Jeff King [mailto:p...@peff.net]
> >> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> >>
> >> > Clone and fetch will pass a --lazy-clone flag (open to a better
> >> > name here) similar to --depth that instructs the server to only
> >> > return commits and trees and to ignore blobs.
> >> >
> >> > Later during git operations like checkout, when a blob cannot be
> >> > found after checking all the regular places (loose, pack,
> >> > alternates, etc), git will download the missing object and place it
> >> > into the local object store (currently as a loose object) then
> >> > resume the operation.
> >>
> >> Have you looked at the "external odb" patches I wrote a while ago,
> >> and which Christian has been trying to resurrect?
> >>
> >> http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> >>
> >> This is a similar approach, though I pushed the policy for "how do
> >> you get the objects" out into an external script. One advantage there
> >> is that large objects could easily be fetched from another source
> >> entirely (e.g., S3 or equivalent) rather than the repo itself.
> >>
> >> The downside is that it makes things more complicated, because a push
> >> or a fetch now involves three parties (server, client, and the
> >> alternate object store). So questions like "do I have all the objects
> >> I need" are hard to reason about.
> >>
> >> If you assume that there's going to be _some_ central Git repo which
> >> has all of the objects, you might as well fetch from there (and do it
> >> over normal git protocols). And that simplifies things a bit, at the
> >> cost of being less flexible.
> >
> > We looked quite a bit at the external odb patches, as well as lfs and
> > even using alternates. They all share a common downside that you must
> > maintain a separate service that contains _some_ of the files.
>
> Pushing the policy for "how do you get the objects" out into an external
> helper doesn't mean that the external helper cannot use the main
> service. The external helper is still free to do whatever it wants,
> including calling the main service if it thinks it's better.

That is a good point and you're correct; that means you can avoid having to build out multiple services.

> > These files must also be versioned, replicated, backed up and the
> > service itself scaled out to handle the load. As you mentioned, having
> > multiple services involved increases flexibility but it also increases
> > the complexity and decreases the reliability of the overall version
> > control service.
>
> About reliability, I think it depends a lot on the use case. If you want
> to get very big files over an unreliable connection, it can be better if
> you send those big files over a restartable protocol and service like
> HTTP/S on a regular web server.

My primary concern about reliability was the multiplicative effect of making multiple requests across multiple servers to complete a single request. Putting this all in a single service like you suggested above brings us back to parity on the complexity.

> > For operational simplicity, we opted to go with a design that uses a
> > single, central git repo which has _all_ the objects and to focus on
> > enhancing it to handle large numbers of files efficiently. This
> > allows us to focus our efforts on a great git service and to avoid
> > having to build out these other services.
>
> Ok, but I don't think it prevents you from using at least some of t
Re: [RFC] Add support for downloading blobs on demand
(Sorry for the late reply and thanks to Dscho for pointing me to this thread.)

On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart wrote:
>> From: Jeff King [mailto:p...@peff.net]
>> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
>>
>> > Clone and fetch will pass a --lazy-clone flag (open to a better name
>> > here) similar to --depth that instructs the server to only return
>> > commits and trees and to ignore blobs.
>> >
>> > Later during git operations like checkout, when a blob cannot be found
>> > after checking all the regular places (loose, pack, alternates, etc),
>> > git will download the missing object and place it into the local
>> > object store (currently as a loose object) then resume the operation.
>>
>> Have you looked at the "external odb" patches I wrote a while ago, and
>> which Christian has been trying to resurrect?
>>
>> http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
>>
>> This is a similar approach, though I pushed the policy for "how do you
>> get the objects" out into an external script. One advantage there is
>> that large objects could easily be fetched from another source entirely
>> (e.g., S3 or equivalent) rather than the repo itself.
>>
>> The downside is that it makes things more complicated, because a push or
>> a fetch now involves three parties (server, client, and the alternate
>> object store). So questions like "do I have all the objects I need" are
>> hard to reason about.
>>
>> If you assume that there's going to be _some_ central Git repo which has
>> all of the objects, you might as well fetch from there (and do it over
>> normal git protocols). And that simplifies things a bit, at the cost of
>> being less flexible.
>
> We looked quite a bit at the external odb patches, as well as lfs and
> even using alternates. They all share a common downside that you must
> maintain a separate service that contains _some_ of the files.

Pushing the policy for "how do you get the objects" out into an external helper doesn't mean that the external helper cannot use the main service. The external helper is still free to do whatever it wants, including calling the main service if it thinks it's better.

> These files must also be versioned, replicated, backed up and the service
> itself scaled out to handle the load. As you mentioned, having multiple
> services involved increases flexibility but it also increases the
> complexity and decreases the reliability of the overall version control
> service.

About reliability, I think it depends a lot on the use case. If you want to get very big files over an unreliable connection, it can be better if you send those big files over a restartable protocol and service like HTTP/S on a regular web server.

> For operational simplicity, we opted to go with a design that uses a
> single, central git repo which has _all_ the objects and to focus on
> enhancing it to handle large numbers of files efficiently. This allows
> us to focus our efforts on a great git service and to avoid having to
> build out these other services.

Ok, but I don't think it prevents you from using at least some of the same mechanisms that the external odb series is using. And reducing the number of mechanisms in Git itself is great for its maintainability and simplicity.

>> > To prevent git from accidentally downloading all missing blobs, some
>> > git operations are updated to be aware of the potential for missing
>> > blobs. The most obvious being check_connected which will return
>> > success as if everything in the requested commits is available locally.
>>
>> Actually, Git is pretty good about trying not to access blobs when it
>> doesn't need to. The important thing is that you know enough about the
>> blobs to fulfill has_sha1_file() and sha1_object_info() requests without
>> actually fetching the data.
>>
>> So the client definitely needs to have some list of which objects exist,
>> and which it _could_ get if it needed to.

Yeah, and the external odb series handles that already, thanks to Peff's initial work.

>> The one place you'd probably want to tweak things is in the diff code,
>> as a single "git log -Sfoo" would fault in all of the blobs.
>
> It is an interesting idea to explore how we could be smarter about
> preventing blobs from faulting in if we had enough info to fulfill
> has_sha1_file() and sha1_object_info(). Given we also heavily prune the
> working directory using sparse-checkout, this hasn't been our top focus
> but it is certainly something worth looking into.

The external odb series doesn't handle preventing blobs from faulting in yet, so this could be a common problem.

[...]

>> One big hurdle to
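Peff's point above is that the client only needs enough metadata to answer has_sha1_file() and sha1_object_info() style queries without fetching content. A hypothetical sketch of that idea follows; the index shape is an assumption (the server would have to advertise id, type, and size for each object it omitted):

```python
# Hypothetical sketch: answer "does this object exist?" and "what are
# its type and size?" without downloading the content. The index format
# is an assumption; the server would need to send (id, type, size) for
# each blob it omitted from the clone.

local_store = {}    # oid -> (type, content bytes) actually present
missing_index = {}  # oid -> (type, size) known to exist remotely

def record_omitted(oid, obj_type, size):
    """Remember an object the server advertised but did not send."""
    missing_index[oid] = (obj_type, size)

def has_object(oid):
    """Rough analogue of has_sha1_file(): true even for unfetched blobs."""
    return oid in local_store or oid in missing_index

def object_info(oid):
    """Rough analogue of sha1_object_info(): (type, size), no fault-in."""
    if oid in local_store:
        obj_type, content = local_store[oid]
        return (obj_type, len(content))
    return missing_index.get(oid)
```

With such an index, operations like the connectivity check or "git log -Sfoo" could consult sizes and existence without triggering a network fetch per blob.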
RE: [RFC] Add support for downloading blobs on demand
We actually pursued trying to make submodules work for some time and even built tooling around trying to work around some of the issues we ran into (not repo.py but along a similar line) before determining that we would be better served by having a single repo and solving the scale issues. I don't want to rehash the arguments for/against a single repo - suffice it to say, we have opted for a single large repo. :)

Thanks,

Ben

> -----Original Message-----
> From: Stefan Beller [mailto:sbel...@google.com]
> Sent: Tuesday, January 17, 2017 5:24 PM
> To: Martin Fick
> Cc: Ben Peart; Shawn Pearce; git; benpe...@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
>
> On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick wrote:
> > On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> >> While large files can be a real problem, our biggest issue today is
> >> having a lot (millions!) of source files when any individual
> >> developer only needs a small percentage of them. Git with 3+ million
> >> local files just doesn't perform well.
> >
> > Honestly, this sounds like a problem better dealt with by using git
> > subtree or git submodules, have you considered that?
> >
> > -Martin
>
> I cannot speak for subtrees as I have very little knowledge of them.
> But there you also have the problem that *someone* has to have the
> whole tree? (e.g. the build bot)
>
> submodules, however, come with a couple of things attached, both
> positive and negative points:
>
> * It offers ACLs along the way. ($user may not be allowed to
>   clone all submodules, but only those related to the work)
> * The conceptual understanding of git just got a lot harder.
>   ("Yo dawg, I heard you like git, so I put git repos inside
>   other git repos"); it is not easy to come up with reasonable
>   defaults for all use cases, so the everyday user still has to
>   have some understanding of submodules.
> * Typical cheap in-tree operations may become very expensive:
>   -> moving a file from one location to another (in another
>   submodule) adds overhead, no rename detection.
> * We are actively working on submodules, so there is
>   some momentum going already.
> * Our experiments with Android show that e.g. fetching (even
>   if you have all of Android) becomes a lot faster for everyday
>   usage, as only a few repositories change each day. This
>   comparison was against the repo tool that we currently
>   use for Android. I do not know how it would compare against
>   single-repo Git, as having such a large repository seemed
>   complicated.
> * The support for submodules in Git is already there, though
>   not polished. The positive side is to have already a good base;
>   the negative side is to have to support current use cases.
>
> Stefan
Re: [RFC] Add support for downloading blobs on demand
On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick wrote:
> On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
>> While large files can be a real problem, our biggest issue
>> today is having a lot (millions!) of source files when
>> any individual developer only needs a small percentage of
>> them. Git with 3+ million local files just doesn't
>> perform well.
>
> Honestly, this sounds like a problem better dealt with by
> using git subtree or git submodules, have you considered
> that?
>
> -Martin

I cannot speak for subtrees as I have very little knowledge of them. But there you also have the problem that *someone* has to have the whole tree? (e.g. the build bot)

submodules, however, come with a couple of things attached, both positive and negative points:

* It offers ACLs along the way. ($user may not be allowed to clone all submodules, but only those related to the work)
* The conceptual understanding of git just got a lot harder. ("Yo dawg, I heard you like git, so I put git repos inside other git repos"); it is not easy to come up with reasonable defaults for all use cases, so the everyday user still has to have some understanding of submodules.
* Typical cheap in-tree operations may become very expensive: -> moving a file from one location to another (in another submodule) adds overhead, no rename detection.
* We are actively working on submodules, so there is some momentum going already.
* Our experiments with Android show that e.g. fetching (even if you have all of Android) becomes a lot faster for everyday usage, as only a few repositories change each day. This comparison was against the repo tool that we currently use for Android. I do not know how it would compare against single-repo Git, as having such a large repository seemed complicated.
* The support for submodules in Git is already there, though not polished. The positive side is to have already a good base; the negative side is to have to support current use cases.

Stefan
Re: [RFC] Add support for downloading blobs on demand
On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> While large files can be a real problem, our biggest issue
> today is having a lot (millions!) of source files when
> any individual developer only needs a small percentage of
> them. Git with 3+ million local files just doesn't
> perform well.

Honestly, this sounds like a problem better dealt with by using git subtree or git submodules, have you considered that?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
RE: [RFC] Add support for downloading blobs on demand
Thanks for the encouragement, support, and good ideas to look into.

Ben

> -----Original Message-----
> From: Shawn Pearce [mailto:spea...@spearce.org]
> Sent: Friday, January 13, 2017 4:07 PM
> To: Ben Peart
> Cc: git; benpe...@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
>
> On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart wrote:
> >
> > Goal
> > ~~~~
> >
> > To be able to better handle repos with many files that any individual
> > developer doesn't need, it would be nice if clone/fetch only brought
> > down those files that were actually needed.
> >
> > To enable that, we are proposing adding a flag to clone/fetch that
> > will instruct the server to limit the objects it sends to commits and
> > trees and to not send any blobs.
> >
> > When git performs an operation that requires a blob that isn't
> > currently available locally, it will download the missing blob and
> > add it to the local object store.
>
> Interesting. This is also an area I want to work on with my team at
> $DAY_JOB. Repositories are growing along multiple dimensions, and
> developers or editors don't always need all blobs for all time available
> locally to successfully perform their work.
>
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a "--lazy-clone" flag (open to a better
> > name here) similar to "--depth" that instructs the server to only
> > return commits and trees and to ignore blobs.
>
> My group at $DAY_JOB hasn't talked about it yet, but I want to add a
> protocol capability that lets clone/fetch ask only for blobs smaller
> than a specified byte count. This could be set to a reasonable text
> file size (e.g. <= 5 MiB) to predominately download only source files
> and text documentation, omitting larger binaries.
>
> If the limit was set to 0, it's the same as your idea to ignore all
> blobs.

This is an interesting idea that may be an easier way to help mitigate the cost of very large files. While our primary issue today is the sheer number of files, I'm sure at some point we'll run into issues with file size as well.

> > Later during git operations like checkout, when a blob cannot be
> > found after checking all the regular places (loose, pack, alternates,
> > etc), git will download the missing object and place it into the
> > local object store (currently as a loose object) then resume the
> > operation.
>
> Right. I'd like to have this object retrieval be inside the native Git
> wire protocol, reusing the remote configuration and authentication
> setup. That requires expanding the server side of the protocol
> implementation slightly, allowing any reachable object to be retrieved
> by SHA-1 alone. Bitmap indexes can significantly reduce the
> computational complexity for the server.

Agree.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing
> > blobs. The most obvious being check_connected which will return
> > success as if everything in the requested commits is available
> > locally.
>
> This ... sounds risky for the developer, as the repository may be
> corrupt due to a missing object, and the user cannot determine it.
>
> Would it be reasonable for the server to return a list of SHA-1s it
> knows should exist, but has omitted due to the blob threshold (above),
> and the local repository store this in a binary searchable file?
> During connectivity checking it's assumed OK if an object is not
> present in the object store, but is listed in this omitted objects
> file.

Corrupt repos due to missing blobs must be pretty rare as I've never seen anyone report that error, but for this and other reasons (see Peff's suggestion on how to minimize downloading unnecessary blobs) having this data could be valuable. I'll add it to the list of things to look into.
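Shawn's "binary searchable file" of omitted object ids could be as simple as sorted, fixed-width 20-byte raw SHA-1 records, so each lookup during connectivity checking is a binary search rather than a scan or a large in-memory set. A sketch under that assumed layout (the on-disk format is a guess, not a proposal from the thread):

```python
# Sketch of an omitted-objects file as sorted, fixed-width raw SHA-1
# records (20 bytes each). The layout is an assumption; the point is
# only that membership tests become O(log n) binary searches.

RECORD = 20  # width of one raw SHA-1 record in bytes

def write_omitted(sha1_hexes):
    """Serialize the omitted-object ids as sorted raw records."""
    return b"".join(sorted(bytes.fromhex(h) for h in sha1_hexes))

def is_omitted(data, sha1_hex):
    """Binary-search the record blob for a single object id."""
    needle = bytes.fromhex(sha1_hex)
    lo, hi = 0, len(data) // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        if data[mid * RECORD:(mid + 1) * RECORD] < needle:
            lo = mid + 1
        else:
            hi = mid
    return data[lo * RECORD:(lo + 1) * RECORD] == needle
```

During check_connected, a missing object would then be treated as acceptable exactly when is_omitted() returns true for it.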
> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint "objects/<sha1>" can be used to retrieve the individual
> > missing blobs when needed.
>
> I'd prefer this to be in the native wire protocol, where the objects
> are in pack format (which unfortunately differs from loose format). I
> assume servers would combine many objects into pack files, potentially
> isolating large uncompressable binaries into their own packs, stored
> separately from commits/trees/small-text-blobs.
>
> I get the value of this being in HTTP, where HTTP caching inside
> proxies can be leveraged to reduce master server load. I wonder if the
> native wire protocol could be taught to use a variation of an HTTP
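For reference, fetching one blob through the dumb HTTP endpoint amounts to requesting the object's loose path (first two hex digits as a fan-out directory), inflating the result, and checking that the "<type> <size>\0" header plus payload hashes back to the requested id. A sketch of just those client-side mechanics; the HTTP request itself is deliberately left out:

```python
import hashlib
import zlib

# Sketch of the client-side mechanics behind the dumb HTTP
# "objects/<sha1>" endpoint: map an id to its loose-object path, then
# inflate and verify what came back. The network fetch is stubbed out.

def loose_path(sha1_hex):
    """Fan-out path used for loose objects: objects/ab/cdef..."""
    return "objects/%s/%s" % (sha1_hex[:2], sha1_hex[2:])

def make_loose(obj_type, payload):
    """Build a loose object as a server stores it; returns (id, bytes)."""
    raw = b"%s %d\x00" % (obj_type.encode(), len(payload)) + payload
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)

def verify_loose(sha1_hex, compressed):
    """Inflate a downloaded object and confirm it hashes to the asked id."""
    return hashlib.sha1(zlib.decompress(compressed)).hexdigest() == sha1_hex
```

Shawn's objection above is about exactly this path: loose objects are individually zlib-compressed but never delta-compressed, which is what the pack-format discussion later in the thread addresses.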
RE: [RFC] Add support for downloading blobs on demand
Thanks for the thoughtful response. No need to apologize for the length; it's a tough problem to solve, so I don't expect it to be handled with a single, short email. :)

> -----Original Message-----
> From: Jeff King [mailto:p...@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart
> Cc: git@vger.kernel.org; Ben Peart
> Subject: Re: [RFC] Add support for downloading blobs on demand
>
> This is an issue I've thought a lot about. So apologies in advance that
> this response turned out a bit long. :)
>
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
>
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a --lazy-clone flag (open to a better name
> > here) similar to --depth that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
>
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
>
> http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
>
> This is a similar approach, though I pushed the policy for "how do you
> get the objects" out into an external script. One advantage there is
> that large objects could easily be fetched from another source entirely
> (e.g., S3 or equivalent) rather than the repo itself.
>
> The downside is that it makes things more complicated, because a push
> or a fetch now involves three parties (server, client, and the
> alternate object store). So questions like "do I have all the objects I
> need" are hard to reason about.
>
> If you assume that there's going to be _some_ central Git repo which
> has all of the objects, you might as well fetch from there (and do it
> over normal git protocols). And that simplifies things a bit, at the
> cost of being less flexible.

We looked quite a bit at the external odb patches, as well as lfs and even using alternates. They all share a common downside that you must maintain a separate service that contains _some_ of the files. These files must also be versioned, replicated, backed up and the service itself scaled out to handle the load. As you mentioned, having multiple services involved increases flexibility but it also increases the complexity and decreases the reliability of the overall version control service.

For operational simplicity, we opted to go with a design that uses a single, central git repo which has _all_ the objects and to focus on enhancing it to handle large numbers of files efficiently. This allows us to focus our efforts on a great git service and to avoid having to build out these other services.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing
> > blobs. The most obvious being check_connected which will return
> > success as if everything in the requested commits is available
> > locally.
>
> Actually, Git is pretty good about trying not to access blobs when it
> doesn't need to. The important thing is that you know enough about the
> blobs to fulfill has_sha1_file() and sha1_object_info() requests
> without actually fetching the data.
>
> So the client definitely needs to have some list of which objects
> exist, and which it _could_ get if it needed to.
>
> The one place you'd probably want to tweak things is in the diff code,
> as a single "git log -Sfoo" would fault in all of the blobs.

It is an interesting idea to explore how we could be smarter about preventing blobs from faulting in if we had enough info to fulfill has_sha1_file() and sha1_object_info(). Given we also heavily prune the working directory using sparse-checkout, this hasn't been our top focus but it is certainly something worth looking into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint objects/<sha1> can be used to retrieve the individual
> > missing blobs when needed.
>
> This is going to behave badly on well-packed repositories, because
> there isn't a good way to fetch a si
Re: [RFC] Add support for downloading blobs on demand
This is an issue I've thought a lot about. So apologies in advance that this response turned out a bit long. :) On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote: > Design > ~~ > > Clone and fetch will pass a �--lazy-clone� flag (open to a better name > here) similar to �--depth� that instructs the server to only return > commits and trees and to ignore blobs. > > Later during git operations like checkout, when a blob cannot be found > after checking all the regular places (loose, pack, alternates, etc), > git will download the missing object and place it into the local object > store (currently as a loose object) then resume the operation. Have you looked at the "external odb" patches I wrote a while ago, and which Christian has been trying to resurrect? http://public-inbox.org/git/20161130210420.15982-1-chrisc...@tuxfamily.org/ This is a similar approach, though I pushed the policy for "how do you get the objects" out into an external script. One advantage there is that large objects could easily be fetched from another source entirely (e.g., S3 or equivalent) rather than the repo itself. The downside is that it makes things more complicated, because a push or a fetch now involves three parties (server, client, and the alternate object store). So questions like "do I have all the objects I need" are hard to reason about. If you assume that there's going to be _some_ central Git repo which has all of the objects, you might as well fetch from there (and do it over normal git protocols). And that simplifies things a bit, at the cost of being less flexible. > To prevent git from accidentally downloading all missing blobs, some git > operations are updated to be aware of the potential for missing blobs. > The most obvious being check_connected which will return success as if > everything in the requested commits is available locally. Actually, Git is pretty good about trying not to access blobs when it doesn't need to. 
The important thing is that you know enough about the blobs to fulfill
has_sha1_file() and sha1_object_info() requests without actually fetching
the data. So the client definitely needs to have some list of which
objects exist, and which it _could_ get if it needed to.

The one place you'd probably want to tweak things is in the diff code, as
a single "git log -Sfoo" would fault in all of the blobs.

> To minimize the impact on the server, the existing dumb HTTP protocol
> endpoint "objects/" can be used to retrieve the individual missing
> blobs when needed.

This is going to behave badly on well-packed repositories, because there
isn't a good way to fetch a single object. The best case (which is not
implemented at all in Git) is that you grab the pack .idx, then grab
"slices" of the pack corresponding to specific objects, including hunting
down delta bases.

But then next time the server repacks, you have to throw away your .idx
file. And those can be big. The .idx for linux.git is 135MB. You really
wouldn't want to do an incremental fetch of 1MB worth of objects and have
to grab the whole .idx just to figure out which bytes you needed.

You can solve this by replacing the dumb-http server with a smart one
that actually serves up the individual objects as if they were truly
sitting on the filesystem. But then you haven't really minimized impact
on the server, and you might as well teach the smart protocols to do blob
fetches.

One big hurdle to this approach, no matter the protocol, is how you are
going to handle deltas. Right now, a git client tells the server "I have
this commit, but I want this other one". And the server knows which
objects the client has from the first, and which it needs from the
second. Moreover, it knows that it can send objects in delta form
directly from disk if the other side has the delta base.

So what happens in this system?
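Peff's point that the client must answer has_sha1_file()- and
sha1_object_info()-style queries without fetching implies some locally
cached manifest of the omitted objects. A minimal sketch, assuming a
purely hypothetical manifest format of oid -> size:

```python
# Sketch: answer object-existence and object-size queries from a local
# manifest of omitted blobs, with no network traffic. The manifest
# format (oid -> remote size) is a hypothetical assumption, not a real
# Git on-disk format.

local_objects = {"aaaa": b"tracked text"}            # objects we have
omitted_manifest = {"bbbb": 4096, "cccc": 1 << 30}   # oid -> size on server

def has_object(oid):
    """has_sha1_file()-style query: present locally or known remotely."""
    return oid in local_objects or oid in omitted_manifest

def object_size(oid):
    """sha1_object_info()-style query: size without the data."""
    if oid in local_objects:
        return len(local_objects[oid])
    if oid in omitted_manifest:
        return omitted_manifest[oid]
    raise KeyError(oid)

assert has_object("bbbb")              # exists remotely, no fetch needed
assert object_size("cccc") == 1 << 30  # size known without the 1 GiB blob
assert not has_object("dddd")          # genuinely unknown object
```

This is exactly why "git log -Sfoo" is the hard case: it needs blob
contents, not just existence and size, so no manifest can save it from
faulting in the data.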
We know we don't need to send any blobs in a regular fetch, because the
whole idea is that we only send blobs on demand. So we wait for the
client to ask us for blob A. But then what do we send? If we send the
whole blob without deltas, we're going to waste a lot of bandwidth.

The on-disk size of all of the blobs in linux.git is ~500MB. The actual
data size is ~48GB. Some of that is from zlib, which you get even for
non-deltas. But the rest of it is from the delta compression. I don't
think it's feasible to give that up, at least not for "normal" source
repos like linux.git (more on that in a minute).

So ideally you do want to send deltas. But how do you know which objects
the other side already has, which you can use as a delta base? Sending
the list of "here are the blobs I have" doesn't scale. Just the sha1s
start to add up, especially when you are doing incremental fetches.

I think this sort of thing performs a lot better when you just focus on
large objects. Because they don't tend to delta well anyway, and the
savings are much bigger by avoiding ones you don't want. So a directive
like "don't bothe
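Peff's claim above that sending the full "here are the blobs I have" list
doesn't scale is easy to quantify: a binary SHA-1 is 20 bytes, and that
cost is paid on every negotiation. The blob count below is a round
illustrative number, not a measurement of any real repository:

```python
# Back-of-envelope for "the sha1s start to add up": 20 bytes per binary
# SHA-1, times millions of blobs, sent on every incremental fetch.
SHA1_BYTES = 20
blob_count = 5_000_000            # illustrative, not a measured figure

payload = blob_count * SHA1_BYTES
print(payload // (1024 * 1024), "MiB per negotiation")  # ~95 MiB
```

Roughly 95 MiB of pure object IDs per fetch, before any actual object
data moves, which is why the thread converges on only special-casing
large objects rather than tracking every blob.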
Re: [RFC] Add support for downloading blobs on demand
On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart wrote:
>
> Goal
>
> To be able to better handle repos with many files that any individual
> developer doesn’t need it would be nice if clone/fetch only brought down
> those files that were actually needed.
>
> To enable that, we are proposing adding a flag to clone/fetch that will
> instruct the server to limit the objects it sends to commits and trees
> and to not send any blobs.
>
> When git performs an operation that requires a blob that isn’t currently
> available locally, it will download the missing blob and add it to the
> local object store.

Interesting. This is also an area I want to work on with my team at
$DAY_JOB. Repositories are growing along multiple dimensions, and
developers or editors don't always need all blobs for all time available
locally to successfully perform their work.

> Design
> ~~
>
> Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> here) similar to “--depth” that instructs the server to only return
> commits and trees and to ignore blobs.

My group at $DAY_JOB hasn't talked about it yet, but I want to add a
protocol capability that lets clone/fetch ask only for blobs smaller than
a specified byte count. This could be set to a reasonable text file size
(e.g. <= 5 MiB) to predominantly download only source files and text
documentation, omitting larger binaries. If the limit was set to 0, it's
the same as your idea to ignore all blobs.

> Later during git operations like checkout, when a blob cannot be found
> after checking all the regular places (loose, pack, alternates, etc),
> git will download the missing object and place it into the local object
> store (currently as a loose object) then resume the operation.

Right. I'd like to have this object retrieval be inside the native Git
wire protocol, reusing the remote configuration and authentication setup.
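The size-threshold capability described above ("only send blobs at or
below N bytes, with 0 meaning no blobs at all") amounts to a server-side
partition of the blob set. A minimal sketch, with hypothetical data
structures (the real decision would happen during pack enumeration):

```python
# Sketch of the proposed blob-size-threshold capability: the server
# sends blobs at or below the limit and omits the rest; a limit of 0
# omits every blob. All structures here are illustrative.

def partition_blobs(blobs, limit_bytes):
    """Split {oid: size} into (sent, omitted) per the threshold."""
    sent, omitted = {}, {}
    for oid, size in blobs.items():
        if limit_bytes > 0 and size <= limit_bytes:
            sent[oid] = size
        else:
            omitted[oid] = size
    return sent, omitted

blobs = {"readme": 2_000, "source": 40_000, "video": 300_000_000}

# A 5 MiB limit keeps the text-sized blobs and omits the large binary.
sent, omitted = partition_blobs(blobs, 5 * 1024 * 1024)
assert set(sent) == {"readme", "source"}
assert set(omitted) == {"video"}

# Limit 0 degenerates to "ignore all blobs", i.e. the --lazy-clone case.
sent, omitted = partition_blobs(blobs, 0)
assert sent == {} and len(omitted) == 3
```

Note how the 0 case makes the threshold a strict generalization of the
all-or-nothing flag Ben proposed, which is what makes the two proposals
easy to combine.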
That requires expanding the server side of the protocol implementation
slightly, allowing any reachable object to be retrieved by SHA-1 alone.
Bitmap indexes can significantly reduce the computational complexity for
the server.

> To prevent git from accidentally downloading all missing blobs, some git
> operations are updated to be aware of the potential for missing blobs.
> The most obvious being check_connected which will return success as if
> everything in the requested commits is available locally.

This ... sounds risky for the developer, as the repository may be corrupt
due to a missing object, and the user cannot determine it.

Would it be reasonable for the server to return a list of SHA-1s it knows
should exist, but has omitted due to the blob threshold (above), and the
local repository store this in a binary searchable file? During
connectivity checking it's assumed OK if an object is not present in the
object store, but is listed in this omitted objects file.

> To minimize the impact on the server, the existing dumb HTTP protocol
> endpoint “objects/” can be used to retrieve the individual missing
> blobs when needed.

I'd prefer this to be in the native wire protocol, where the objects are
in pack format (which unfortunately differs from loose format). I assume
servers would combine many objects into pack files, potentially isolating
large uncompressable binaries into their own packs, stored separately
from commits/trees/small-text-blobs.

I get the value of this being in HTTP, where HTTP caching inside proxies
can be leveraged to reduce master server load. I wonder if the native
wire protocol could be taught to use a variation of an HTTP GET that
includes the object SHA-1 in the URL line, to retrieve a one-object pack
file.

> Performance considerations
> ~~
>
> We found that downloading commits and trees on demand had a significant
> negative performance impact.
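The "binary searchable file" of omitted SHA-1s suggested above would let
check_connected distinguish deliberately omitted objects from genuine
corruption. A sketch, assuming a hypothetical sorted list of omitted
oids (the real file format and check_connected internals would differ):

```python
import bisect

# Sketch: during connectivity checking, a missing object is acceptable
# if it appears in a sorted, binary-searchable list of oids the server
# admitted to omitting. Anything missing AND unlisted is real damage.
# The file format and function names here are hypothetical.

omitted = sorted(["1111", "3333", "7777"])   # sorted omitted-object oids

def is_omitted(oid):
    i = bisect.bisect_left(omitted, oid)     # O(log n) membership test
    return i < len(omitted) and omitted[i] == oid

def check_connected(needed, local):
    """Return oids that are truly missing (neither local nor omitted)."""
    return [oid for oid in needed
            if oid not in local and not is_omitted(oid)]

local = {"2222"}
assert check_connected(["1111", "2222", "3333"], local) == []  # all fine
assert check_connected(["9999"], local) == ["9999"]  # actual corruption
```

The design choice this illustrates: instead of check_connected blindly
returning success, the omitted list preserves the ability to detect a
genuinely corrupt repository.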
> In addition, many git commands assume all commits and trees are
> available locally so they quickly got pulled down anyway. Even in very
> large repos the commits and trees are relatively small so bringing them
> down with the initial commit and subsequent fetch commands was
> reasonable.
>
> After cloning, the developer can use sparse-checkout to limit the set of
> files to the subset they need (typically only 1-10% in these large
> repos). This allows the initial checkout to only download the set of
> files actually needed to complete their task. At any point, the
> sparse-checkout file can be updated to include additional files which
> will be fetched transparently on demand.
>
> Typical source files are relatively small so the overhead of connecting
> and authenticating to the server for a single file at a time is
> substantial. As a result, having a long running process that is started
> with the first request and can cache connection information between
> requests is a significant performance win.

Junio and I talked years ago (offline, sorry no mailing list archive)
about "narrow checkout", which is the idea of the client being able to as