Re: [PATCH v2 4/4] bundle v3: the beginning
On Thu, Jun 09, 2016 at 03:53:26PM +0700, Duy Nguyen wrote:

> > Yes. To me, this was always about punting large blobs from the clones.
> > Basically the way git-lfs and other tools work, but without munging your
> > history permanently.
>
> Makes sense. If we keep all trees and commits locally, pack v4 still
> has a chance to rise!

Yeah, I don't think anything here precludes pack v4.

> > I don't know if Christian had other cases in mind (like the many-files
> > case, which I think is better served by something like narrow clones).
>
> Although for git-gc or git-fsck, I guess we need special support
> anyway not to download large blobs unnecessarily. Not sure if git-gc
> can already do that now. All I remember is git-repack can still be
> used to make a repo independent from odb alternates. We probably want
> to avoid that. git-fsck definitely should verify that large remote
> blobs are good without downloading them (a new "fsck" command to
> external odb, maybe).

I think git-gc should work out of the box; you'd want to use "repack -l",
which git-gc passes already.

Fsck would be OK as long as you didn't actually load blobs. We have
--connectivity-only for that, but of course it isn't the default. You'd
probably want the default mode to fsck local blobs, but _not_ to fault in
external blobs (but have an option to fault them all in if you really
wanted to be sure you have everything).

-Peff
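For reference, the two existing knobs Peff mentions (the external-odb
behavior around them is still hypothetical):

    # gc already passes -l (--local) to repack, so repacking does not
    # copy in objects that are merely borrowed from elsewhere:
    git repack -a -d -l

    # verify object graph connectivity without inflating blob contents:
    git fsck --connectivity-only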
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 8, 2016 at 11:19 PM, Jeff King wrote:
> On Wed, Jun 08, 2016 at 05:44:06PM +0700, Duy Nguyen wrote:
>
>> On Wed, Jun 8, 2016 at 3:23 AM, Jeff King wrote:
>> > Because this "external odb" essentially acts as a git alternate, we
>> > would hit it only when we couldn't find an object through regular means.
>> > Git would then make the object available in the usual on-disk format
>> > (probably as a loose object).
>>
>> This means git-gc (and all things that do rev-list --objects --all)
>> would download at least all trees and commits? Or will we have special
>> treatment for those commands?
>
> Yes. To me, this was always about punting large blobs from the clones.
> Basically the way git-lfs and other tools work, but without munging your
> history permanently.

Makes sense. If we keep all trees and commits locally, pack v4 still
has a chance to rise!

> I don't know if Christian had other cases in mind (like the many-files
> case, which I think is better served by something like narrow clones).

Although for git-gc or git-fsck, I guess we need special support anyway
not to download large blobs unnecessarily. Not sure if git-gc can already
do that now. All I remember is git-repack can still be used to make a
repo independent from odb alternates. We probably want to avoid that.
git-fsck definitely should verify that large remote blobs are good
without downloading them (a new "fsck" command to external odb, maybe).
--
Duy
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 08, 2016 at 11:05:20AM -0700, Junio C Hamano wrote:

> > Likewise, I'm not sure if "get" should be allowed to return contents
> > that don't match the sha1.
>
> Yes, this is what I was getting at. It would be ideal to come up
> with a way to do the large-blob offload without resorting to hacks
> (like LFS and annex where "the same object contents will always
> result in the same object name" is deliberately broken), and "object
> name must match what the data hashes down to" is a basic requirement
> for that.

I meant to elaborate here more, but it looks like I didn't.

One thing that an external odb command might want to be able to do is say
"I _do_ have that object, but it would be expensive or impossible to get
right now, so I will give you a placeholder" (e.g., you are just trying
to run "git log" while on an airplane, and you would not want to die()
because you cannot fetch some blob).

But the right way is not to have "get" send content that does not match
the requested sha1. It needs to make git aware that the object is a
placeholder, so git does not do stupid things like write the bogus
content into a loose object. The right way may be as simple as the
external odb returning a non-zero exit code, and git fills in the
placeholder data itself (or dies, possibly, depending on what the user
asks it to do).

One of the reasons I worked up that initial external-odb patch was
because I knew that before we settled on a definite interface, we would
have to try it out and see what odd corner cases came up. E.g., when do
we fault in objects in a way that's annoying to the user? Which code
paths need to handle "we do have this object available, but you can't see
it right now, so what kind of fallback can we do?". Etc.

Unfortunately I never actually did any of that testing. ;)

-Peff
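A minimal sketch of that non-zero-exit convention, assuming the
"<command> get <sha1>" interface discussed in this thread (the helper
name and host are invented):

    #!/bin/sh
    # my-ext-odb get <sha1>: exit non-zero when the object exists but
    # cannot be fetched right now, so git can fall back to placeholder
    # data instead of storing bogus content as a loose object.
    case "$1" in
    get)
        curl -fs "https://odb.example.com/objects/$2" || exit 1
        ;;
    esac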
Re: [PATCH v2 4/4] bundle v3: the beginning
Jeff King writes:

> This interface comes from my earlier patches, so I'll try to shed a
> little light on the decisions I made there.
>
> Because this "external odb" essentially acts as a git alternate, we
> would hit it only when we couldn't find an object through regular means.
> Git would then make the object available in the usual on-disk format
> (probably as a loose object).
>
> So in most processes, we would not need to consult the odb command at
> all. And when we do, the first thing would be to get its "have" list,
> which would at most run once per process.
>
> So the per-object cost is really calling "get", and my assumption there
> was that the cost of actually retrieving the object over the network
> would dwarf the fork/exec cost.

OK, presented that way, the design makes sense (I do not know if
Christian's (revised) design and implementation does or not, though, as I
haven't seen it).

As "check for non-existence" is important and costly, grabbing "have"
once is a good strategy, just like we open the .idx files of available
packfiles.

>> > - "<command> have": the command should output the sha1, size and
>> >   type of all the objects the external ODB contains, one object per
>> >   line.
>>
>> Why are size and type needed by the clients at this point? That is
>> more expensive to compute than just a bare list of object names.
>
> Yes, but it lets git avoid doing a lot of "get" operations.

OK, so it is more like having richer information in a pack-v4 index ;-)

>> > - "<command> put <sha1> <size> <type>": the command should then read
>> >   from stdin an object and store it in the external ODB.
>>
>> Is the ODB required to sanity check that <sha1> matches what the data
>> hashes down to?
>
> I think that would be up to the ODB, but it does seem like a good idea.
>
> Likewise, I'm not sure if "get" should be allowed to return contents
> that don't match the sha1.

Yes, this is what I was getting at. It would be ideal to come up with a
way to do the large-blob offload without resorting to hacks (like LFS and
annex where "the same object contents will always result in the same
object name" is deliberately broken), and "object name must match what
the data hashes down to" is a basic requirement for that.

Thanks.
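One cheap way for an ODB helper to do that sanity check on "put" is a
sketch like the following, assuming the helper receives <sha1>, <size>
and <type> as arguments and the raw object content on stdin:

    # inside a hypothetical "<command> put <sha1> <size> <type>" handler:
    sha1=$1 type=$3
    tmp=$(mktemp)
    cat >"$tmp"
    actual=$(git hash-object -t "$type" --no-filters "$tmp")
    if test "$actual" != "$sha1"
    then
        echo >&2 "refusing put: expected $sha1, got $actual"
        rm -f "$tmp"
        exit 1
    fi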
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 08, 2016 at 05:44:06PM +0700, Duy Nguyen wrote:

> On Wed, Jun 8, 2016 at 3:23 AM, Jeff King wrote:
> > Because this "external odb" essentially acts as a git alternate, we
> > would hit it only when we couldn't find an object through regular means.
> > Git would then make the object available in the usual on-disk format
> > (probably as a loose object).
>
> This means git-gc (and all things that do rev-list --objects --all)
> would download at least all trees and commits? Or will we have special
> treatment for those commands?

Yes. To me, this was always about punting large blobs from the clones.
Basically the way git-lfs and other tools work, but without munging your
history permanently.

I don't know if Christian had other cases in mind (like the many-files
case, which I think is better served by something like narrow clones).

-Peff
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 8, 2016 at 3:23 AM, Jeff King wrote:
> Because this "external odb" essentially acts as a git alternate, we
> would hit it only when we couldn't find an object through regular means.
> Git would then make the object available in the usual on-disk format
> (probably as a loose object).

This means git-gc (and all things that do rev-list --objects --all) would
download at least all trees and commits? Or will we have special
treatment for those commands?
--
Duy
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, Jun 07, 2016 at 03:19:46PM +0200, Christian Couder wrote:

> > But there are lots of cases where the server might want to tell
> > the client that don't involve bundles at all.
>
> The idea is also that anytime the server needs to send external ODB
> data to the client, it would ask its own external ODB to prepare a
> kind of bundle with that data and use the bundle v3 mechanism to send
> it.
> That may need the bundle v3 mechanism to be extended, but I don't see
> in which cases it would not work.

Ah, I see we do not have the same underlying mental model. I think the
external odb is purely the _client's_ business. The server does not have
to have an external odb at all, and does not need to know about the
client's. The client is responsible for telling the server during the
git protocol anything it would need to know (like "do not bother sending
objects over 50MB; I can get them elsewhere").

This makes the problem much more complicated, but it is more flexible and
decentralized.

> >    a. The receiving side of a connection (e.g., a fetch client)
> >       somehow has out-of-band access to some objects. How does it
> >       tell the other side "do not bother sending me these objects; I
> >       can get them in another way"?
>
> I don't see a difference with regular objects that the fetch client
> already has. If it already has some regular objects, a way to tell the
> server "don't bother sending me these objects" is useful already and
> it should be possible to use it to tell the server that there is no
> need to send some objects stored in the external ODB too.

The way to do that with normal objects is by finding shared commit tips,
and assuming the normal git repository property of "if you have X, you
have all of the objects reachable from X".

This whole idea is essentially creating "holes" in that property. You
can enumerate all of the holes, but I am not sure that scales well. We
get a lot of efficiency by communicating only ref tips during the
negotiation, and not individual object names.

> Also something like this is needed for shallow clones and narrow
> clones anyway.

Yes, and I don't think it scales well there, either. A single shallow
cutoff works OK. But if you repeatedly shallow-fetch into a repository,
you end up with a patchwork of disconnected "islands" of history. The
CPU required on the server side to serve those fetch requests is much
greater than what would normally be needed. You can't use things like
reachability bitmaps, and you have to open up the trees for each island
to see which objects the other side actually has.

> >    b. The receiving side of a connection has out-of-band access to
> >       some objects. Some of these will be expensive to get (e.g.,
> >       requiring a large download), and some may be fast (e.g.,
> >       they've already been fetched to a local cache). How do we tell
> >       the sending side not to assume we have cheap access to these
> >       objects (e.g., for use as a delta base)?
>
> I don't think we need to tell the sending side we have cheap access or
> not to some objects.
> If the objects are managed by the external ODB, it's the external ODB
> on the server and on the client that will manage these objects. They
> should not be used as delta bases.
> Perhaps there is no mechanism to say that some objects (basically all
> external ODB managed objects) should not be used as delta bases, but
> that could be added.

Yes, I agree that _if_ the server can access the list of objects
available in the external odb, this becomes much easier. I'm just not
convinced that level of coupling is a good idea.

Note that the server would also want to take this into account during
repacking, as otherwise you end up with fetches that are very expensive
to serve (you want to send X which is a delta based on Y, but you know
that Y is available via the external odb, and therefore should not be
used as a base. So you have to throw out the delta for X and either send
it whole or compute a new one. That's much more expensive than blitting
the delta from disk, which is what a normal clone would do).

-Peff
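To make the client-driven model concrete, a purely hypothetical exchange
(none of this syntax exists; it only illustrates "the client tells the
server what it can fetch elsewhere"):

    C: want 74730d410fcb6603ace96f1dc55ea6196122532d
    C: ext-odb blob-limit=50m    # invented capability: "skip blobs over
                                 # 50MB; I can get them out-of-band"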
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, Jun 07, 2016 at 12:23:40PM -0700, Junio C Hamano wrote:

> Christian Couder writes:
>
> > Git can store its objects only in the form of loose objects in
> > separate files or packed objects in a pack file.
> > To be able to better handle some kinds of objects, for example big
> > blobs, it would be nice if Git could store its objects in other object
> > databases (ODB).
> >
> > To do that, this patch series makes it possible to register commands,
> > using "odb.<name>.command" config variables, to access external
> > ODBs. Each specified command will then be called in the following ways:
>
> Hopefully it is done via a cheap RPC instead of forking/execing the
> command for each and every object lookup.

This interface comes from my earlier patches, so I'll try to shed a
little light on the decisions I made there.

Because this "external odb" essentially acts as a git alternate, we would
hit it only when we couldn't find an object through regular means. Git
would then make the object available in the usual on-disk format
(probably as a loose object).

So in most processes, we would not need to consult the odb command at
all. And when we do, the first thing would be to get its "have" list,
which would at most run once per process.

So the per-object cost is really calling "get", and my assumption there
was that the cost of actually retrieving the object over the network
would dwarf the fork/exec cost.

I also waffled on having git cache the output of "<command> have" in some
fast-lookup format to save even the single fork/exec. But I figured that
was something that could be added later if needed.

You'll note that this is sort of a "fault-in" model. Another model would
be to treat external odb updates similar to fetches. I.e., we touch the
network only during a special update operation, and then try to work
locally with whatever the external odb has. IMHO this policy could
actually be up to the external odb itself (i.e., its "have" command could
serve from a local cache if it likes).

> > - "<command> have": the command should output the sha1, size and
> >   type of all the objects the external ODB contains, one object per
> >   line.
>
> Why are size and type needed by the clients at this point? That is
> more expensive to compute than just a bare list of object names.

Yes, but it lets git avoid doing a lot of "get" operations.

For example, in a regular diff without binary-diffs enabled, we can
automatically determine that a diff will be considered binary based
purely on the size of the objects (related to core.bigfilethreshold). So
if we know the sizes, we can run "git log -p" without faulting-in each of
the objects just to say "woah, that looks binary". One can accomplish
this with .gitattributes, too, of course, but the size thing just works
out of the box.

There are other places where it will come in handy, too. E.g., fscking a
tree object you have, you want to make sure that the object referred to
with mode 100644 is actually a blob.

I also don't think the cost to compute size and type on the server is all
that important. Yes, if you're backing your external odb with a git
repository that runs "git cat-file" on the fly, it is more expensive. But
in practice, I'd expect the server side to create a static manifest and
serve it over HTTP (this also gives the benefit of things like ETags).

> > - "<command> get <sha1>": the command should then read from the
> >   external ODB the content of the object corresponding to <sha1> and
> >   output it on stdout.
>
> The type and size should be given at this point.

I don't think there's a reason not to; I didn't do so here because it
would be redundant with what Git already knows from the "have" manifest
it receives above.

> > - "<command> put <sha1> <size> <type>": the command should then read
> >   from stdin an object and store it in the external ODB.
>
> Is the ODB required to sanity check that <sha1> matches what the data
> hashes down to?

I think that would be up to the ODB, but it does seem like a good idea.

Likewise, I'm not sure if "get" should be allowed to return contents that
don't match the sha1. That would be fine for things like "diff", but
would probably make "fsck" unhappy.

> If this thing is primarily to offload large blobs, you might also
> want not "get" but "checkout" to bypass Git entirely,
> but I haven't thought it through.

My mental model is that the external odb gets the object into the local
odb, and then you can use the regular streaming-checkout code path. And
the local odb serves as your cache.

That does mean you might have two copies of each object (one in the odb,
and one in the working tree), as opposed to a true cacheless system,
which can get away with one. I think you could do that cacheless thing
with the interface here, though; the "get" operation can stream, and you
can stream directly to the working tree.

-Peff
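For illustration, such a static manifest would just be the "have" output,
one "<sha1> <size> <type>" record per line (the object names below are
made up):

    $ curl -s https://example.com/odb/manifest
    2ef7bde608ce5404e97d5f042f95f89f1c232871 104857600 blob
    5d41402abc4b2a76b9719d911017c592338a1a6b 734003200 blob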
Re: [PATCH v2 4/4] bundle v3: the beginning
Christian Couder writes:

> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).
>
> To do that, this patch series makes it possible to register commands,
> using "odb.<name>.command" config variables, to access external
> ODBs. Each specified command will then be called in the following ways:

Hopefully it is done via a cheap RPC instead of forking/execing the
command for each and every object lookup.

> - "<command> have": the command should output the sha1, size and
>   type of all the objects the external ODB contains, one object per
>   line.

Why are size and type needed by the clients at this point? That is more
expensive to compute than just a bare list of object names.

> - "<command> get <sha1>": the command should then read from the
>   external ODB the content of the object corresponding to <sha1> and
>   output it on stdout.

The type and size should be given at this point.

> - "<command> put <sha1> <size> <type>": the command should then read
>   from stdin an object and store it in the external ODB.

Is the ODB required to sanity check that <sha1> matches what the data
hashes down to?

If this thing is primarily to offload large blobs, you might also want
not "get" but "checkout" to bypass Git entirely, but I haven't thought it
through.
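For context, registering a helper under the config scheme quoted above
would look like this (the helper path is invented):

    [odb "magic"]
        command = /usr/local/bin/my-ext-odb

or, equivalently:

    $ git config odb.magic.command /usr/local/bin/my-ext-odb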
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 1, 2016 at 3:37 PM, Duy Nguyen wrote:
> On Tue, May 31, 2016 at 8:18 PM, Christian Couder wrote:
>>>> I wonder if this mechanism could also be used or extended to clone
>>>> and fetch an alternate object database.
>>>>
>>>> In [1], [2] and [3], and this was also discussed during the
>>>> Contributor Summit last month, Peff says that he started working on
>>>> alternate object database support a long time ago, and that the hard
>>>> part is a protocol extension to tell remotes that you can access
>>>> some objects in a different way.
>>>>
>>>> If a Git client would download a "$name.bndl" v3 bundle file that
>>>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>>>> client would just need to download "$URL/alt-odb-$name.odb" and use
>>>> the alternate object database support on this file.
>>>
>>> What does this file contain exactly? A list of SHA-1s that can be
>>> retrieved from this remote/alternate odb?
>>
>> It would depend on the external odb. Git could support different
>> external odbs that have different trade-offs.
>>
>>> I wonder if we could just use git-replace for this marking. The
>>> replaced content could contain the uri pointing to the alt odb.
>>
>> Yeah, interesting!
>> That's indeed another possibility that might not need the transfer of
>> any external odb.
>>
>> But in this case it might be cleaner to just have a separate ref
>> hierarchy like:
>>
>>     refs/external-odbs/my-ext-odb/<sha1>
>>
>> instead of using the replace one.
>>
>> Or maybe:
>>
>>     refs/replace/external-odbs/my-ext-odb/<sha1>
>>
>> if we really want to use the replace hierarchy.
>
> Yep. replace hierarchy crossed my mind. But then I thought about
> performance degradation when there are more than one pack (we have to
> search through them all for every SHA-1) and discarded it because we
> would need to do the same linear search here. I guess we will most
> likely have one or two name spaces so it probably won't matter.

Yeah.

>>> We could optionally contact alt odb to retrieve real content, or
>>> just show the replaced/fake data when alt odb is out of reach.
>>
>> Yeah, I wonder if that really needs the replace mechanism.
>
> Replace mechanism provides good hook point. But it really depends how
> invasive this remote odb is. If a fake content is enough to avoid
> breakages up high, git-replace is enough. If you really need to pass
> remote odb info up so higher levels can do something more fancy, then
> it's insufficient.
>
>> By the way this makes me wonder if we could implement resumable clone
>> using some kind of replace ref.
>>
>> The client while cloning nearly as usual would download one or more
>> special replace refs that would point to objects with links to
>> download bundles using standard protocols.
>> Just after the clone, the client would read these objects and
>> download the bundles from these objects.
>> And then it would clone from these bundles.
>
> I thought we have settled on resumable clone, just waiting for an
> implementation :) Doing it your way, you would need to download these
> special objects too (in a pack?) and come back to download some more
> bundles. It would be more efficient to show the bundle uri early and
> go download the bundle on the side while you go on to get the
> additional/smaller pack that contains the rest.

Yeah, something like the bundle v3 mechanism is probably more efficient.

Thanks,
Christian.
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 1, 2016 at 12:31 AM, Jeff King wrote:
> On Fri, May 20, 2016 at 02:39:06PM +0200, Christian Couder wrote:
>
>> I wonder if this mechanism could also be used or extended to clone and
>> fetch an alternate object database.
>>
>> In [1], [2] and [3], and this was also discussed during the
>> Contributor Summit last month, Peff says that he started working on
>> alternate object database support a long time ago, and that the hard
>> part is a protocol extension to tell remotes that you can access some
>> objects in a different way.
>>
>> If a Git client would download a "$name.bndl" v3 bundle file that
>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>> client would just need to download "$URL/alt-odb-$name.odb" and use
>> the alternate object database support on this file.
>
> I'm not sure about this strategy.

I am also not sure that this is the best strategy, but I think it's worth
discussing.

> I see two complications:
>
>   1. I don't think bundles need to be a part of this "external odb"
>      strategy at all. If I understand correctly, I think you want to use
>      it as a place to stuff metadata that the server tells the client,
>      like "by the way, go here if you want another way to access some
>      objects".

Yeah, basically I think it might be possible to use the bundle mechanism
to transfer what an external ODB on the client would need to be
initialized or updated.

>      But there are lots of cases where the server might want to tell
>      the client that don't involve bundles at all.

The idea is also that anytime the server needs to send external ODB data
to the client, it would ask its own external ODB to prepare a kind of
bundle with that data and use the bundle v3 mechanism to send it.
That may need the bundle v3 mechanism to be extended, but I don't see in
which cases it would not work.

>   2. A server pointing the client to another object store is actually
>      the least interesting bit of the protocol.
>
>      The more interesting cases (to me) are:
>
>        a. The receiving side of a connection (e.g., a fetch client)
>           somehow has out-of-band access to some objects. How does it
>           tell the other side "do not bother sending me these objects;
>           I can get them in another way"?

I don't see a difference with regular objects that the fetch client
already has. If it already has some regular objects, a way to tell the
server "don't bother sending me these objects" is useful already and it
should be possible to use it to tell the server that there is no need to
send some objects stored in the external ODB too.

Also something like this is needed for shallow clones and narrow clones
anyway.

>        b. The receiving side of a connection has out-of-band access to
>           some objects. Some of these will be expensive to get (e.g.,
>           requiring a large download), and some may be fast (e.g.,
>           they've already been fetched to a local cache). How do we
>           tell the sending side not to assume we have cheap access to
>           these objects (e.g., for use as a delta base)?

I don't think we need to tell the sending side we have cheap access or
not to some objects.
If the objects are managed by the external ODB, it's the external ODB on
the server and on the client that will manage these objects. They should
not be used as delta bases.
Perhaps there is no mechanism to say that some objects (basically all
external ODB managed objects) should not be used as delta bases, but that
could be added.

Thanks,
Christian.
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, Jun 7, 2016 at 3:46 PM, Christian Couder wrote:
>> Any thought on object streaming support?
>
> No I didn't think about this. In fact I am not sure what this means.
>
>> It could be a big deal (might affect some design decisions).
>
> Could you elaborate on this?

The object streaming api is in streaming.h. Normally objects are small
and we can inflate the whole thing in memory before doing anything with
them. For really large objects (which I guess is one of the reasons for
remote odb) we don't want to do that. It takes lots of memory and you
could have objects larger than your physical memory. In some cases we
can ignore those objects (e.g. mark them binary and choose not to diff).
In some other cases (e.g. checkout), we use the streaming interface to
process an object while we're inflating it, to keep memory usage down.
It's easy to add a new streaming backend, once you settle on how remote
odb streams stuff.

>> I would also think about how pack v4 fits in this (e.g. how a tree
>> walker can still walk fast, a big promise of pack v4; I suppose if
>> you still maintain "pack" concept over external odb then it might
>> work). Not that it really matters. Pack v4 is the future, but the
>> future can never be "today" :)
>
> Sorry I haven't really followed pack v4 and I forgot what it is about.

It's a new pack format (and practically vaporware at this point) that
promises much faster access when you need to walk through trees and
commits (think rev-list --objects --all, or git-blame). Because we are
(or I am) still not sure if pack v4 will ever get to the state where it
can be merged to git.git, I think it's ok for you to ignore it too if you
want. You can read more about the format here [1] and go even further
back to [2] when Nicolas teased us with the pack size (smaller, which is
a nice side effect).

The potential issue with pack v4 is, the tree walker (struct tree_desc
and related funcs in tree-walk.h) needs to know about pack v4 in order to
walk fast. The current tree walker does not care if an object is packed
(using what format) at all. A remote odb for pack v4 must have some way
that allows reading pack data directly, something close to "mmap"; it's
not just about an api to "get me the canonical content of this object".

[1] http://article.gmane.org/gmane.comp.version-control.git/234012
[2] http://article.gmane.org/gmane.comp.version-control.git/233038
--
Duy
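For readers who have not used it, consuming that interface looks roughly
like this (a sketch against the streaming.h API of this era; git's own
checkout path does essentially this in stream_blob_to_fd()):

    #include "cache.h"
    #include "streaming.h"

    /* Copy a possibly huge blob to fd without inflating it whole. */
    static int copy_blob_to_fd_sketch(int fd, const unsigned char *sha1)
    {
        enum object_type type;
        unsigned long size;
        struct git_istream *st = open_istream(sha1, &type, &size, NULL);
        char buf[16384];
        ssize_t len;

        if (!st)
            return error("cannot stream blob %s", sha1_to_hex(sha1));
        while ((len = read_istream(st, buf, sizeof(buf))) > 0)
            write_in_full(fd, buf, len);
        close_istream(st);
        return len < 0 ? -1 : 0;
    }

A remote-odb streaming backend would have to slot in behind
open_istream() to get this behavior for free, which is why streaming is a
design input rather than an afterthought.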
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, Jun 07, 2016 at 10:46:07AM +0200, Christian Couder wrote:

> The high level overview of the patch series I would like to send
> really soon now could go like this:
>
> ---
> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).
>
> To do that, this patch series makes it possible to register commands,
> using "odb.<name>.command" config variables, to access external
> ODBs. Each specified command will then be called in the following ways:
>
> - "<command> have": the command should output the sha1, size and
>   type of all the objects the external ODB contains, one object per
>   line.
> - "<command> get <sha1>": the command should then read from the
>   external ODB the content of the object corresponding to <sha1> and
>   output it on stdout.
> - "<command> put <sha1> <size> <type>": the command should then read
>   from stdin an object and store it in the external ODB.

(disclaimer: I didn't look at the patch series)

Does this mean you're going to fork/exec() a new <command> for each of
these? It would probably be better if it was "batched", where the
executable is invoked once and the commands are passed to its stdin.

Mike
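A batched variant might look like this (entirely hypothetical; nothing in
the series implements it):

    $ my-ext-odb --batch      # invented flag: one process, many requests
    get 2ef7bde608ce5404e97d5f042f95f89f1c232871
    put 5d41402abc4b2a76b9719d911017c592338a1a6b 734003200 blob
    ...

A real design would also need framing (length headers or flush markers)
so each side knows where a response ends; the pack protocol solves the
same problem with pkt-lines.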
Re: [PATCH v2 4/4] bundle v3: the beginning
On Wed, Jun 1, 2016 at 4:00 PM, Duy Nguyen wrote:
> On Tue, May 31, 2016 at 8:18 PM, Christian Couder wrote:
>>>> [3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>>>
>>> This points to https://github.com/peff/git/commits/jk/external-odb
>>> which is dead. Jeff, do you still have it somewhere, or is it not
>>> worth looking at anymore?
>>
>> I have rebased, fixed and improved it a bit. I added write support
>> for blobs. But the result is not very clean right now.
>> I was going to send an RFC patch series after cleaning the result,
>> but as you ask, here are some links to some branches:
>>
>> - https://github.com/chriscool/git/commits/gl-external-odb3 (the
>>   updated patches from Peff, plus 2 small patches from me)
>> - https://github.com/chriscool/git/commits/gl-external-odb7 (the same
>>   as above, plus a number of WIP patches to add blob write support)
>
> Thanks. I had a super quick look. It would be nice if you could give a
> high level overview on this (if you're going to spend a lot more time
> on it).

Sorry about the late answer.

Here is a new series after some cleanup:

https://github.com/chriscool/git/commits/gl-external-odb12

The high level overview of the patch series I would like to send really
soon now could go like this:

---
Git can store its objects only in the form of loose objects in separate
files or packed objects in a pack file.
To be able to better handle some kinds of objects, for example big blobs,
it would be nice if Git could store its objects in other object databases
(ODB).

To do that, this patch series makes it possible to register commands,
using "odb.<name>.command" config variables, to access external ODBs.
Each specified command will then be called in the following ways:

- "<command> have": the command should output the sha1, size and type of
  all the objects the external ODB contains, one object per line.
- "<command> get <sha1>": the command should then read from the external
  ODB the content of the object corresponding to <sha1> and output it on
  stdout.
- "<command> put <sha1> <size> <type>": the command should then read from
  stdin an object and store it in the external ODB.

This RFC patch series does not address the following important parts of a
complete solution:

- There is no way to transfer external ODB content using Git.
- No real external ODB has been interfaced with Git. The tests use
  another git repo in a separate directory for this purpose, which is
  probably useless in the real world.
---

> One random thought, maybe it's better to have a daemon for external
> odb right from the start (one for all odbs, or one per odb, I don't
> know). It could do fancy stuff like object caching if necessary, and
> it can avoid high cost handshake (e.g. via tls) every time a git
> process runs and gets one object. Reducing process spawn would
> definitely receive a big cheer from Windows crowd.

The caching could be done inside Git and I am not sure it's worth
optimizing this now.
It could also make it more difficult to write support for an external
ODB if we required a daemon.
Maybe later we can add support for "odb.<name>.daemon" if we think that
this is worth it.

> Any thought on object streaming support?

No I didn't think about this. In fact I am not sure what this means.

> It could be a big deal (might affect some design decisions).

Could you elaborate on this?

> I would also think about how pack v4 fits in this (e.g. how a tree
> walker can still walk fast, a big promise of pack v4; I suppose if you
> still maintain "pack" concept over external odb then it might work).
> Not that it really matters. Pack v4 is the future, but the future can
> never be "today" :)

Sorry I haven't really followed pack v4 and I forgot what it is about.

Thanks,
Christian.
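To make the three calls concrete, a toy helper could be as small as this
sketch (illustration only: it keeps raw object contents as files named
after their sha1 under a fixed directory, and pretends everything is a
blob):

    #!/bin/sh
    # my-ext-odb: toy external ODB helper implementing have/get/put.
    # "have" lines are "<sha1> <size> <type>", as the series expects.
    ODB_DIR=${ODB_DIR:-/srv/ext-odb}

    case "$1" in
    have)
        for f in "$ODB_DIR"/*
        do
            test -e "$f" || continue
            echo "$(basename "$f") $(wc -c <"$f" | tr -d ' ') blob"
        done
        ;;
    get)
        cat "$ODB_DIR/$2"       # object content to stdout
        ;;
    put)
        cat >"$ODB_DIR/$2"      # args: <sha1> <size> <type>; data on stdin
        ;;
    *)
        echo >&2 "usage: $0 have | get <sha1> | put <sha1> <size> <type>"
        exit 1
        ;;
    esac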
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, May 31, 2016 at 8:18 PM, Christian Couder wrote:
>>> [3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>>
>> This points to https://github.com/peff/git/commits/jk/external-odb
>> which is dead. Jeff, do you still have it somewhere, or is it not
>> worth looking at anymore?
>
> I have rebased, fixed and improved it a bit. I added write support for
> blobs. But the result is not very clean right now.
> I was going to send an RFC patch series after cleaning the result, but
> as you ask, here are some links to some branches:
>
> - https://github.com/chriscool/git/commits/gl-external-odb3 (the
>   updated patches from Peff, plus 2 small patches from me)
> - https://github.com/chriscool/git/commits/gl-external-odb7 (the same
>   as above, plus a number of WIP patches to add blob write support)

Thanks. I had a super quick look. It would be nice if you could give a
high level overview on this (if you're going to spend a lot more time on
it).

One random thought, maybe it's better to have a daemon for external odb
right from the start (one for all odbs, or one per odb, I don't know).
It could do fancy stuff like object caching if necessary, and it can
avoid high cost handshake (e.g. via tls) every time a git process runs
and gets one object. Reducing process spawn would definitely receive a
big cheer from the Windows crowd.

Any thought on object streaming support? It could be a big deal (might
affect some design decisions). I would also think about how pack v4 fits
in this (e.g. how a tree walker can still walk fast, a big promise of
pack v4; I suppose if you still maintain the "pack" concept over external
odb then it might work). Not that it really matters. Pack v4 is the
future, but the future can never be "today" :)
--
Duy
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, May 31, 2016 at 8:18 PM, Christian Couder wrote:
>>> I wonder if this mechanism could also be used or extended to clone
>>> and fetch an alternate object database.
>>>
>>> In [1], [2] and [3], and this was also discussed during the
>>> Contributor Summit last month, Peff says that he started working on
>>> alternate object database support a long time ago, and that the hard
>>> part is a protocol extension to tell remotes that you can access
>>> some objects in a different way.
>>>
>>> If a Git client would download a "$name.bndl" v3 bundle file that
>>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>>> client would just need to download "$URL/alt-odb-$name.odb" and use
>>> the alternate object database support on this file.
>>
>> What does this file contain exactly? A list of SHA-1s that can be
>> retrieved from this remote/alternate odb?
>
> It would depend on the external odb. Git could support different
> external odbs that have different trade-offs.
>
>> I wonder if we could just use git-replace for this marking. The
>> replaced content could contain the uri pointing to the alt odb.
>
> Yeah, interesting!
> That's indeed another possibility that might not need the transfer of
> any external odb.
>
> But in this case it might be cleaner to just have a separate ref
> hierarchy like:
>
>     refs/external-odbs/my-ext-odb/<sha1>
>
> instead of using the replace one.
>
> Or maybe:
>
>     refs/replace/external-odbs/my-ext-odb/<sha1>
>
> if we really want to use the replace hierarchy.

Yep. replace hierarchy crossed my mind. But then I thought about
performance degradation when there are more than one pack (we have to
search through them all for every SHA-1) and discarded it because we
would need to do the same linear search here. I guess we will most likely
have one or two name spaces so it probably won't matter.

>> We could optionally contact alt odb to retrieve real content, or just
>> show the replaced/fake data when alt odb is out of reach.
>
> Yeah, I wonder if that really needs the replace mechanism.

The replace mechanism provides a good hook point. But it really depends
how invasive this remote odb is. If fake content is enough to avoid
breakages up high, git-replace is enough. If you really need to pass
remote odb info up so higher levels can do something more fancy, then
it's insufficient.

> By the way this makes me wonder if we could implement resumable clone
> using some kind of replace ref.
>
> The client while cloning nearly as usual would download one or more
> special replace refs that would point to objects with links to
> download bundles using standard protocols.
> Just after the clone, the client would read these objects and download
> the bundles from these objects.
> And then it would clone from these bundles.

I thought we have settled on resumable clone, just waiting for an
implementation :) Doing it your way, you would need to download these
special objects too (in a pack?) and come back to download some more
bundles. It would be more efficient to show the bundle uri early and go
download the bundle on the side while you go on to get the
additional/smaller pack that contains the rest.
--
Duy
Re: [PATCH v2 4/4] bundle v3: the beginning
On Fri, May 20, 2016 at 02:39:06PM +0200, Christian Couder wrote:

> I wonder if this mechanism could also be used or extended to clone and
> fetch an alternate object database.
>
> In [1], [2] and [3], and this was also discussed during the
> Contributor Summit last month, Peff says that he started working on
> alternate object database support a long time ago, and that the hard
> part is a protocol extension to tell remotes that you can access some
> objects in a different way.
>
> If a Git client would download a "$name.bndl" v3 bundle file that
> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
> client would just need to download "$URL/alt-odb-$name.odb" and use
> the alternate object database support on this file.
>
> This way it would know all it has to know to access the objects in the
> alternate database. The alternate object database may not contain the
> real objects, if they are too big for example, but just files that
> describe how to get the real objects.

I'm not sure about this strategy. I see two complications:

  1. I don't think bundles need to be a part of this "external odb"
     strategy at all. If I understand correctly, I think you want to use
     it as a place to stuff metadata that the server tells the client,
     like "by the way, go here if you want another way to access some
     objects".

     But there are lots of cases where the server might want to tell
     the client that don't involve bundles at all.

  2. A server pointing the client to another object store is actually
     the least interesting bit of the protocol.

     The more interesting cases (to me) are:

       a. The receiving side of a connection (e.g., a fetch client)
          somehow has out-of-band access to some objects. How does it
          tell the other side "do not bother sending me these objects; I
          can get them in another way"?

       b. The receiving side of a connection has out-of-band access to
          some objects. Some of these will be expensive to get (e.g.,
          requiring a large download), and some may be fast (e.g.,
          they've already been fetched to a local cache). How do we tell
          the sending side not to assume we have cheap access to these
          objects (e.g., for use as a delta base)?

So I don't think you want to tie this into bundles due to (1), and I
think that bundles would be insufficient anyway because of (2).

Or maybe I'm misunderstanding what you propose.

-Peff
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, May 31, 2016 at 07:43:27PM +0700, Duy Nguyen wrote:

> > [3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>
> This points to https://github.com/peff/git/commits/jk/external-odb
> which is dead. Jeff, do you still have it somewhere, or is it not
> worth looking at anymore?

It's now "jk/external-odb-wip" at the same repo. I wouldn't be surprised
if it doesn't even compile, though. I basically rebase my topics daily
against Junio's "master", so it may be carried forward, but things marked
"-wip" aren't part of my daily git build, and generally don't even get
compile-tested (usually if the rebase looks too hairy or awful, I'll drop
it completely, though, and I haven't done that here).

You're probably better off looking at whatever Christian produces. :)

-Peff
Re: [PATCH v2 4/4] bundle v3: the beginning
On Tue, May 31, 2016 at 2:43 PM, Duy Nguyen wrote:
> On Fri, May 20, 2016 at 7:39 PM, Christian Couder wrote:
>> I am responding to this 2+ month old email because I am investigating
>> adding an alternate object store at the same level as loose and packed
>> objects. This alternate object store could be used for large files. I
>> am working on this for GitLab. (Yeah, I am working, as a freelance,
>> for both Booking.com and GitLab these days.)
>
> I'm also interested in this from a different angle, narrow clone that
> potentially allows skipping the download of some large blobs (likely
> old ones from the past that nobody will bother with).

Interesting!

[...]

>> I wonder if this mechanism could also be used or extended to clone and
>> fetch an alternate object database.
>>
>> In [1], [2] and [3], and this was also discussed during the
>> Contributor Summit last month, Peff says that he started working on
>> alternate object database support a long time ago, and that the hard
>> part is a protocol extension to tell remotes that you can access some
>> objects in a different way.
>>
>> If a Git client would download a "$name.bndl" v3 bundle file that
>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>> client would just need to download "$URL/alt-odb-$name.odb" and use
>> the alternate object database support on this file.
>
> What does this file contain exactly? A list of SHA-1s that can be
> retrieved from this remote/alternate odb?

It would depend on the external odb. Git could support different external
odbs that have different trade-offs.

> I wonder if we could just use git-replace for this marking. The
> replaced content could contain the uri pointing to the alt odb.

Yeah, interesting!
That's indeed another possibility that might not need the transfer of any
external odb.

But in this case it might be cleaner to just have a separate ref
hierarchy like:

    refs/external-odbs/my-ext-odb/<sha1>

instead of using the replace one.

Or maybe:

    refs/replace/external-odbs/my-ext-odb/<sha1>

if we really want to use the replace hierarchy.

> We could optionally contact alt odb to retrieve real content, or just
> show the replaced/fake data when alt odb is out of reach.

Yeah, I wonder if that really needs the replace mechanism.

> Transferring git-replace is basically ref exchange, which may be fine
> if you don't have a lot of objects in this alt odb.

Yeah sure, great idea!

By the way this makes me wonder if we could implement resumable clone
using some kind of replace ref.

The client while cloning nearly as usual would download one or more
special replace refs that would point to objects with links to download
bundles using standard protocols.
Just after the clone, the client would read these objects and download
the bundles from these objects.
And then it would clone from these bundles.

> If you do, well, we need to deal with lots of refs anyway.
> This may benefit from it too.
>
>> [3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>
> This points to https://github.com/peff/git/commits/jk/external-odb
> which is dead. Jeff, do you still have it somewhere, or is it not
> worth looking at anymore?

I have rebased, fixed and improved it a bit. I added write support for
blobs. But the result is not very clean right now.
I was going to send an RFC patch series after cleaning the result, but as
you ask, here are some links to some branches:

- https://github.com/chriscool/git/commits/gl-external-odb3 (the updated
  patches from Peff, plus 2 small patches from me)
- https://github.com/chriscool/git/commits/gl-external-odb7 (the same as
  above, plus a number of WIP patches to add blob write support)

Thanks,
Christian.
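A rough sketch of that marking idea (note: git itself only consults
refs/replace/<sha1> today, so the nested hierarchy is purely the proposal
above, and the placeholder format is invented):

    big=5d41402abc4b2a76b9719d911017c592338a1a6b   # sha1 of a large blob

    # store a small placeholder blob saying where the real data lives:
    placeholder=$(echo "uri: https://odb.example.com/objects/$big" |
                  git hash-object -w --stdin)

    # mark the large blob as replaced by the placeholder:
    git update-ref refs/replace/external-odbs/my-ext-odb/$big $placeholder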
Re: [PATCH v2 4/4] bundle v3: the beginning
On Fri, May 20, 2016 at 7:39 PM, Christian Couder wrote:
> I am responding to this 2+ month old email because I am investigating
> adding an alternate object store at the same level as loose and packed
> objects. This alternate object store could be used for large files. I
> am working on this for GitLab. (Yeah, I am working, as a freelance,
> for both Booking.com and GitLab these days.)

I'm also interested in this from a different angle, narrow clone that
potentially allows skipping the download of some large blobs (likely old
ones from the past that nobody will bother with).

> On Wed, Mar 2, 2016 at 9:32 PM, Junio C Hamano wrote:
>> The bundle v3 format introduces an ability to have the bundle header
>> (which describes what references in the bundled history can be
>> fetched, and what objects the receiving repository must have in
>> order to unbundle it successfully) in one file, and the bundled pack
>> stream data in a separate file.
>>
>> A v3 bundle file begins with a line with "# v3 git bundle", followed
>> by zero or more "extended header" lines, and an empty line, finally
>> followed by the list of prerequisites and references in the same
>> format as v2 bundle. If it uses the "split bundle" feature, there
>> is a "data: $URL" extended header line, and nothing follows the list
>> of prerequisites and references. Also, a "sha1: " extended header
>> line may exist to help validating that the pack stream data matches
>> the bundle header.
>>
>> A typical expected use of a split bundle is to help an initial clone
>> that involves a huge data transfer, and would go like this:
>>
>> - Any repository people would clone and fetch from would regularly
>>   be repacked, and it is expected that there would be a packfile
>>   without prerequisites that holds all (or at least most) of the
>>   history of it (call it pack-$name.pack).
>>
>> - After arranging that packfile to be downloadable over popular
>>   transfer methods used for serving static files (such as HTTP or
>>   HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
>>   bundle file (call it $name.bndl) can be prepared with an extended
>>   header "data: $URL/pack-$name.pack" to point at the download
>>   location for the packfile, and be served at "$URL/$name.bndl".
>>
>> - An updated Git client, when trying to "git clone" from such a
>>   repository, may be redirected to "$URL/$name.bndl", which would be
>>   a tiny text file (when the split bundle feature is used).
>>
>> - The client would then inspect the downloaded $name.bndl, learn
>>   that the corresponding packfile exists at $URL/pack-$name.pack,
>>   and download it as pack-$name.pack, until the download succeeds.
>>   This can easily be done with a "wget --continue" equivalent over an
>>   unreliable link. The checksum recorded on the "sha1: " header
>>   line is expected to be used by this downloader (not written yet).
>
> I wonder if this mechanism could also be used or extended to clone and
> fetch an alternate object database.
>
> In [1], [2] and [3], and this was also discussed during the
> Contributor Summit last month, Peff says that he started working on
> alternate object database support a long time ago, and that the hard
> part is a protocol extension to tell remotes that you can access some
> objects in a different way.
>
> If a Git client would download a "$name.bndl" v3 bundle file that
> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
> client would just need to download "$URL/alt-odb-$name.odb" and use
> the alternate object database support on this file.

What does this file contain exactly? A list of SHA-1s that can be
retrieved from this remote/alternate odb? I wonder if we could just use
git-replace for this marking. The replaced content could contain the uri
pointing to the alt odb. We could optionally contact alt odb to retrieve
real content, or just show the replaced/fake data when alt odb is out of
reach. Transferring git-replace is basically ref exchange, which may be
fine if you don't have a lot of objects in this alt odb. If you do, well,
we need to deal with lots of refs anyway. This may benefit from it too.

> [3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020

This points to https://github.com/peff/git/commits/jk/external-odb which
is dead. Jeff, do you still have it somewhere, or is it not worth looking
at anymore?
--
Duy
Re: [PATCH v2 4/4] bundle v3: the beginning
I am responding to this 2+ month old email because I am investigating adding an alternate object store at the same level as loose and packed objects. This alternate object store could be used for large files. I am working on this for GitLab. (Yeah, I am working, as a freelance, for both Booking.com and GitLab these days.) On Wed, Mar 2, 2016 at 9:32 PM, Junio C Hamanowrote: > The bundle v3 format introduces an ability to have the bundle header > (which describes what references in the bundled history can be > fetched, and what objects the receiving repository must have in > order to unbundle it successfully) in one file, and the bundled pack > stream data in a separate file. > > A v3 bundle file begins with a line with "# v3 git bundle", followed > by zero or more "extended header" lines, and an empty line, finally > followed by the list of prerequisites and references in the same > format as v2 bundle. If it uses the "split bundle" feature, there > is a "data: $URL" extended header line, and nothing follows the list > of prerequisites and references. Also, "sha1: " extended header > line may exist to help validating that the pack stream data matches > the bundle header. > > A typical expected use of a split bundle is to help initial clone > that involves a huge data transfer, and would go like this: > > - Any repository people would clone and fetch from would regularly >be repacked, and it is expected that there would be a packfile >without prerequisites that holds all (or at least most) of the >history of it (call it pack-$name.pack). > > - After arranging that packfile to be downloadable over popular >transfer methods used for serving static files (such as HTTP or >HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3 >bundle file (call it $name.bndl) can be prepared with an extended >header "data: $URL/pack-$name.pack" to point at the download >location for the packfile, and be served at "$URL/$name.bndl". > > - An updated Git client, when trying to "git clone" from such a >repository, may be redirected to $URL/$name.bndl", which would be >a tiny text file (when split bundle feature is used). > > - The client would then inspect the downloaded $name.bndl, learn >that the corresponding packfile exists at $URL/pack-$name.pack, >and downloads it as pack-$name.pack, until the download succeeds. >This can easily be done with "wget --continue" equivalent over an >unreliable link. The checksum recorded on the "sha1: " header >line is expected to be used by this downloader (not written yet). I wonder if this mechanism could also be used or extended to clone and fetch an alternate object database. In [1], [2] and [3], and this was also discussed during the Contributor Summit last month, Peff says that he started working on alternate object database support a long time ago, and that the hard part is a protocol extension to tell remotes that you can access some objects in a different way. If a Git client would download a "$name.bndl" v3 bundle file that would have a "data: $URL/alt-odb-$name.odb" extended header, the Git client would just need to download "$URL/alt-odb-$name.odb" and use the alternate object database support on this file. This way it would know all it has to know to access the objects in the alternate database. The alternate object database may not contain the real objects, if they are too big for example, but just files that describe how to get the real objects. 
>  - After fully downloading $name.bndl and pack-$name.pack and
>    storing them next to each other, the client would clone from the
>    $name.bndl; this would populate the newly created repository with
>    reasonably recent history.
>
>  - Then the client can issue "git fetch" against the original
>    repository to obtain the most recent part of the history created
>    since the bundle was made.

[1] http://thread.gmane.org/gmane.comp.version-control.git/206886/focus=207040
[2] http://thread.gmane.org/gmane.comp.version-control.git/247171
[3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
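The client side of the quoted procedure could be approximated today
with commodity tools; a sketch, with $URL and $name as in the quoted
text, $ORIGIN_URL standing in for the original repository's URL, and
"git clone" from a split bundle being exactly the part that does not
exist yet:

    wget "$URL/$name.bndl"
    wget --continue "$URL/pack-$name.pack"  # resumable on a bad link
    git clone "$name.bndl" repo             # needs split-bundle support
    cd repo
    git remote set-url origin "$ORIGIN_URL"
    git fetch origin                        # catch up past the bundle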
Re: [PATCH v2 4/4] bundle v3: the beginning
On Thu, Mar 3, 2016 at 9:57 AM, Junio C Hamano wrote:
> Duy Nguyen writes:
>
>> would it be
>> ok if we introduced a minimal resumable download service via
>> git-daemon to enable this feature with very little setup? Like
>> git-shell, you can only download certain packfiles for this use case
>> and nothing else with this service.
>
> I think it is a matter of priorities.
>
> A minimalistic site that offers only git-daemon traffic without a
> working HTTP server would certainly benefit from such a thing, but
> serving static files efficiently over the web is a commodity service
> these days. Wouldn't it be sufficient to just recommend having a
> normal HTTP server serving static files, which should be "very
> little setup" in today's world?
>
> Such a "minimal resumable download service" over the git-daemon
> transport still has to reinvent what is already done well by the
> HTTP servers and clients (e.g. support of an ETag equivalent to make
> sure that the client can notice that the underlying data has changed
> for a given resource, headers to communicate the total length,
> making a range request and responding to it, etc. etc.).
>
> In addition, by going the custom protocol route, you wouldn't
> benefit from caching HTTP proxies available to the clients.
>
> So I am not sure if the benefit outweighs the cost.

What I had in mind was individuals who just want to publish their
work over git://. Right now it's just a matter of running git-daemon
and configuring it a bit. If it were me, I wouldn't expect all the
bells and whistles that come with HTTP. But I agree that this is a
low-priority, "scratch your own itch" kind of thing. Let's have
resumable clone with standard download protocols first, then we'll
see.
--
Duy
Re: [PATCH v2 4/4] bundle v3: the beginning
Duy Nguyen writes:

> would it be
> ok if we introduced a minimal resumable download service via
> git-daemon to enable this feature with very little setup? Like
> git-shell, you can only download certain packfiles for this use case
> and nothing else with this service.

I think it is a matter of priorities.

A minimalistic site that offers only git-daemon traffic without a
working HTTP server would certainly benefit from such a thing, but
serving static files efficiently over the web is a commodity service
these days. Wouldn't it be sufficient to just recommend having a
normal HTTP server serving static files, which should be "very
little setup" in today's world?

Such a "minimal resumable download service" over the git-daemon
transport still has to reinvent what is already done well by the
HTTP servers and clients (e.g. support of an ETag equivalent to make
sure that the client can notice that the underlying data has changed
for a given resource, headers to communicate the total length,
making a range request and responding to it, etc. etc.).

In addition, by going the custom protocol route, you wouldn't
benefit from caching HTTP proxies available to the clients.

So I am not sure if the benefit outweighs the cost. I wouldn't stop
you if you really want to do it, but again, it is a matter of
priorities. I personally feel that it would be a waste of engineering
talent, and it certainly would be a waste of review bandwidth, if you
gave priority to this over other more widely useful parts of the
system.

The procedure to repack should be updated to produce such a base pack
with the separate bundle header on the server side, the protocol
needs to be updated to allow redirection for "clone" traffic, the
logic to decide when to redirect must be designed (e.g. a "single
branch" clone should not choose a pack/bundle that represents the
full repository, but a pack for the branch that was asked for), etc.
There are still tons of things that need to be done, and it would be
a distraction to invent a custom download service that nobody other
than git-daemon speaks before all of the above is done.
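To underline the point: every item on that list is already a
one-liner with stock HTTP tools (a sketch, with $URL and $name as
placeholders, as elsewhere in the thread):

    # HEAD request: total length and ETag come for free.
    curl -I "$URL/pack-$name.pack"

    # Resume an interrupted download via a Range request.
    curl -C - -o "pack-$name.pack" "$URL/pack-$name.pack"

    # or, equivalently:
    wget --continue "$URL/pack-$name.pack"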
Re: [PATCH v2 4/4] bundle v3: the beginning
On Thu, Mar 3, 2016 at 3:32 AM, Junio C Hamano wrote:
>  - After arranging that packfile to be downloadable over popular
>    transfer methods used for serving static files (such as HTTP or
>    HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
>    bundle file (call it $name.bndl) can be prepared with an extended
>    header "data: $URL/pack-$name.pack" to point at the download
>    location for the packfile, and be served at "$URL/$name.bndl".

Extra setup to offload things to a CDN is great and all. But would it
be ok if we introduced a minimal resumable download service via
git-daemon to enable this feature with very little setup? Like
git-shell, you can only download certain packfiles for this use case
and nothing else with this service.
--
Duy
[PATCH v2 4/4] bundle v3: the beginning
The bundle v3 format introduces an ability to have the bundle header
(which describes what references in the bundled history can be
fetched, and what objects the receiving repository must have in
order to unbundle it successfully) in one file, and the bundled pack
stream data in a separate file.

A v3 bundle file begins with a line with "# v3 git bundle", followed
by zero or more "extended header" lines, and an empty line, finally
followed by the list of prerequisites and references in the same
format as a v2 bundle. If it uses the "split bundle" feature, there
is a "data: $URL" extended header line, and nothing follows the list
of prerequisites and references. Also, a "sha1: " extended header
line may exist to help validate that the pack stream data matches
the bundle header.

A typical expected use of a split bundle is to help an initial clone
that involves a huge data transfer, and would go like this:

 - Any repository people would clone and fetch from would regularly
   be repacked, and it is expected that there would be a packfile
   without prerequisites that holds all (or at least most) of its
   history (call it pack-$name.pack).

 - After arranging that packfile to be downloadable over popular
   transfer methods used for serving static files (such as HTTP or
   HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
   bundle file (call it $name.bndl) can be prepared with an extended
   header "data: $URL/pack-$name.pack" to point at the download
   location for the packfile, and be served at "$URL/$name.bndl".

 - An updated Git client, when trying to "git clone" from such a
   repository, may be redirected to "$URL/$name.bndl", which would
   be a tiny text file (when the split bundle feature is used).

 - The client would then inspect the downloaded $name.bndl, learn
   that the corresponding packfile exists at $URL/pack-$name.pack,
   and download it as pack-$name.pack, retrying until the download
   succeeds. This can easily be done with a "wget --continue"
   equivalent over an unreliable link. The checksum recorded on the
   "sha1: " header line is expected to be used by this downloader
   (not written yet).

 - After fully downloading $name.bndl and pack-$name.pack and
   storing them next to each other, the client would clone from the
   $name.bndl; this would populate the newly created repository with
   reasonably recent history.

 - Then the client can issue "git fetch" against the original
   repository to obtain the most recent part of the history created
   since the bundle was made.
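To make this concrete, a split $name.bndl under this proposal might
look like the following (all SHA-1s made up; with the split feature,
the pack data itself lives only behind the "data:" URL and nothing
follows the reference list):

    # v3 git bundle
    data: http://example.com/pack-foo.pack
    sha1: 9f3ac8a2301e0e1b97e3c5c804c6e4d201e5d4cf

    7e47fa48b3c64a0d2a66468ff3e2a4a8a9a5ed32 refs/heads/master
    2c3b9c7793f8a3c287d2a22a8c80d5d8f4336fd8 refs/tags/v1.0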
Signed-off-by: Junio C Hamano
---
 bundle.c          | 103 ++++++++++++++++++++++++++++++++++++++++++++++-----
 bundle.h          |   3 ++
 t/t5704-bundle.sh |  64 +++++++++++++++++++++++++++++++++
 3 files changed, 161 insertions(+), 9 deletions(-)

diff --git a/bundle.c b/bundle.c
index 32bdb01..480630d 100644
--- a/bundle.c
+++ b/bundle.c
@@ -10,7 +10,8 @@
 #include "refs.h"
 #include "argv-array.h"
 
-static const char bundle_signature[] = "# v2 git bundle\n";
+static const char bundle_signature_v2[] = "# v2 git bundle\n";
+static const char bundle_signature_v3[] = "# v3 git bundle\n";
 
 static void add_to_ref_list(const unsigned char *sha1, const char *name,
 		struct ref_list *list)
@@ -33,16 +34,55 @@ static int parse_bundle_header(int fd, struct bundle_header *header,
 			       int quiet)
 	int status = 0;
 
 	/* The bundle header begins with the signature */
-	if (strbuf_getwholeline_fd(&buf, fd, '\n') ||
-	    strcmp(buf.buf, bundle_signature)) {
+	if (strbuf_getwholeline_fd(&buf, fd, '\n')) {
+	bad_bundle:
 		if (!quiet)
-			error(_("'%s' does not look like a v2 bundle file"),
+			error(_("'%s' does not look like a supported bundle file"),
 			      header->filename);
 		status = -1;
 		goto abort;
 	}
 
-	/* The bundle header ends with an empty line */
+	if (!strcmp(buf.buf, bundle_signature_v2))
+		header->bundle_version = 2;
+	else if (!strcmp(buf.buf, bundle_signature_v3))
+		header->bundle_version = 3;
+	else
+		goto bad_bundle;
+
+	if (header->bundle_version == 3) {
+		/*
+		 * bundle version v3 has extended headers before the
+		 * list of prerequisites and references. The extended
+		 * headers end with an empty line.
+		 */
+		while (!strbuf_getwholeline_fd(&buf, fd, '\n')) {
+			const char *cp;
+			if (buf.len && buf.buf[buf.len - 1] == '\n')
+				buf.buf[--buf.len] = '\0';
+			if (!buf.len)
+				break;
+			if (skip_prefix(buf.buf, "data: ", &cp)) {
+				header->datafile = xstrdup(cp);
+				continue;
+
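With the patch applied, one could poke at the parser above by
hand-rolling a header; a sketch (how much of "git bundle verify"
copes with a split v3 bundle depends on the parts that are not
written yet):

    # Build the header lines the parser expects: signature, one
    # extended header, the empty line, then a reference.
    printf '%s\n' \
        '# v3 git bundle' \
        'data: http://example.com/pack-foo.pack' \
        '' \
        "$(git rev-parse HEAD) refs/heads/master" >foo.bndl
    git bundle verify foo.bndl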