Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
(It looks like I did not reply to this other email yet, sorry about this late reply.)

On Wed, Jul 12, 2017 at 9:06 PM, Jonathan Tan wrote:
> On Tue, 20 Jun 2017 09:54:34 +0200
> Christian Couder wrote:
>
>> Git can store its objects only in the form of loose objects in
>> separate files or packed objects in a pack file.
>>
>> To be able to better handle some kinds of objects, for example big
>> blobs, it would be nice if Git could store its objects in other object
>> databases (ODB).
>
> Thanks for this, and sorry for the late reply. It's good to know that
> others are thinking about "missing" objects in repos too.
>
>> - "have": the helper should respond with the sha1, size and type of
>> all the objects the external ODB contains, one object per line.
>
> This should work well if we are not caching this "have" information
> locally (that is, if the object store can be accessed with low latency),
> but I am not sure if this will work otherwise.

Yeah, there could be problems related to caching or not caching the "have" information. As a repo should not send the blobs that are in an external odb, I think it could be useful to cache the "have" information. I plan to take a look and add related tests soon.

> I see that you have
> proposed a local cache-using method later in the e-mail - my comments on
> that are below.
>
>> - "get <sha1>": the helper should then read from the external ODB
>> the content of the object corresponding to <sha1> and pass it to
>> Git.
>
> This makes sense - I have some patches [1] that implement this with the
> "fault_in" mechanism described in your e-mail.
>
> [1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/
>
>> * Transferring information
>>
>> To transfer information about the blobs stored in an external ODB, some
>> special refs, called "odb refs", similar to replace refs, are used in
>> the tests of this series, but in general nothing forces the helper to
>> use that mechanism.
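The caching Christian mentions could look something like the sketch below. All names are hypothetical and the series does not implement this; the idea is only that a local cache maps each sha1 the external ODB advertised via "have" to its size and type, so later availability checks do not have to launch the helper again.

```python
# Hypothetical sketch of caching an external ODB's "have" answer.
# Per the RFC, the helper responds with "<sha1> <size> <type>" lines;
# here a plain string stands in for the helper's output.

def parse_have_output(text):
    """Parse "have" lines into a {sha1: (size, type)} cache."""
    cache = {}
    for line in text.splitlines():
        sha1, size, obj_type = line.split()
        cache[sha1] = (int(size), obj_type)
    return cache

def cached_have(cache, sha1):
    """Answer "does the external ODB have this object?" from the
    cache, without contacting the helper again."""
    return sha1 in cache

fake_helper_output = (
    "1111111111111111111111111111111111111111 42 blob\n"
    "2222222222222222222222222222222222222222 1048576 blob\n"
)
cache = parse_have_output(fake_helper_output)
```

As Jonathan notes, whether such a cache helps depends on the latency of the external store and on keeping the cache in sync with it.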
>>
>> The external odb helper is responsible for using and creating the refs
>> in refs/odbs/<odbname>/, if it wants to do that. It is free for
>> example to just create one ref, as it is also free to create many
>> refs. Git would just transmit the refs that have been created by this
>> helper, if Git is asked to do so.
>>
>> For now in the tests there is one odb ref per blob, as it is simple
>> and as it is similar to what git-lfs does. Each ref name is
>> refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
>> in the external odb named <odbname>.
>>
>> These odb refs point to a blob that is stored in the Git
>> repository and contain information about the blob stored in the
>> external odb. This information can be specific to the external odb.
>> The repos can then share this information using commands like:
>>
>> `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
>>
>> At the end of the current patch series, "git clone" is taught a
>> "--initial-refspec" option, which asks it to first fetch some specified
>> refs. This is used in the tests to fetch the odb refs first.
>>
>> This way a single "git clone" command can set up a repo using the
>> external ODB mechanism, as long as the right helper is installed on the
>> machine and as long as the following options are used:
>>
>> - "--initial-refspec <odb refspec>" to fetch the odb refs
>> - "-c odb.<odbname>.command=<command>" to configure the helper
>
> A method like this means that information about every object is
> downloaded, regardless of which branches were actually cloned, and
> regardless of what parameters (e.g. max blob size) were used to control
> the objects that were actually cloned.
>
> We could make, say, one "odb ref" per size and branch - for example,
> "refs/odbs/master/0", "refs/odbs/master/1k", "refs/odbs/master/1m", etc.
> - and have the client know which one to download. But this wouldn't
> scale if we introduce different object filters in the clone and fetch
> commands.

Yeah, there are multiple ways to do that.
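Putting the configuration pieces together, a repo using a hypothetical external ODB named "magic" might carry something like the fragment below. The odb name and helper path are invented for illustration; only the odb.<odbname>.command key comes from the series.

```
# .git/config (sketch) - "magic" is a made-up external ODB name
[odb "magic"]
	command = /usr/local/bin/git-odb-magic
```

The corresponding one-shot clone would then combine `-c odb.magic.command=/usr/local/bin/git-odb-magic` with `--initial-refspec "refs/odbs/magic/*:refs/odbs/magic/*"`.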
> I think that it is best to have upload-pack send this information
> together with the packfile, since it knows exactly what objects were
> omitted, and therefore what information the client needs. As discussed
> in a sibling e-mail, clone/fetch already needs to be modified to omit
> objects anyway.

I try to avoid sending this information as I don't think it is necessary, and it simplifies things a lot not to have to change the communication protocol.
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
(It looks like I did not reply to this email yet, sorry about this late reply.)

On Thu, Jul 6, 2017 at 7:36 PM, Ben Peart wrote:
>
> On 7/1/2017 3:41 PM, Christian Couder wrote:
>>
>> On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart wrote:
>>>
>>> Great to see this making progress!
>>>
>>> My thoughts and questions are mostly about the overall design tradeoffs.
>>>
>>> Is your intention to enable the ODB to completely replace the regular
>>> object store or just to supplement it?
>>
>> It is to supplement it, as I think the regular object store works very
>> well most of the time.
>
> I certainly understand the desire to restrict the scope of the patch series.
> I know full replacement is a much larger problem as it would touch much more
> of the codebase.
>
> I'd still like to see an object store that was thread safe, more robust (i.e.
> transactional) and hopefully faster, so I am hoping we can design the ODB
> interface to eventually enable that.

I doubt that the way Git and the external odb helpers communicate in process mode is good enough for multi-threading, so I think this would require another communication mechanism altogether.

> For example: it seems the ODB helpers need to be able to be called before
> the regular object store in the "put" case (so they can intercept large
> objects for example) and after in the "get" case to enable "fault-in."
> Something like this:
>
> have/get
> ===
> git object store
> large object ODB helper
>
> put
> ===
> large object ODB helper
> git object store
>
> It would be nice if that order wasn't hard coded but that the order or level
> of the "git object store" could be specified using the same mechanism as
> used for the ODB helpers, so that some day you could do something like this:
>
> have/get
> ===
> "LMDB" ODB helper
> git object store
>
> put
> ===
> "LMDB" ODB helper
> git object store
>
> (and even further out, drop the current git object store completely :)).

Yeah, I understand that it could help.
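Ben's ordering tables amount to walking a prioritized list of stores, in opposite orders for "get" and "put". A sketch of that idea follows; all the class and store names are hypothetical (the series hard-codes the order rather than making it configurable), and in-memory dicts stand in for the real stores.

```python
# Sketch of configurable store ordering for "get" vs "put".
# A store only needs has/get/put; a "put" predicate lets a large-object
# helper accept only objects above a size threshold.

class MemStore:
    def __init__(self, name, accepts=lambda data: True):
        self.name = name
        self.objects = {}
        self.accepts = accepts  # "put" predicate, e.g. a size threshold

    def get(self, sha1):
        return self.objects.get(sha1)

    def put(self, sha1, data):
        if self.accepts(data):
            self.objects[sha1] = data
            return True
        return False

def odb_get(stores, sha1):
    """Try each store in "get" order; return (store name, data)."""
    for store in stores:
        data = store.get(sha1)
        if data is not None:
            return store.name, data
    return None, None

def odb_put(stores, sha1, data):
    """Offer the object to each store in "put" order; first taker wins."""
    for store in stores:
        if store.put(sha1, data):
            return store.name
    return None

# "put" order: the large-object helper gets first try, as Ben suggests.
git_store = MemStore("git object store")
large = MemStore("large object ODB helper", accepts=lambda d: len(d) > 100)
put_order = [large, git_store]
get_order = [git_store, large]
```

With this shape, swapping the position of "git object store" in the lists is all it would take to realize Ben's "LMDB first" example.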
>>> I think it would be good to ensure the
>>> interface is robust and performant enough to actually replace the current
>>> object store interface (even if we don't actually do that just yet).
>>
>> I agree that it should be robust and performant, but I don't think it
>> needs to be as performant in all cases as the current object store
>> right now.
>>
>>> Another way of asking this is: do the 3 verbs (have, get, put) and the 3
>>> types of "get" enable you to wrap the current loose object and pack file
>>> code as ODBs and run completely via the external ODB interface? If not,
>>> what is missing and can it be added?
>
> One example of what I think is missing is a way to stream objects (i.e.
> get_stream, put_stream). This isn't used often in git but it did exist last
> I checked. I'm not saying this needs to be supported in the first version -
> more if we want to support total replacement.

I agree, and it seems to me that others have already pointed out that the streaming API could be used.

> I also wonder if we'd need an "optimize" verb (for "git gc") or a "validate"
> verb (for "git fsck"). Again, only if/when we are looking at total
> replacement.

Yeah, I agree that something might be useful for these commands.

>> Right now the "put" verb only sends plain blobs, so the most logical
>> way to run completely via the external ODB interface would be to use
>> it to send and receive plain blobs. There are test scripts (t0420,
>> t0470 and t0480) that use an http server as the external ODB and all
>> the blobs are stored in it.
>>
>> And yeah for now it works only for blobs. There is a temporary patch
>> in the series that limits it to blobs. For the non-RFC patch series, I
>> think it should either use the attribute system to tell which objects
>> should be run via the external ODB interface, or perhaps there should
>> be a way to ask each external ODB helper which kinds of objects and
>> blobs it can handle. I should add that in the future work part.
>
> Sounds good. For GVFS we handle all object types (including commits and
> trees) so we would need this to be enabled so that we can switch to using it.

Ok.

>>> _Eventually_ it would be great to see the current object store(s) moved
>>> behind the new ODB interface.
>>
>> This is not one of my goals and I think it could be a problem if we
>> want to keep the "fault in" mode.
>>
>> In this mode the helper writes or reads directly to or from the
>> current object store, so it needs the current object store to be
>> available.
>
> I think implementing "fault in" should be an option that the ODB handler can
> implement but should not be required by the design/interface. As you state
> above, this could be as simple as having the ODB handler write the object to
> the git object store on "get."

This is 'get_direct' since v5 and yeah it is optional.

>> Also I think compatibility with other git implementations is important
>> and it is a good thing that
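The "fault in" / 'get_direct' flow discussed above can be sketched as follows. The names are hypothetical and a dict stands in for .git/objects (a real helper would write an actual loose object); the point is only the control flow: the helper deposits the object directly into the local store, and Git then retries the normal local lookup.

```python
# Sketch of the "fault in" (get_direct) flow: the helper writes the
# missing object into the local store, then Git retries locally.

local_store = {}                      # stands in for .git/objects
remote_store = {"abc123": b"blob held only by the external ODB"}

def helper_fault_in(sha1):
    """Hypothetical helper: copy the object into the local store."""
    if sha1 in remote_store:
        local_store[sha1] = remote_store[sha1]
        return True
    return False

def read_object(sha1):
    """Git's read path: local first, then fault in and retry."""
    if sha1 in local_store:
        return local_store[sha1]
    if helper_fault_in(sha1):
        return local_store[sha1]      # the retry now succeeds
    raise KeyError(sha1)
```

This also shows why Ben's point holds: nothing in the design forces a helper to implement this; a helper without fault-in would simply answer "get" directly instead.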
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On Tue, 20 Jun 2017 09:54:34 +0200 Christian Couder wrote:

> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
>
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).

Thanks for this, and sorry for the late reply. It's good to know that others are thinking about "missing" objects in repos too.

> - "have": the helper should respond with the sha1, size and type of
> all the objects the external ODB contains, one object per line.

This should work well if we are not caching this "have" information locally (that is, if the object store can be accessed with low latency), but I am not sure if this will work otherwise. I see that you have proposed a local cache-using method later in the e-mail - my comments on that are below.

> - "get <sha1>": the helper should then read from the external ODB
> the content of the object corresponding to <sha1> and pass it to
> Git.

This makes sense - I have some patches [1] that implement this with the "fault_in" mechanism described in your e-mail.

[1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/

> * Transferring information
>
> To transfer information about the blobs stored in an external ODB, some
> special refs, called "odb refs", similar to replace refs, are used in
> the tests of this series, but in general nothing forces the helper to
> use that mechanism.
>
> The external odb helper is responsible for using and creating the refs
> in refs/odbs/<odbname>/, if it wants to do that. It is free for
> example to just create one ref, as it is also free to create many
> refs. Git would just transmit the refs that have been created by this
> helper, if Git is asked to do so.
>
> For now in the tests there is one odb ref per blob, as it is simple
> and as it is similar to what git-lfs does.
> Each ref name is
> refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
> in the external odb named <odbname>.
>
> These odb refs point to a blob that is stored in the Git
> repository and contain information about the blob stored in the
> external odb. This information can be specific to the external odb.
> The repos can then share this information using commands like:
>
> `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
>
> At the end of the current patch series, "git clone" is taught a
> "--initial-refspec" option, which asks it to first fetch some specified
> refs. This is used in the tests to fetch the odb refs first.
>
> This way a single "git clone" command can set up a repo using the
> external ODB mechanism, as long as the right helper is installed on the
> machine and as long as the following options are used:
>
> - "--initial-refspec <odb refspec>" to fetch the odb refs
> - "-c odb.<odbname>.command=<command>" to configure the helper

A method like this means that information about every object is downloaded, regardless of which branches were actually cloned, and regardless of what parameters (e.g. max blob size) were used to control the objects that were actually cloned.

We could make, say, one "odb ref" per size and branch - for example, "refs/odbs/master/0", "refs/odbs/master/1k", "refs/odbs/master/1m", etc. - and have the client know which one to download. But this wouldn't scale if we introduce different object filters in the clone and fetch commands.

I think that it is best to have upload-pack send this information together with the packfile, since it knows exactly what objects were omitted, and therefore what information the client needs. As discussed in a sibling e-mail, clone/fetch already needs to be modified to omit objects anyway.
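Jonathan's "one odb ref per size and branch" idea could be sketched like this. It is entirely hypothetical (he raises it only to argue it would not scale, and nothing in the series implements it): objects are bucketed by size, and each bucket gets its own odb ref under the branch.

```python
# Sketch of the hypothetical "one odb ref per size and branch" scheme:
# refs/odbs/<branch>/<bucket>, with buckets named 0, 1k, 1m, ...

BUCKETS = [(1024 * 1024, "1m"), (1024, "1k"), (0, "0")]

def size_bucket(size):
    """Pick the largest bucket whose threshold the object size reaches."""
    for threshold, label in BUCKETS:
        if size >= threshold:
            return label
    return "0"

def odb_ref(branch, size):
    return "refs/odbs/%s/%s" % (branch, size_bucket(size))
```

A client that only wants blobs up to 1k would then fetch only "refs/odbs/master/0" and "refs/odbs/master/1k" - which is exactly the kind of enumerated-bucket scheme that stops scaling once clone/fetch grow arbitrary object filters.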
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On 7/1/2017 3:41 PM, Christian Couder wrote:

On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart wrote:

On 6/20/2017 3:54 AM, Christian Couder wrote:

To be able to better handle some kinds of objects, for example big blobs, it would be nice if Git could store its objects in other object databases (ODB).

To do that, this patch series makes it possible to register commands, also called "helpers", using "odb.<odbname>.command" config variables, to access external ODBs where objects can be stored and retrieved.

External ODBs should be able to transfer information about the blobs they store. This patch series shows how this is possible using a kind of replace refs.

Great to see this making progress!

My thoughts and questions are mostly about the overall design tradeoffs.

Is your intention to enable the ODB to completely replace the regular object store or just to supplement it?

It is to supplement it, as I think the regular object store works very well most of the time.

I certainly understand the desire to restrict the scope of the patch series. I know full replacement is a much larger problem as it would touch much more of the codebase.

I'd still like to see an object store that was thread safe, more robust (i.e. transactional) and hopefully faster, so I am hoping we can design the ODB interface to eventually enable that.

For example: it seems the ODB helpers need to be able to be called before the regular object store in the "put" case (so they can intercept large objects for example) and after in the "get" case to enable "fault-in."
Something like this:

have/get
===
git object store
large object ODB helper

put
===
large object ODB helper
git object store

It would be nice if that order wasn't hard coded but that the order or level of the "git object store" could be specified using the same mechanism as used for the ODB helpers, so that some day you could do something like this:

have/get
===
"LMDB" ODB helper
git object store

put
===
"LMDB" ODB helper
git object store

(and even further out, drop the current git object store completely :)).

I think it would be good to ensure the interface is robust and performant enough to actually replace the current object store interface (even if we don't actually do that just yet).

I agree that it should be robust and performant, but I don't think it needs to be as performant in all cases as the current object store right now.

Another way of asking this is: do the 3 verbs (have, get, put) and the 3 types of "get" enable you to wrap the current loose object and pack file code as ODBs and run completely via the external ODB interface? If not, what is missing and can it be added?

One example of what I think is missing is a way to stream objects (i.e. get_stream, put_stream). This isn't used often in git but it did exist last I checked. I'm not saying this needs to be supported in the first version - more if we want to support total replacement.

I also wonder if we'd need an "optimize" verb (for "git gc") or a "validate" verb (for "git fsck"). Again, only if/when we are looking at total replacement.

Right now the "put" verb only sends plain blobs, so the most logical way to run completely via the external ODB interface would be to use it to send and receive plain blobs. There are test scripts (t0420, t0470 and t0480) that use an http server as the external ODB and all the blobs are stored in it.

And yeah for now it works only for blobs. There is a temporary patch in the series that limits it to blobs.
For the non-RFC patch series, I think it should either use the attribute system to tell which objects should be run via the external ODB interface, or perhaps there should be a way to ask each external ODB helper which kinds of objects and blobs it can handle. I should add that in the future work part.

Sounds good. For GVFS we handle all object types (including commits and trees) so we would need this to be enabled so that we can switch to using it.

_Eventually_ it would be great to see the current object store(s) moved behind the new ODB interface.

This is not one of my goals and I think it could be a problem if we want to keep the "fault in" mode. In this mode the helper writes or reads directly to or from the current object store, so it needs the current object store to be available.

I think implementing "fault in" should be an option that the ODB handler can implement but should not be required by the design/interface. As you state above, this could be as simple as having the ODB handler write the object to the git object store on "get."

Also I think compatibility with other git implementations is important and it is a good thing that they can all work on a common repository format.

I agree this should be an option but I don't want to say we'll _never_ move to a better object store.

When there are multiple ODB providers, what is the order they are called? The
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
Christian Couder writes:

> On Sat, Jul 1, 2017 at 10:33 PM, Junio C Hamano wrote:
>> Christian Couder writes:
>>
>>>> I think it would be good to ensure the interface is robust and
>>>> performant enough to actually replace the current object store
>>>> interface (even if we don't actually do that just yet).
>>>
>>> I agree that it should be robust and performant, but I don't think it
>>> needs to be as performant in all cases as the current object store
>>> right now.
>>
>> That sounds like starting from a defeatist position. Is there a
>> reason why you think using an external interface could never perform
>> well enough to be usable in everyday work?
>
> Perhaps in the future we will be able to make it as performant as, or
> perhaps even more performant than, the current object store, but in
> the current implementation the following issues mean that it will be
> less performant

That might be an answer to a different question; I was hoping to hear that it should be performant enough for everyday work, but never thought it would perform as well as local disk. I haven't used a network filesystem in quite a while, but a repository on NFS may still be usable, and we know our own access pattern better than NFS, which cannot anticipate which paths its client's next operations will touch, so it is not inconceivable that a well designed external object database interface would let us outperform the "repo on NFS" scenario.
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On Sat, Jul 1, 2017 at 10:33 PM, Junio C Hamano wrote:
> Christian Couder writes:
>
>>> I think it would be good to ensure the
>>> interface is robust and performant enough to actually replace the current
>>> object store interface (even if we don't actually do that just yet).
>>
>> I agree that it should be robust and performant, but I don't think it
>> needs to be as performant in all cases as the current object store
>> right now.
>
> That sounds like starting from a defeatist position. Is there a
> reason why you think using an external interface could never perform
> well enough to be usable in everyday work?

Perhaps in the future we will be able to make it as performant as, or perhaps even more performant than, the current object store, but in the current implementation the following issues mean that it will be less performant:

- The external object stores are searched for an object after the object has not been found in the current object store. This means that searching for an object will be slower if the object is in an external object store. To overcome this the "have" information (when the external helper implements it) could be merged with information about what objects are in the current object store, for example in a big table or bitmap, so that only one lookup in this table or bitmap would be needed to know if an object is available and in which object store it is. But I really don't want to get into this right now.

- When an external odb helper retrieves an object and passes it to Git, Git (or the helper itself in "fault in" mode) then stores the object in the current object store. This is because we assume that it will be faster to retrieve it again if it is cached in the current object store. There could be a capability that asks Git to not cache the objects that are retrieved from the external odb, but again I don't think it is necessary at all to implement this right now.
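The "big table or bitmap" idea above could be sketched as follows. It is hypothetical (nothing in the series builds such a table); the point is that one merged map answers both "is the object available?" and "which store has it?" in a single lookup, instead of probing the local store and then each helper in turn.

```python
# Sketch of merging local object info with external helpers' "have"
# answers into a single lookup table: sha1 -> store name.

def build_lookup(local_sha1s, external_haves):
    """external_haves maps an odb name to the sha1s it advertised."""
    table = {sha1: "local" for sha1 in local_sha1s}
    for odbname, sha1s in external_haves.items():
        for sha1 in sha1s:
            table.setdefault(sha1, odbname)  # a local copy wins
    return table

table = build_lookup(
    local_sha1s={"aaa", "bbb"},
    external_haves={"magic": {"bbb", "ccc"}},
)
```

As Christian notes, the hard part is not the lookup itself but keeping such a merged structure up to date as both stores change.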
I still think though that in some cases, like when the external odb is used to implement a bundle clone, using the external odb mechanism can already be more performant.
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
Christian Couder writes:

>> I think it would be good to ensure the
>> interface is robust and performant enough to actually replace the current
>> object store interface (even if we don't actually do that just yet).
>
> I agree that it should be robust and performant, but I don't think it
> needs to be as performant in all cases as the current object store
> right now.

That sounds like starting from a defeatist position. Is there a reason why you think using an external interface could never perform well enough to be usable in everyday work?
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On Sat, Jul 1, 2017 at 9:41 PM, Christian Couder wrote:
> On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart wrote:
>> The fact that "git clone" is taught a "--initial-refspec" option indicates
>> this isn't just an ODB implementation detail. Is there a general capability
>> that is missing from the ODB interface that needs to be addressed here?
>
> Technically you don't need to teach `git clone` the --initial-refspec
> option to make it work.
> It can work like this:
>
> $ git init
> $ git remote add origin <url>
> $ git fetch origin <odb refspec>
> $ git config odb.<odbname>.command <command>
> $ git fetch origin
>
> But it is much simpler for the user to instead just do:
>
> $ git clone -c odb.<odbname>.command=<command> --initial-refspec <odb refspec> <url>
>
> I also think that the --initial-refspec option could perhaps be useful
> for other kinds of refs, for example tags, notes or replace refs, to
> make sure that those refs are fetched first and that hooks can use
> them when fetching other refs like branches in the later part of the
> clone.

Actually I am not sure that it's possible to set up hooks per se before or while cloning, but perhaps there are other kinds of scripts or git commands that could trigger and use the refs that have been fetched first.
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart wrote:
>
> On 6/20/2017 3:54 AM, Christian Couder wrote:
>> To be able to better handle some kinds of objects, for example big
>> blobs, it would be nice if Git could store its objects in other object
>> databases (ODB).
>>
>> To do that, this patch series makes it possible to register commands,
>> also called "helpers", using "odb.<odbname>.command" config variables,
>> to access external ODBs where objects can be stored and retrieved.
>>
>> External ODBs should be able to transfer information about the blobs
>> they store. This patch series shows how this is possible using a kind of
>> replace refs.
>
> Great to see this making progress!
>
> My thoughts and questions are mostly about the overall design tradeoffs.
>
> Is your intention to enable the ODB to completely replace the regular object
> store or just to supplement it?

It is to supplement it, as I think the regular object store works very well most of the time.

> I think it would be good to ensure the
> interface is robust and performant enough to actually replace the current
> object store interface (even if we don't actually do that just yet).

I agree that it should be robust and performant, but I don't think it needs to be as performant in all cases as the current object store right now.

> Another way of asking this is: do the 3 verbs (have, get, put) and the 3
> types of "get" enable you to wrap the current loose object and pack file
> code as ODBs and run completely via the external ODB interface? If not,
> what is missing and can it be added?

Right now the "put" verb only sends plain blobs, so the most logical way to run completely via the external ODB interface would be to use it to send and receive plain blobs. There are test scripts (t0420, t0470 and t0480) that use an http server as the external ODB and all the blobs are stored in it.

And yeah for now it works only for blobs. There is a temporary patch in the series that limits it to blobs.
For the non-RFC patch series, I think it should either use the attribute system to tell which objects should be run via the external ODB interface, or perhaps there should be a way to ask each external ODB helper which kinds of objects and blobs it can handle. I should add that in the future work part.

> _Eventually_ it would be great to see the current object store(s) moved
> behind the new ODB interface.

This is not one of my goals and I think it could be a problem if we want to keep the "fault in" mode. In this mode the helper writes or reads directly to or from the current object store, so it needs the current object store to be available.

Also I think compatibility with other git implementations is important and it is a good thing that they can all work on a common repository format.

> When there are multiple ODB providers, what is the order they are called?

The external_odb_config() function creates the helpers for the external ODBs in the order they are found in the config file, and then these helpers are called in turn in the same order.

> If one fails a request (get, have, put) are the others called to see if they
> can fulfill the request?

Yes, but there are no tests to check that it works well. I will need to add some.

> Can the order they are called for various verbs be configured explicitly?

Right now, you can configure the order by changing the config file, but the order will be the same for all the verbs.

> For example, it would be nice to have a "large object ODB handler" configured
> to get first try at all "put" verbs. Then if it meets its size requirements,
> it will handle the verb, otherwise it will fail and git will try the other ODBs.

This can work if the "large object ODB handler" is configured first. Also this is linked with how you define which objects are handled by which helper.
For example if the attribute system is used to describe which external ODB is used for which files, there could be a way to tell for example that blobs larger than 1MB are handled by the "large object ODB handler" while those that are smaller are handled by another helper.

>> Design
>> ~~~~~~
>>
>> * The "helpers" (registered commands)
>>
>> Each helper manages access to one external ODB.
>>
>> There are now 2 different modes for helpers:
>>
>> - When "odb.<odbname>.scriptMode" is set to "true", the helper is
>> launched each time Git wants to communicate with the
>> external ODB.
>>
>> - When "odb.<odbname>.scriptMode" is not set or set to "false", then
>> the helper is launched once as a sub-process (using
>> sub-process.h), and Git communicates with it using packet lines.
>
> Is it worth supporting two different modes long term? It seems that this
> could be simplified (less code to write, debug, document, support) by only
> supporting the 2nd that uses the sub-process. As far as I can tell, the
> capabilities are the same, it's just that the second one is more performant
> when multiple calls are made.

Yeah,
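A script-mode helper along the lines discussed above could be as small as the sketch below. The store contents and response formats are illustrative only (the exact protocol lives in the series' test scripts, not reproduced here); in script mode, each invocation of the helper handles a single instruction.

```python
# Minimal sketch of a script-mode helper's dispatch: one instruction
# per launch. A dict stands in for the external store; a real helper
# would stream object contents over stdout/stdin.

STORE = {"f00dbabe": (b"hello, odb\n", "blob")}

def handle(instruction):
    """Return the helper's textual response for one instruction."""
    parts = instruction.split()
    if parts[0] == "have":
        return "".join(
            "%s %d %s\n" % (sha1, len(data), obj_type)
            for sha1, (data, obj_type) in sorted(STORE.items())
        )
    if parts[0] == "get":
        data, _ = STORE[parts[1]]
        return data.decode()
    if parts[0] == "put":
        STORE[parts[1]] = (b"(object read from Git)", "blob")
        return ""
    raise ValueError("unsupported instruction: %s" % parts[0])
```

Ben's simplification argument shows up clearly here: the dispatch logic is identical in sub-process mode; only the process lifetime and the framing around each instruction differ.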
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On 6/20/2017 3:54 AM, Christian Couder wrote:

Goal
~~~~

Git can store its objects only in the form of loose objects in separate files or packed objects in a pack file.

To be able to better handle some kinds of objects, for example big blobs, it would be nice if Git could store its objects in other object databases (ODB).

To do that, this patch series makes it possible to register commands, also called "helpers", using "odb.<odbname>.command" config variables, to access external ODBs where objects can be stored and retrieved.

External ODBs should be able to transfer information about the blobs they store. This patch series shows how this is possible using a kind of replace refs.

Great to see this making progress!

My thoughts and questions are mostly about the overall design tradeoffs.

Is your intention to enable the ODB to completely replace the regular object store or just to supplement it? I think it would be good to ensure the interface is robust and performant enough to actually replace the current object store interface (even if we don't actually do that just yet).

Another way of asking this is: do the 3 verbs (have, get, put) and the 3 types of "get" enable you to wrap the current loose object and pack file code as ODBs and run completely via the external ODB interface? If not, what is missing and can it be added?

_Eventually_ it would be great to see the current object store(s) moved behind the new ODB interface.

When there are multiple ODB providers, what is the order they are called? If one fails a request (get, have, put) are the others called to see if they can fulfill the request? Can the order they are called for various verbs be configured explicitly? For example, it would be nice to have a "large object ODB handler" configured to get first try at all "put" verbs. Then if it meets its size requirements, it will handle the verb, otherwise it will fail and git will try the other ODBs.
Design
~~~~~~

* The "helpers" (registered commands)

Each helper manages access to one external ODB.

There are now 2 different modes for helpers:

- When "odb.<odbname>.scriptMode" is set to "true", the helper is launched each time Git wants to communicate with the external ODB.

- When "odb.<odbname>.scriptMode" is not set or set to "false", then the helper is launched once as a sub-process (using sub-process.h), and Git communicates with it using packet lines.

Is it worth supporting two different modes long term? It seems that this could be simplified (less code to write, debug, document, support) by only supporting the 2nd that uses the sub-process. As far as I can tell, the capabilities are the same, it's just that the second one is more performant when multiple calls are made.

A helper can be given different instructions by Git. The instructions that are supported are negotiated at the beginning of the communication using a capability mechanism. For now the following instructions are supported:

- "have": the helper should respond with the sha1, size and type of all the objects the external ODB contains, one object per line.

- "get <sha1>": the helper should then read from the external ODB the content of the object corresponding to <sha1> and pass it to Git.

- "put <sha1> <size> <type>": the helper should then read from Git an object and store it in the external ODB.

Currently "have" and "put" are optional.

It's good the various verbs can be optional. That way any particular ODB only has to handle those it needs to provide a different behavior for.

There are 3 different kinds of "get" instructions depending on how the helper passes objects to Git:

- "fault_in": the helper will write the requested objects directly into the regular Git object database, and then Git will retry reading it from there.

I think the "fault_in" behavior can be implemented efficiently without the overhead of a 3rd special "get" instruction if we enable some of the other capabilities discussed.
For example, assume an ODB is set up to handle missing objects (by registering itself as "last" in the prioritized list of ODB handlers). If it is ever asked to retrieve a missing object, it can retrieve the object, return it as a "git_object" or "plain_object", and also cache it locally as a loose object, a pack file, or via any other ODB-handler-supported mechanism. Future requests will then get that object via the locally cached copy and its associated ODB handler.

> - "git_object": the helper will send the object as a Git object.
>
> - "plain_object": the helper will send the object (a blob) as a raw
>   object. (The blob content will be sent as is.)
>
> For now the kind of "get" that is supported is read from the
> "odb.<odbname>.fetchKind" configuration variable, but in the future
> it should be decided as part of the capability negotiation.

I agree it makes sense to move this into the capability negotiation, but I also wonder if we really need to support both. Is
Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
On Tue, Jun 20, 2017 at 9:54 AM, Christian Couder wrote:
>
> Future work
> ~~~~~~~~~~~
>
> First, sorry about the state of this patch series; it is not as clean
> as I would have liked, but I think it is interesting to get feedback
> from the mailing list at this point, because the previous RFC was
> sent a long time ago and a lot of things have changed.
>
> So a big part of the future work will be about cleaning up this patch
> series.
>
> Other things I think I am going to do:
>
> -

Ooops, I had not saved my emacs buffer where I wrote this when I sent the patch series. This should have been:

Other things I think I may work on:

- Remove the "odb.<odbname>.scriptMode" and "odb.<odbname>.command" options and instead have just "odb.<odbname>.scriptCommand" and "odb.<odbname>.subprocessCommand".

- Use capabilities instead of "odb.<odbname>.fetchKind" to decide which kind of "get" will be used.

- Better test all the combinations of the above modes with and without the "have" and "put" instructions.

- Maybe also have different kinds of "put", so that Git could pass either a Git object or a plain object, or ask the helper to retrieve it directly from Git's object database.

- Maybe add an "init" instruction, as the script mode has something like this called "get_cap", and it would help the sub-process mode too, as it makes it possible for Git to know the capabilities before trying to send any instruction (that might not be supported by the helper). The "init" instruction would be the only instruction every helper would be required to implement.

- Add more long running tests and improve tests in general.
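Under the proposed renaming, configuration for the two launch modes might look roughly like this (the odb name "magic" and the helper paths are made up, and the option names are the ones proposed above, not yet merged):

```
[odb "magic"]
	# Script mode: the helper is launched once per request.
	scriptCommand = /usr/local/bin/magic-odb-script
	# Or, sub-process mode: the helper is launched once and speaks
	# packet lines over stdin/stdout.
	#subprocessCommand = /usr/local/bin/magic-odb-daemon
```

Picking the mode from which key is set would then replace the current boolean "odb.<odbname>.scriptMode" switch.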
[RFC/PATCH v4 00/49] Add initial experimental external ODB support
Goal
~~~~

Git can store its objects only in the form of loose objects in separate files or packed objects in a pack file.

To be able to better handle some kinds of objects, for example big blobs, it would be nice if Git could store its objects in other object databases (ODB).

To do that, this patch series makes it possible to register commands, also called "helpers", using "odb.<odbname>.command" config variables, to access external ODBs where objects can be stored and retrieved.

External ODBs should be able to transfer information about the blobs they store. This patch series shows how this is possible using something like replace refs.

Design
~~~~~~

* The "helpers" (registered commands)

Each helper manages access to one external ODB. There are now 2 different modes for helpers:

- When "odb.<odbname>.scriptMode" is set to "true", the helper is launched each time Git wants to communicate with the external ODB.

- When "odb.<odbname>.scriptMode" is not set or set to "false", then the helper is launched once as a sub-process (using sub-process.h), and Git communicates with it using packet lines.

A helper can be given different instructions by Git. The instructions that are supported are negotiated at the beginning of the communication using a capability mechanism. For now the following instructions are supported:

- "have": the helper should respond with the sha1, size and type of all the objects the external ODB contains, one object per line.

- "get <sha1>": the helper should then read from the external ODB the content of the object corresponding to <sha1> and pass it to Git.

- "put <sha1>": the helper should then read from Git an object and store it in the external ODB.

Currently "have" and "put" are optional.

There are 3 different kinds of "get" instructions depending on how the helper passes objects to Git:

- "fault_in": the helper will write the requested objects directly into the regular Git object database, and then Git will retry reading them from there.

- "git_object": the helper will send the object as a Git object.
- "plain_object": the helper will send the object (a blob) as a raw object. (The blob content will be sent as is.) For now the kind of "get" that is supported is read from the "odb..fetchKind" configuration variable, but in the future it should be decided as part of the capability negociation. * Transfering information To tranfer information about the blobs stored in external ODB, some special refs, called "odb ref", similar as replace refs, are used in the tests of this series, but in general nothing forces the helper to use that mechanism. The external odb helper is responsible for using and creating the refs in refs/odbs//, if it wants to do that. It is free for example to just create one ref, as it is also free to create many refs. Git would just transmit the refs that have been created by this helper, if Git is asked to do so. For now in the tests there is one odb ref per blob, as it is simple and as it is similar to what git-lfs does. Each ref name is refs/odbs// where is the sha1 of the blob stored in the external odb named . These odb refs point to a blob that is stored in the Git repository and contain information about the blob stored in the external odb. This information can be specific to the external odb. The repos can then share this information using commands like: `git fetch origin "refs/odbs//*:refs/odbs//*"` At the end of the current patch series, "git clone" is teached a "--initial-refspec" option, that asks it to first fetch some specified refs. This is used in the tests to fetch the odb refs first. This way only one "git clone" command can setup a repo using the external ODB mechanism as long as the right helper is installed on the machine and as long as the following options are used: - "--initial-refspec " to fetch the odb refspec - "-c odb..command=" to configure the helper There is also a test script that shows that the "--initial-refspec" option along with the external ODB mechanism can be used to implement cloning using bundles. 
* External object database

This RFC patch series shows in the tests:

- how to use another git repository as an external ODB (storing Git objects)

- how to use an http server as an external ODB (storing plain objects)

(This works in both script mode and sub-process mode.)

* Performance

So the sub-process mode, which is now the default, has been implemented in this new version of the patch series. This has been implemented using the refactoring that Ben Peart did on top of Lars Schneider's work on using sub-processes and packet lines in the smudge/clean filters for git-lfs. This also uses further work from Ben Peart called "read object process". See:

http://public-inbox.org/git/20170113155253.1644-1-benpe...@microsoft.com/
http://public-inbox.org/git/20170322165220.5660-1-benpe...@microsoft.com/

Thanks to this, the external ODB mechanism should in the end perform as well as the