Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-09-15 Thread Christian Couder
(It looks like I did not reply to this other email yet, sorry about
this late reply.)

On Wed, Jul 12, 2017 at 9:06 PM, Jonathan Tan  wrote:
> On Tue, 20 Jun 2017 09:54:34 +0200
> Christian Couder  wrote:
>
>> Git can store its objects only in the form of loose objects in
>> separate files or packed objects in a pack file.
>>
>> To be able to better handle some kinds of objects, for example big
>> blobs, it would be nice if Git could store its objects in other object
>> databases (ODB).
>
> Thanks for this, and sorry for the late reply. It's good to know that
> others are thinking about "missing" objects in repos too.
>
>>   - "have": the helper should respond with the sha1, size and type of
>> all the objects the external ODB contains, one object per line.
>
> This should work well if we are not caching this "have" information
> locally (that is, if the object store can be accessed with low latency),
> but I am not sure if this will work otherwise.

Yeah, there could be problems related to caching or not caching the
"have" information.
As a repo should not send the blobs that are in an external odb, I
think it could be useful to cache the "have" information.
I plan to take a look and add related tests soon.

> I see that you have
> proposed a local cache-using method later in the e-mail - my comments on
> that are below.
>
>>   - "get ": the helper should then read from the external ODB
>> the content of the object corresponding to  and pass it to
>> Git.
>
> This makes sense - I have some patches [1] that implement this with the
> "fault_in" mechanism described in your e-mail.
>
> [1] 
> https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/
>
>> * Transferring information
>>
>> To transfer information about the blobs stored in an external ODB, some
>> special refs, called "odb refs", similar to replace refs, are used in
>> the tests of this series, but in general nothing forces the helper to
>> use that mechanism.
>>
>> The external odb helper is responsible for using and creating the refs
>> in refs/odbs/<odbname>/, if it wants to do that. It is free, for
>> example, to create just one ref, as it is also free to create many
>> refs. Git would just transmit the refs that have been created by this
>> helper, if Git is asked to do so.
>>
>> For now in the tests there is one odb ref per blob, as it is simple
>> and as it is similar to what git-lfs does. Each ref name is
>> refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
>> in the external odb named <odbname>.
>>
>> These odb refs point to a blob that is stored in the Git
>> repository and contain information about the blob stored in the
>> external odb. This information can be specific to the external odb.
>> The repos can then share this information using commands like:
>>
>> `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
>>
>> At the end of the current patch series, "git clone" is taught a
>> "--initial-refspec" option, that asks it to first fetch some specified
>> refs. This is used in the tests to fetch the odb refs first.
>>
>> This way a single "git clone" command can set up a repo using the
>> external ODB mechanism, as long as the right helper is installed on the
>> machine and as long as the following options are used:
>>
>>   - "--initial-refspec " to fetch the odb refspec
>>   - "-c odb..command=" to configure the helper
>
> A method like this means that information about every object is
> downloaded, regardless of which branches were actually cloned, and
> regardless of what parameters (e.g. max blob size) were used to control
> the objects that were actually cloned.
>
> We could make, say, one "odb ref" per size and branch - for example,
> "refs/odbs/master/0", "refs/odbs/master/1k", "refs/odbs/master/1m", etc.
> - and have the client know which one to download. But this wouldn't
> scale if we introduce different object filters in the clone and fetch
> commands.

Yeah, there are multiple ways to do that.

> I think that it is best to have upload-pack send this information
> together with the packfile, since it knows exactly what objects were
> omitted, and therefore what information the client needs. As discussed
> in a sibling e-mail, clone/fetch already needs to be modified to omit
> objects anyway.

I am trying to avoid sending this information, as I don't think it is
necessary, and it simplifies things a lot not to have to change the
communication protocol.


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-09-15 Thread Christian Couder
(It looks like I did not reply to this email yet, sorry about this late reply.)

On Thu, Jul 6, 2017 at 7:36 PM, Ben Peart  wrote:
>
> On 7/1/2017 3:41 PM, Christian Couder wrote:
>>
>> On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart  wrote:
>>>
>>> Great to see this making progress!
>>>
>>> My thoughts and questions are mostly about the overall design tradeoffs.
>>>
>>> Is your intention to enable the ODB to completely replace the regular
>>> object
>>> store or just to supplement it?
>>
>> It is to supplement it, as I think the regular object store works very
>> well most of the time.
>
> I certainly understand the desire to restrict the scope of the patch series.
> I know full replacement is a much larger problem as it would touch much more
> of the codebase.
>
> I'd still like to see an object store that was thread safe, more robust (ie
> transactional) and hopefully faster so I am hoping we can design the ODB
> interface to eventually enable that.

I doubt that the way Git and the external odb helpers communicate in
process mode is good enough for multi-threading, so I think this would
require another communication mechanism altogether.

> For example: it seems the ODB helpers need to be able to be called before
> the regular object store in the "put" case (so they can intercept large
> objects for example) and after it in the "get" case to enable "fault-in."
> Something like this:
>
> have/get
> 
> git object store
> large object ODB helper
>
> put
> ===
> large object ODB helper
> git object store
>
> It would be nice if that order wasn't hard coded but that the order or level
> of the "git object store" could be specified using the same mechanism as
> used for the ODB helpers so that some day you could do something like this:
>
> have/get
> 
> "LMDB" ODB helper
> git object store
>
> put
> ===
> "LMDB" ODB helper
> git object store
>
> (and even further out, drop the current git object store completely :)).

Yeah, I understand that it could help.

>>> I think it would be good to ensure the
>>> interface is robust and performant enough to actually replace the current
>>> object store interface (even if we don't actually do that just yet).
>>
>>
>> I agree that it should be robust and performant, but I don't think it
>> needs to be as performant in all cases as the current object store
>> right now.
>>
>>> Another way of asking this is: do the 3 verbs (have, get, put) and the 3
>>> types of "get" enable you to wrap the current loose object and pack file
>>> code as ODBs and run completely via the external ODB interface?  If not,
>>> what is missing and can it be added?
>
> One example of what I think is missing is a way to stream objects (ie
> get_stream, put_stream).  This isn't used often in git but it did exist last
> I checked.  I'm not saying this needs to be supported in the first version -
> more if we want to support total replacement.

I agree, and it seems to me that others have already pointed out that
the streaming API could be used.

> I also wonder if we'd need an "optimize" verb (for "git gc") or a "validate"
> verb (for "git fsck").  Again, only if/when we are looking at total
> replacement.

Yeah, I agree that something might be useful for these commands.

>> Right now the "put" verb only send plain blobs, so the most logical
>> way to run completely via the external ODB interface would be to use
>> it to send and receive plain blobs. There are tests scripts (t0420,
>> t0470 and t0480) that use an http server as the external ODB and all
>> the blobs are stored in it.
>>
>> And yeah for now it works only for blobs. There is a temporary patch
>> in the series that limits it to blobs. For the non-RFC patch series, I
>> think it should either use the attribute system to tell which objects
>> should be run via the external ODB interface, or perhaps there should
>> be a way to ask each external ODB helper which kinds of objects it
>> can handle. I should add that in the future work part.
>
> Sounds good.  For GVFS we handle all object types (including commits and
> trees) so would need this to be enabled so that we can switch to using it.

Ok.

>>> _Eventually_ it would be great to see the current object store(s) moved
>>> behind the new ODB interface.
>>
>> This is not one of my goals and I think it could be a problem if we
>> want to keep the "fault in" mode.
>
>> In this mode the helper writes or reads directly to or from the
>> current object store, so it needs the current object store to be
>> available.
>
> I think implementing "fault in" should be an option that the ODB handler can
> implement but should not be required by the design/interface.  As you state
> above, this could be as simple as having the ODB handler write the object to
> the git object store on "get."

This is 'get_direct' since v5 and yeah it is optional.

>> Also I think compatibility with other git implementations is important
>> and it is a good thing that they can all work on a common repository
>> format.

Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-12 Thread Jonathan Tan
On Tue, 20 Jun 2017 09:54:34 +0200
Christian Couder  wrote:

> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
> 
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).

Thanks for this, and sorry for the late reply. It's good to know that
others are thinking about "missing" objects in repos too.

>   - "have": the helper should respond with the sha1, size and type of
> all the objects the external ODB contains, one object per line.

This should work well if we are not caching this "have" information
locally (that is, if the object store can be accessed with low latency),
but I am not sure if this will work otherwise. I see that you have
proposed a local cache-using method later in the e-mail - my comments on
that are below.

>   - "get ": the helper should then read from the external ODB
> the content of the object corresponding to  and pass it to
> Git.

This makes sense - I have some patches [1] that implement this with the
"fault_in" mechanism described in your e-mail.

[1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/

> * Transferring information
> 
> To transfer information about the blobs stored in an external ODB, some
> special refs, called "odb refs", similar to replace refs, are used in
> the tests of this series, but in general nothing forces the helper to
> use that mechanism.
> 
> The external odb helper is responsible for using and creating the refs
> in refs/odbs/<odbname>/, if it wants to do that. It is free, for
> example, to create just one ref, as it is also free to create many
> refs. Git would just transmit the refs that have been created by this
> helper, if Git is asked to do so.
> 
> For now in the tests there is one odb ref per blob, as it is simple
> and as it is similar to what git-lfs does. Each ref name is
> refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
> in the external odb named <odbname>.
> 
> These odb refs point to a blob that is stored in the Git
> repository and contain information about the blob stored in the
> external odb. This information can be specific to the external odb.
> The repos can then share this information using commands like:
> 
> `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
> 
> At the end of the current patch series, "git clone" is taught a
> "--initial-refspec" option, that asks it to first fetch some specified
> refs. This is used in the tests to fetch the odb refs first.
> 
> This way a single "git clone" command can set up a repo using the
> external ODB mechanism, as long as the right helper is installed on the
> machine and as long as the following options are used:
> 
>   - "--initial-refspec " to fetch the odb refspec
>   - "-c odb..command=" to configure the helper

A method like this means that information about every object is
downloaded, regardless of which branches were actually cloned, and
regardless of what parameters (e.g. max blob size) were used to control
the objects that were actually cloned.

We could make, say, one "odb ref" per size and branch - for example,
"refs/odbs/master/0", "refs/odbs/master/1k", "refs/odbs/master/1m", etc.
- and have the client know which one to download. But this wouldn't
scale if we introduce different object filters in the clone and fetch
commands.

I think that it is best to have upload-pack send this information
together with the packfile, since it knows exactly what objects were
omitted, and therefore what information the client needs. As discussed
in a sibling e-mail, clone/fetch already needs to be modified to omit
objects anyway.


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-06 Thread Ben Peart

On 7/1/2017 3:41 PM, Christian Couder wrote:
> On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart  wrote:
>>
>> On 6/20/2017 3:54 AM, Christian Couder wrote:
>>>
>>> To be able to better handle some kinds of objects, for example big
>>> blobs, it would be nice if Git could store its objects in other object
>>> databases (ODB).
>>>
>>> To do that, this patch series makes it possible to register commands,
>>> also called "helpers", using "odb.<odbname>.command" config variables,
>>> to access external ODBs where objects can be stored and retrieved.
>>>
>>> External ODBs should be able to transfer information about the blobs
>>> they store. This patch series shows how this is possible using a kind
>>> of replace refs.
>>
>> Great to see this making progress!
>>
>> My thoughts and questions are mostly about the overall design tradeoffs.
>>
>> Is your intention to enable the ODB to completely replace the regular
>> object store or just to supplement it?
>
> It is to supplement it, as I think the regular object store works very
> well most of the time.

I certainly understand the desire to restrict the scope of the patch
series.  I know full replacement is a much larger problem as it would
touch much more of the codebase.

I'd still like to see an object store that was thread safe, more robust
(ie transactional) and hopefully faster so I am hoping we can design the
ODB interface to eventually enable that.

For example: it seems the ODB helpers need to be able to be called
before the regular object store in the "put" case (so they can intercept
large objects for example) and after it in the "get" case to enable
"fault-in."  Something like this:

have/get

git object store
large object ODB helper

put
===
large object ODB helper
git object store

It would be nice if that order wasn't hard coded but that the order or
level of the "git object store" could be specified using the same
mechanism as used for the ODB helpers so that some day you could do
something like this:

have/get

"LMDB" ODB helper
git object store

put
===
"LMDB" ODB helper
git object store

(and even further out, drop the current git object store completely :)).

>> I think it would be good to ensure the
>> interface is robust and performant enough to actually replace the
>> current object store interface (even if we don't actually do that just
>> yet).
>
> I agree that it should be robust and performant, but I don't think it
> needs to be as performant in all cases as the current object store
> right now.
>
>> Another way of asking this is: do the 3 verbs (have, get, put) and the
>> 3 types of "get" enable you to wrap the current loose object and pack
>> file code as ODBs and run completely via the external ODB interface?
>> If not, what is missing and can it be added?

One example of what I think is missing is a way to stream objects (ie
get_stream, put_stream).  This isn't used often in git but it did exist
last I checked.  I'm not saying this needs to be supported in the first
version - more if we want to support total replacement.

I also wonder if we'd need an "optimize" verb (for "git gc") or a
"validate" verb (for "git fsck").  Again, only if/when we are looking at
total replacement.

> Right now the "put" verb only sends plain blobs, so the most logical
> way to run completely via the external ODB interface would be to use
> it to send and receive plain blobs. There are test scripts (t0420,
> t0470 and t0480) that use an http server as the external ODB and all
> the blobs are stored in it.
>
> And yeah for now it works only for blobs. There is a temporary patch
> in the series that limits it to blobs. For the non-RFC patch series, I
> think it should either use the attribute system to tell which objects
> should be run via the external ODB interface, or perhaps there should
> be a way to ask each external ODB helper which kinds of objects it
> can handle. I should add that in the future work part.

Sounds good.  For GVFS we handle all object types (including commits and
trees) so would need this to be enabled so that we can switch to using
it.

>> _Eventually_ it would be great to see the current object store(s)
>> moved behind the new ODB interface.
>
> This is not one of my goals and I think it could be a problem if we
> want to keep the "fault in" mode.
>
> In this mode the helper writes or reads directly to or from the
> current object store, so it needs the current object store to be
> available.

I think implementing "fault in" should be an option that the ODB handler
can implement but should not be required by the design/interface.  As
you state above, this could be as simple as having the ODB handler write
the object to the git object store on "get."

> Also I think compatibility with other git implementations is important
> and it is a good thing that they can all work on a common repository
> format.

I agree this should be an option but I don't want to say we'll _never_
move to a better object store.

>> When there are multiple ODB providers, what is the order they are
>> called?
>
> The external_odb_config() function creates the helpers for the
> external ODBs in the order they are found in the config file, and then
> these helpers are called in turn in the same order.

Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-03 Thread Junio C Hamano
Christian Couder  writes:

> On Sat, Jul 1, 2017 at 10:33 PM, Junio C Hamano  wrote:
>> Christian Couder  writes:
>>
>>>> I think it would be good to ensure the
>>>> interface is robust and performant enough to actually replace the current
>>>> object store interface (even if we don't actually do that just yet).
>>>
>>> I agree that it should be robust and performant, but I don't think it
>>> needs to be as performant in all cases as the current object store
>>> right now.
>>
>> That sounds like starting from a defeatist position.  Is there a
>> reason why you think using an external interface could never perform
>> well enough to be usable in everyday work?
>
> Perhaps in the future we will be able to make it as performant as, or
> perhaps even more performant than, the current object store, but in
> the current implementation the following issues mean that it will be
> less performant

That might be an answer to a different question; I was hoping to
hear that it should be performant enough for everyday work, but
never thought it would perform as well as local disk.

I haven't used a network filesystem in quite a while, but a repository
on NFS may still be usable, and we know our own access patterns better
than NFS, which cannot anticipate what paths its client's next
operations will touch, so it is not inconceivable that a well designed
external object database interface would let us outperform the "repo on
NFS" scenario.


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-01 Thread Christian Couder
On Sat, Jul 1, 2017 at 10:33 PM, Junio C Hamano  wrote:
> Christian Couder  writes:
>
>>> I think it would be good to ensure the
>>> interface is robust and performant enough to actually replace the current
>>> object store interface (even if we don't actually do that just yet).
>>
>> I agree that it should be robust and performant, but I don't think it
>> needs to be as performant in all cases as the current object store
>> right now.
>
> That sounds like starting from a defeatist position.  Is there a
> reason why you think using an external interface could never perform
> well enough to be usable in everyday work?

Perhaps in the future we will be able to make it as performant as, or
perhaps even more performant than, the current object store, but in
the current implementation the following issues mean that it will be
less performant:

- The external object stores are searched for an object only after the
object has not been found in the current object store. This means that
searching for an object will be slower if the object is in an external
object store. To overcome this, the "have" information (when the
external helper implements it) could be merged with information about
what objects are in the current object store, for example in a big
table or bitmap, so that only one lookup in this table or bitmap would
be needed to know if an object is available and in which object store
it is (see the sketch below). But I really don't want to get into this
right now.

- When an external odb helper retrieves an object and passes it to
Git, Git (or the helper itself in "fault in" mode) then stores the
object in the current object store. This is because we assume that it
will be faster to retrieve it again if it is cached in the current
object store. There could be a capability that asks Git to not cache
the objects that are retrieved from the external odb, but again I
don't think it is necessary at all to implement this right now.
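
About the first point: a crude version of such a merged table could
even be prototyped outside Git with existing commands. This is only a
sketch (the "odb-helper" name is made up), assuming a helper that
implements "have":

    # Merge the local object list with the external "have" output
    # into one sorted table, so a single lookup tells us whether an
    # object exists and where it lives.
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objectsize) %(objecttype) local' >table
    odb-helper have | sed 's/$/ external/' >>table
    sort -o table table

    # One binary search per query:
    look "$sha1" table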

I still think though that in some cases, like when the external odb is
used to implement a bundle clone, using the external odb mechanism can
already be more performant.


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-01 Thread Junio C Hamano
Christian Couder  writes:

>> I think it would be good to ensure the
>> interface is robust and performant enough to actually replace the current
>> object store interface (even if we don't actually do that just yet).
>
> I agree that it should be robust and performant, but I don't think it
> needs to be as performant in all cases as the current object store
> right now.

That sounds like starting from a defeatist position.  Is there a
reason why you think using an external interface could never perform
well enough to be usable in everyday work?


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-01 Thread Christian Couder
On Sat, Jul 1, 2017 at 9:41 PM, Christian Couder
 wrote:
> On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart  wrote:

>> The fact that "git clone is taught a --initial-refspec" option" indicates
>> this isn't just an ODB implementation detail.  Is there a general capability
>> that is missing from the ODB interface that needs to be addressed here?
>
> Technically you don't need to teach `git clone` the --initial-refspec
> option to make it work.
> It can work like this:
>
> $ git init
> $ git remote add origin <url>
> $ git fetch origin <odb refspec>
> $ git config odb.<odbname>.command <command>
> $ git fetch origin
>
> But it is much simpler for the user to instead just do:
>
> $ git clone -c odb.<odbname>.command=<command> --initial-refspec
> <odb refspec> <url>
>
> I also think that the --initial-refspec option could perhaps be useful
> for other kinds of refs, for example tags, notes or replace refs, to
> make sure that those refs are fetched first and that hooks can use
> them when fetching other refs like branches in the later part of the
> clone.

Actually I am not sure that it's possible to set up hooks per se before
or while cloning, but perhaps there are other kinds of scripts or git
commands that could trigger and use the refs that have been fetched
first.


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-01 Thread Christian Couder
On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart  wrote:
>
>
> On 6/20/2017 3:54 AM, Christian Couder wrote:

>> To be able to better handle some kinds of objects, for example big
>> blobs, it would be nice if Git could store its objects in other object
>> databases (ODB).
>>
>> To do that, this patch series makes it possible to register commands,
>> also called "helpers", using "odb..command" config variables,
>> to access external ODBs where objects can be stored and retrieved.
>>
>> External ODBs should be able to transfer information about the blobs
>> they store. This patch series shows how this is possible using a kind
>> of replace refs.
>
> Great to see this making progress!
>
> My thoughts and questions are mostly about the overall design tradeoffs.
>
> Is your intention to enable the ODB to completely replace the regular object
> store or just to supplement it?

It is to supplement it, as I think the regular object store works very
well most of the time.

> I think it would be good to ensure the
> interface is robust and performant enough to actually replace the current
> object store interface (even if we don't actually do that just yet).

I agree that it should be robust and performant, but I don't think it
needs to be as performant in all cases as the current object store
right now.

> Another way of asking this is: do the 3 verbs (have, get, put) and the 3
> types of "get" enable you to wrap the current loose object and pack file
> code as ODBs and run completely via the external ODB interface?  If not,
> what is missing and can it be added?

Right now the "put" verb only send plain blobs, so the most logical
way to run completely via the external ODB interface would be to use
it to send and receive plain blobs. There are tests scripts (t0420,
t0470 and t0480) that use an http server as the external ODB and all
the blobs are stored in it.

And yeah for now it works only for blobs. There is a temporary patch
in the series that limits it to blobs. For the non-RFC patch series, I
think it should either use the attribute system to tell which objects
should be run via the external ODB interface, or perhaps there should
be a way to ask each external ODB helper which kinds of objects it
can handle. I should add that in the future work part.

> _Eventually_ it would be great to see the current object store(s) moved
> behind the new ODB interface.

This is not one of my goals and I think it could be a problem if we
want to keep the "fault in" mode.
In this mode the helper writes or reads directly to or from the
current object store, so it needs the current object store to be
available.

Also I think compatibility with other git implementations is important
and it is a good thing that they can all work on a common repository
format.

> When there are multiple ODB providers, what is the order they are called?

The external_odb_config() function creates the helpers for the
external ODBs in the order they are found in the config file, and then
these helpers are called in turn in the same order.
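
So with a config like this (the helper paths are made up), the "large"
helper would be consulted before the "lmdb" one, for every verb:

    # .git/config sketch: helpers are tried in the order they appear
    [odb "large"]
        command = /usr/local/bin/git-odb-large
    [odb "lmdb"]
        command = /usr/local/bin/git-odb-lmdb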

> If one fails a request (get, have, put), are the others called to see if they
> can fulfill the request?

Yes, but there are no tests to check that it works well. I will need
to add some.

> Can the order they are called for the various verbs be configured explicitly?

Right now, you can configure the order by changing the config file,
but the order will be the same for all the verbs.

> For
> example, it would be nice to have a "large object ODB handler" configured to
> get first try at all "put" verbs.  Then if it meets its size requirements,
> it will handle the verb, otherwise it fails and git will try the other ODBs.

This can work if the "large object ODB handler" is configured first.

Also this is linked with how you define which objects are handled by
which helper. For example if the attribute system is used to describe
which external ODB is used for which files, there could be a way to
tell for example that blobs larger than 1MB are handled by the "large
object ODB handler" while those that are smaller are handled by
another helper.
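
None of that syntax exists yet, but just to illustrate the idea, it
could perhaps end up looking something like this (all hypothetical):

    # .gitattributes: route some paths to the "large" helper
    *.bin   odb=large
    *.mp4   odb=large

    # .git/config: a per-helper size threshold
    [odb "large"]
        command = /usr/local/bin/git-odb-large
        minBlobSize = 1M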

>> Design
>> ~~~~~~
>>
>> * The "helpers" (registered commands)
>>
>> Each helper manages access to one external ODB.
>>
>> There are now 2 different modes for helpers:
>>
>>- When "odb..scriptMode" is set to "true", the helper is
>>  launched each time Git wants to communicate with the 
>>  external ODB.
>>
>>- When "odb..scriptMode" is not set or set to "false", then
>>  the helper is launched once as a sub-process (using
>>  sub-process.h), and Git communicates with it using packet lines.
>
> Is it worth supporting two different modes long term?  It seems that this
> could be simplified (less code to write, debug, document, support) by only
> supporting the 2nd that uses the sub-process.  As far as I can tell, the
> capabilities are the same, it's just the second one is more performant when
> multiple calls are made.

Yeah, 

Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-06-23 Thread Ben Peart

On 6/20/2017 3:54 AM, Christian Couder wrote:
> Goal
> ~~~~
>
> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
>
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).
>
> To do that, this patch series makes it possible to register commands,
> also called "helpers", using "odb.<odbname>.command" config variables,
> to access external ODBs where objects can be stored and retrieved.
>
> External ODBs should be able to transfer information about the blobs
> they store. This patch series shows how this is possible using a kind
> of replace refs.

Great to see this making progress!

My thoughts and questions are mostly about the overall design tradeoffs.

Is your intention to enable the ODB to completely replace the regular
object store or just to supplement it?  I think it would be good to
ensure the interface is robust and performant enough to actually
replace the current object store interface (even if we don't actually
do that just yet).

Another way of asking this is: do the 3 verbs (have, get, put) and the 3
types of "get" enable you to wrap the current loose object and pack file
code as ODBs and run completely via the external ODB interface?  If not,
what is missing and can it be added?

_Eventually_ it would be great to see the current object store(s) moved
behind the new ODB interface.

When there are multiple ODB providers, what is the order they are
called?  If one fails a request (get, have, put), are the others called
to see if they can fulfill the request?

Can the order they are called for the various verbs be configured
explicitly?  For example, it would be nice to have a "large object ODB
handler" configured to get first try at all "put" verbs.  Then if it
meets its size requirements, it will handle the verb, otherwise it fails
and git will try the other ODBs.

> Design
> ~~~~~~
>
> * The "helpers" (registered commands)
>
> Each helper manages access to one external ODB.
>
> There are now 2 different modes for helpers:
>
>    - When "odb.<odbname>.scriptMode" is set to "true", the helper is
>  launched each time Git wants to communicate with the
>  external ODB.
>
>    - When "odb.<odbname>.scriptMode" is not set or set to "false", then
>  the helper is launched once as a sub-process (using
>  sub-process.h), and Git communicates with it using packet lines.

Is it worth supporting two different modes long term?  It seems that
this could be simplified (less code to write, debug, document, support)
by only supporting the 2nd that uses the sub-process.  As far as I can
tell, the capabilities are the same, it's just the second one is more
performant when multiple calls are made.

> A helper can be given different instructions by Git. The instructions
> that are supported are negotiated at the beginning of the
> communication using a capability mechanism.
>
> For now the following instructions are supported:
>
>    - "have": the helper should respond with the sha1, size and type of
>  all the objects the external ODB contains, one object per line.
>
>    - "get <sha1>": the helper should then read from the external ODB
>  the content of the object corresponding to <sha1> and pass it to Git.
>
>    - "put <sha1> <size> <type>": the helper should then read from
>  Git an object and store it in the external ODB.
>
> Currently "have" and "put" are optional.

It's good the various verbs can be optional.  That way any particular
ODB only has to handle those it needs to provide a different behavior
for.

> There are 3 different kinds of "get" instructions depending on how the
> helper passes objects to Git:
>
>    - "fault_in": the helper will write the requested objects directly
>  into the regular Git object database, and then Git will retry
>  reading it from there.

I think the "fault_in" behavior can be implemented efficiently without
the overhead of a 3rd special "get" instruction if we enable some of the
other capabilities discussed.

For example, assume an ODB is set up to handle missing objects (by
registering itself as "last" in the prioritized list of ODB handlers).
If it is ever asked to retrieve a missing object, it can retrieve the
object and return it as a "git_object" or "plain_object" and also cache
it locally as a loose object, pack file, or any other ODB handler
supported mechanism.  Future requests will then provide that object via
the locally cached copy and its associated ODB handler.

>    - "git_object": the helper will send the object as a Git object.
>
>    - "plain_object": the helper will send the object (a blob) as a raw
>  object. (The blob content will be sent as is.)
>
> For now the kind of "get" that is supported is read from the
> "odb.<odbname>.fetchKind" configuration variable, but in the future it
> should be decided as part of the capability negotiation.

I agree it makes sense to move this into the capability negotiation but
I also wonder if we really need to support both.  Is 

Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-06-20 Thread Christian Couder
On Tue, Jun 20, 2017 at 9:54 AM, Christian Couder
 wrote:
>
> Future work
> ~~~~~~~~~~~
>
> First sorry about the state of this patch series, it is not as clean
> as I would have liked, but I think it is interesting to get feedback
> from the mailing list at this point, because the previous RFC was sent
> a long time ago and a lot of things changed.
>
> So a big part of the future work will be about cleaning this patch series.
>
> Other things I think I am going to do:
>
>   -

Oops, I had not saved my emacs buffer where I wrote this when I sent
the patch series.

This should have been:

Other things I think I may work on:

  - Remove the "odb..scriptMode" and "odb..command"
options and instead have just "odb..scriptCommand" and
"odb..subprocessCommand".

  - Use capabilities instead of "odb.<odbname>.fetchKind" to decide
which kind of "get" will be used.

  - Better test all the combinations of the above modes with and
without "have" and "put" instructions.

  - Maybe also have different kinds of "put" so that Git could pass
either a git object or a plain object, or ask the helper to retrieve
it directly from Git's object database.

  - Maybe add an "init" instruction as the script mode has something
like this called "get_cap" and it would help the sub-process mode
too, as it makes it possible for Git to know the capabilities
before trying to send any instruction (that might not be supported
by the helper). The "init" instruction would be the only required
instruction for any helper to implement.

  - Add more long running tests and improve tests in general.


[RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-06-20 Thread Christian Couder
Goal
~~~~

Git can store its objects only in the form of loose objects in
separate files or packed objects in a pack file.

To be able to better handle some kinds of objects, for example big
blobs, it would be nice if Git could store its objects in other object
databases (ODB).

To do that, this patch series makes it possible to register commands,
also called "helpers", using "odb..command" config variables,
to access external ODBs where objects can be stored and retrieved.

External ODBs should be able to transfer information about the blobs
they store. This patch series shows how this is possible using a kind
of replace refs.

Design
~~~~~~

* The "helpers" (registered commands)

Each helper manages access to one external ODB.

There are now 2 different modes for helpers:

  - When "odb..scriptMode" is set to "true", the helper is
launched each time Git wants to communicate with the 
external ODB.

  - When "odb..scriptMode" is not set or set to "false", then
the helper is launched once as a sub-process (using
sub-process.h), and Git communicates with it using packet lines.
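
For illustration: a packet line carries a 4-hex-digit length, which
includes the 4 length bytes themselves, followed by the payload, and a
"0000" flush-pkt ends an exchange. The payload below is only an
example, not the exact bytes the helper protocol sends:

    payload='get 1234567890123456789012345678901234567890'
    printf '%04x%s' $(( 4 + ${#payload} )) "$payload"   # emits "0030get ..."
    printf '0000'                                       # flush-pkt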

A helper can be given different instructions by Git. The instructions
that are supported are negotiated at the beginning of the
communication using a capability mechanism.

For now the following instructions are supported: 

  - "have": the helper should respond with the sha1, size and type of
all the objects the external ODB contains, one object per line.

  - "get ": the helper should then read from the external ODB
the content of the object corresponding to  and pass it to Git.

  - "put   ": the helper should then read from from
Git an object and store it in the external ODB.

Currently "have" and "put" are optional.

There are 3 different kinds of "get" instructions depending on how the
helper passes objects to Git:

  - "fault_in": the helper will write the requested objects directly
into the regular Git object database, and then Git will retry
reading it from there.

  - "git_object": the helper will send the object as a Git object.

  - "plain_object": the helper will send the object (a blob) as a raw
object. (The blob content will be sent as is.)

For now the kind of "get" that is supported is read from the
"odb..fetchKind" configuration variable, but in the future it
should be decided as part of the capability negociation.
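
For example, in "fault_in" mode the "get" side of a helper can just
pipe what it fetches into "git hash-object -w", which writes it into
the regular object database (the URL below is made up):

    # sketch of the "get" side of a fault_in helper; $2 is the sha1
    curl -s "http://odb.example.com/objects/$2" | git hash-object -w --stdin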

* Transferring information

To transfer information about the blobs stored in an external ODB, some
special refs, called "odb refs", similar to replace refs, are used in
the tests of this series, but in general nothing forces the helper to
use that mechanism.

The external odb helper is responsible for using and creating the refs
in refs/odbs/<odbname>/, if it wants to do that. It is free, for example,
to create just one ref, as it is also free to create many refs. Git
would just transmit the refs that have been created by this helper, if
Git is asked to do so.

For now in the tests there is one odb ref per blob, as it is simple
and as it is similar to what git-lfs does. Each ref name is
refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
in the external odb named <odbname>.

These odb refs point to a blob that is stored in the Git
repository and contain information about the blob stored in the
external odb. This information can be specific to the external odb.
The repos can then share this information using commands like:

`git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
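
For example, a helper that has just stored a blob could record that
fact like this (the metadata format is up to the helper; this one is
made up):

    # point an odb ref at a small metadata blob kept in the repository
    meta=$(printf 'url=http://odb.example.com/objects/%s\n' "$sha1" |
           git hash-object -w --stdin)
    git update-ref "refs/odbs/myodb/$sha1" "$meta"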

At the end of the current patch series, "git clone" is taught a
"--initial-refspec" option, that asks it to first fetch some specified
refs. This is used in the tests to fetch the odb refs first.

This way a single "git clone" command can set up a repo using the
external ODB mechanism, as long as the right helper is installed on the
machine and as long as the following options are used:

  - "--initial-refspec " to fetch the odb refspec
  - "-c odb..command=" to configure the helper

There is also a test script that shows that the "--initial-refspec"
option along with the external ODB mechanism can be used to implement
cloning using bundles.

* External object database

This RFC patch series shows in the tests:

  - how to use another git repository as an external ODB (storing Git objects)
  - how to use an http server as an external ODB (storing plain objects)

(This works in both script mode and sub-process mode.)
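
For instance, the http case boils down to a helper that forwards "get"
and "put" to the server; this is only a sketch with a made-up URL, not
the test scripts' exact code:

    #!/bin/sh
    URL=http://localhost:8080/odb   # hypothetical server
    case "$1" in
    get) curl -s "$URL/$2" ;;       # blob content to stdout
    put) curl -s -T - "$URL/$2" ;;  # upload object data from stdin
    esac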

* Performance

So the sub-process mode, which is now the default, has been
implemented in this new version of this patch series.

This has been implemented using the refactoring that Ben Peart did on
top of Lars Schneider's work on using sub-processes and packet lines
in the smudge/clean filters for git-lfs. This also uses further work
from Ben Peart called "read object process".

See:

http://public-inbox.org/git/20170113155253.1644-1-benpe...@microsoft.com/
http://public-inbox.org/git/20170322165220.5660-1-benpe...@microsoft.com/

Thanks to this, the external ODB mechanism should in the end perform
as well as the