Re: [PATCH v5 25/40] external-odb: add 'get_direct' support

2017-09-15 Thread Jonathan Tan
On Fri, 15 Sep 2017 13:24:50 +0200
Christian Couder  wrote:

> > There are still some nuances. For example, if an external ODB provides
> > both a tree and a blob that the tree references, do we fetch the tree in
> > order to call "have" on all its blobs, or do we trust the ODB that if it
> > has the tree, it has all the other objects? In my design, I do the
> > latter, but in the general case where we have multiple ODBs, we might
> > have to do the former. (And if we do the former, it seems to me that the
> > connectivity check must be performed "online" - that is, with the ODBs
> > being able to provide "get".)
> 
> Yeah, I agree that the problem is more complex if there can be trees
> or all kind of objects in the external odb.
> But as I explain in the following email to Junio, I don't think
> storing other kind of objects is one of the most interesting use case:
> 
> https://public-inbox.org/git/CAP8UFD3=nutrf24clsok4hsgm3nxgh8sbzvpmcg7cnchj2z...@mail.gmail.com/

If we start with only blobs in the ODB, that makes sense (the ODB will
need to supply a fast enough "list" or "have", but, as you wrote before,
a mechanism like fetching an additional ref that contains all the
necessary information whenever we fetch refs would be enough). I agree
that it would work with existing use cases (including yours).

> > (Also, my work extends all the way to fetch/clone [1], but admittedly I
> > have been taking it a step at a time and recently have only been
> > discussing how the local repo should handle the missing object
> > situation.)
> >
> > [1] 
> > https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/
> 
> Yeah, I think your work is interesting and could perhaps be useful for
> external odbs as there could be situations that would be handled
> better using your work or something similar.

Thanks.


Re: [PATCH v5 25/40] external-odb: add 'get_direct' support

2017-09-15 Thread Christian Couder
On Thu, Sep 14, 2017 at 8:19 PM, Jonathan Tan  wrote:
> On Thu, 14 Sep 2017 10:39:35 +0200
> Christian Couder  wrote:
>
>> From the following email:
>>
>> https://public-inbox.org/git/20170804145113.5ceaf...@twelve2.svl.corp.google.com/
>>
>> it looks like his work is fundamentally about changing the rules of
>> connectivity checks. Objects are split between "homegrown" objects and
>> "imported" objects which are in separate pack files. Then references
>> to imported objects are not checked during connectivity check.
>>
>> I think changing connectivity rules is not necessary to make something
>> like external odb work. For example when fetching a pack that refers
>> to objects that are in an external odb, if access this external odb
>> has been configured, then the connectivity check will pass as the
>> missing objects in the pack will be seen as already part of the repo.
>
> There are still some nuances. For example, if an external ODB provides
> both a tree and a blob that the tree references, do we fetch the tree in
> order to call "have" on all its blobs, or do we trust the ODB that if it
> has the tree, it has all the other objects? In my design, I do the
> latter, but in the general case where we have multiple ODBs, we might
> have to do the former. (And if we do the former, it seems to me that the
> connectivity check must be performed "online" - that is, with the ODBs
> being able to provide "get".)

Yeah, I agree that the problem is more complex if there can be trees
or all kind of objects in the external odb.
But as I explain in the following email to Junio, I don't think
storing other kind of objects is one of the most interesting use case:

https://public-inbox.org/git/CAP8UFD3=nutrf24clsok4hsgm3nxgh8sbzvpmcg7cnchj2z...@mail.gmail.com/

> (Also, my work extends all the way to fetch/clone [1], but admittedly I
> have been taking it a step at a time and recently have only been
> discussing how the local repo should handle the missing object
> situation.)
>
> [1] 
> https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/

Yeah, I think your work is interesting and could perhaps be useful for
external odbs as there could be situations that would be handled
better using your work or something similar.


Re: [PATCH v5 25/40] external-odb: add 'get_direct' support

2017-09-14 Thread Jonathan Tan
On Thu, 14 Sep 2017 10:39:35 +0200
Christian Couder  wrote:

> From the following email:
> 
> https://public-inbox.org/git/20170804145113.5ceaf...@twelve2.svl.corp.google.com/
> 
> it looks like his work is fundamentally about changing the rules of
> connectivity checks. Objects are split between "homegrown" objects and
> "imported" objects which are in separate pack files. Then references
> to imported objects are not checked during connectivity check.
> 
> I think changing connectivity rules is not necessary to make something
> like external odb work. For example when fetching a pack that refers
> to objects that are in an external odb, if access this external odb
> has been configured, then the connectivity check will pass as the
> missing objects in the pack will be seen as already part of the repo.

There are still some nuances. For example, if an external ODB provides
both a tree and a blob that the tree references, do we fetch the tree in
order to call "have" on all its blobs, or do we trust the ODB that if it
has the tree, it has all the other objects? In my design, I do the
latter, but in the general case where we have multiple ODBs, we might
have to do the former. (And if we do the former, it seems to me that the
connectivity check must be performed "online" - that is, with the ODBs
being able to provide "get".)

(Also, my work extends all the way to fetch/clone [1], but admittedly I
have been taking it a step at a time and recently have only been
discussing how the local repo should handle the missing object
situation.)

[1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/

> Yeah, if some commands like fsck are used, then possibly all the
> objects will have to be requested from the external odb, as it may not
> be possible to fully check all the objects, especially the blobs,
> without accessing all their data. But I think this is a problem that
> could be dealt with in different ways. For example we could develop
> specific options in fsck so that it doesn't check the sha1 of objects
> that are marked with some specific attributes, or that are stored in
> external odbs, or that are bigger than some size.

The hard part is in dealing with missing commits and trees, I think, not
blobs.


Re: [PATCH v5 25/40] external-odb: add 'get_direct' support

2017-09-14 Thread Christian Couder
On Thu, Aug 3, 2017 at 11:40 PM, Junio C Hamano  wrote:
> Christian Couder  writes:
>
>> This implements the 'get_direct' capability/instruction that makes
>> it possible for external odb helper scripts to pass blobs to Git
>> by directly writing them as loose objects files.
>
> I am not sure if the assumption is made clear in this series, but I
> am (perhaps incorrectly) guessing that it is assumed that the
> intended use of this feature is to offload access to large blobs
> by not including them in the initial clone.

Yeah, it could be used for that, but that's not the only interesting use case.

It could also be used for example if the working tree contains a huge
number of blobs and it is better to download only the blobs that are
needed when they are needed. In fact the code for 'get_direct' was
taken from Ben Peart's "read-object" patch series (actually from an
earlier version of this patch series):

https://public-inbox.org/git/20170714132651.170708-1-benpe...@microsoft.com/

> So from that point of
> view, I think it makes tons of sense to let the external helper to
> directly populate the database bypassing Git (i.e. instead of
> feeding data stream and have Git store it) like this "direct" method
> does.
>
> How does this compare with (and how well does this work with) what
> Jonathan Tan is doing recently?

>From the following email:

https://public-inbox.org/git/20170804145113.5ceaf...@twelve2.svl.corp.google.com/

it looks like his work is fundamentally about changing the rules of
connectivity checks. Objects are split between "homegrown" objects and
"imported" objects which are in separate pack files. Then references
to imported objects are not checked during connectivity check.

I think changing connectivity rules is not necessary to make something
like external odb work. For example when fetching a pack that refers
to objects that are in an external odb, if access this external odb
has been configured, then the connectivity check will pass as the
missing objects in the pack will be seen as already part of the repo.

Yeah, if some commands like fsck are used, then possibly all the
objects will have to be requested from the external odb, as it may not
be possible to fully check all the objects, especially the blobs,
without accessing all their data. But I think this is a problem that
could be dealt with in different ways. For example we could develop
specific options in fsck so that it doesn't check the sha1 of objects
that are marked with some specific attributes, or that are stored in
external odbs, or that are bigger than some size.


Re: [PATCH v5 25/40] external-odb: add 'get_direct' support

2017-08-03 Thread Junio C Hamano
Christian Couder  writes:

> This implements the 'get_direct' capability/instruction that makes
> it possible for external odb helper scripts to pass blobs to Git
> by directly writing them as loose objects files.

I am not sure if the assumption is made clear in this series, but I
am (perhaps incorrectly) guessing that it is assumed that the
intended use of this feature is to offload access to large blobs
by not including them in the initial clone.  So from that point of
view, I think it makes tons of sense to let the external helper to
directly populate the database bypassing Git (i.e. instead of
feeding data stream and have Git store it) like this "direct" method
does.

How does this compare with (and how well does this work with) what
Jonathan Tan is doing recently?


[PATCH v5 25/40] external-odb: add 'get_direct' support

2017-08-03 Thread Christian Couder
This implements the 'get_direct' capability/instruction that makes
it possible for external odb helper scripts to pass blobs to Git
by directly writing them as loose objects files.

It is better to call this a "direct" mode rather than a "fault-in"
mode as we could have the same kind of mechanism to "put" objects
into an external odb, where the odb helper would access blobs it
wants to send to an external odb directly from files, but it
would be strange to call that a fault-in mode too.

Signed-off-by: Christian Couder 
---
 external-odb.c | 21 -
 external-odb.h |  1 +
 odb-helper.c   | 27 +--
 odb-helper.h   |  1 +
 4 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/external-odb.c b/external-odb.c
index 52cb448d01..31d21bfe04 100644
--- a/external-odb.c
+++ b/external-odb.c
@@ -96,7 +96,8 @@ int external_odb_get_object(const unsigned char *sha1)
int ret;
int fd;
 
-   if (!odb_helper_has_object(o, sha1))
+   if (!(o->supported_capabilities & ODB_HELPER_CAP_GET_RAW_OBJ) &&
+   !(o->supported_capabilities & ODB_HELPER_CAP_GET_GIT_OBJ))
continue;
 
fd = create_object_tmpfile(, path);
@@ -122,6 +123,24 @@ int external_odb_get_object(const unsigned char *sha1)
return -1;
 }
 
+int external_odb_get_direct(const unsigned char *sha1)
+{
+   struct odb_helper *o;
+
+   if (!external_odb_has_object(sha1))
+   return -1;
+
+   for (o = helpers; o; o = o->next) {
+   if (!(o->supported_capabilities & ODB_HELPER_CAP_GET_DIRECT))
+   continue;
+   if (odb_helper_get_direct(o, sha1) < 0)
+   continue;
+   return 0;
+   }
+
+   return -1;
+}
+
 int external_odb_put_object(const void *buf, size_t len,
const char *type, unsigned char *sha1)
 {
diff --git a/external-odb.h b/external-odb.h
index 3e0e6d0165..247b131fd5 100644
--- a/external-odb.h
+++ b/external-odb.h
@@ -4,6 +4,7 @@
 const char *external_odb_root(void);
 int external_odb_has_object(const unsigned char *sha1);
 int external_odb_get_object(const unsigned char *sha1);
+int external_odb_get_direct(const unsigned char *sha1);
 int external_odb_put_object(const void *buf, size_t len,
const char *type, unsigned char *sha1);
 
diff --git a/odb-helper.c b/odb-helper.c
index 0603993057..b1f5464214 100644
--- a/odb-helper.c
+++ b/odb-helper.c
@@ -405,14 +405,37 @@ static int odb_helper_get_git_object(struct odb_helper *o,
return 0;
 }
 
+int odb_helper_get_direct(struct odb_helper *o,
+ const unsigned char *sha1)
+{
+   struct odb_helper_object *obj;
+   struct odb_helper_cmd cmd;
+
+   obj = odb_helper_lookup(o, sha1);
+   if (!obj)
+   return -1;
+
+   if (odb_helper_start(o, , 0, "get_direct %s", sha1_to_hex(sha1)) < 
0)
+   return -1;
+
+   if (odb_helper_finish(o, ))
+   return -1;
+
+   return 0;
+}
+
 int odb_helper_get_object(struct odb_helper *o,
  const unsigned char *sha1,
  int fd)
 {
+   if (o->supported_capabilities & ODB_HELPER_CAP_GET_GIT_OBJ)
+   return odb_helper_get_git_object(o, sha1, fd);
if (o->supported_capabilities & ODB_HELPER_CAP_GET_RAW_OBJ)
return odb_helper_get_raw_object(o, sha1, fd);
-   else
-   return odb_helper_get_git_object(o, sha1, fd);
+   if (o->supported_capabilities & ODB_HELPER_CAP_GET_DIRECT)
+   return 0;
+
+   BUG("invalid get capability (capabilities: '%d')", 
o->supported_capabilities);
 }
 
 int odb_helper_put_object(struct odb_helper *o,
diff --git a/odb-helper.h b/odb-helper.h
index 318e0d48dc..f2fd2b7c9c 100644
--- a/odb-helper.h
+++ b/odb-helper.h
@@ -33,6 +33,7 @@ int odb_helper_init(struct odb_helper *o);
 int odb_helper_has_object(struct odb_helper *o, const unsigned char *sha1);
 int odb_helper_get_object(struct odb_helper *o, const unsigned char *sha1,
  int fd);
+int odb_helper_get_direct(struct odb_helper *o, const unsigned char *sha1);
 int odb_helper_put_object(struct odb_helper *o,
  const void *buf, size_t len,
  const char *type, unsigned char *sha1);
-- 
2.14.0.rc1.52.gf02fb0ddac.dirty