Re: [PATCH v5 25/40] external-odb: add 'get_direct' support
On Fri, 15 Sep 2017 13:24:50 +0200 Christian Couder wrote: > > There are still some nuances. For example, if an external ODB provides > > both a tree and a blob that the tree references, do we fetch the tree in > > order to call "have" on all its blobs, or do we trust the ODB that if it > > has the tree, it has all the other objects? In my design, I do the > > latter, but in the general case where we have multiple ODBs, we might > > have to do the former. (And if we do the former, it seems to me that the > > connectivity check must be performed "online" - that is, with the ODBs > > being able to provide "get".) > > Yeah, I agree that the problem is more complex if there can be trees > or all kind of objects in the external odb. > But as I explain in the following email to Junio, I don't think > storing other kind of objects is one of the most interesting use case: > > https://public-inbox.org/git/CAP8UFD3=nutrf24clsok4hsgm3nxgh8sbzvpmcg7cnchj2z...@mail.gmail.com/ If we start with only blobs in the ODB, that makes sense (the ODB will need to supply a fast enough "list" or "have", but, as you wrote before, a mechanism like fetching an additional ref that contains all the necessary information whenever we fetch refs would be enough). I agree that it would work with existing use cases (including yours). > > (Also, my work extends all the way to fetch/clone [1], but admittedly I > > have been taking it a step at a time and recently have only been > > discussing how the local repo should handle the missing object > > situation.) > > > > [1] > > https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/ > > Yeah, I think your work is interesting and could perhaps be useful for > external odbs as there could be situations that would be handled > better using your work or something similar. Thanks.
Re: [PATCH v5 25/40] external-odb: add 'get_direct' support
On Thu, Sep 14, 2017 at 8:19 PM, Jonathan Tan wrote: > On Thu, 14 Sep 2017 10:39:35 +0200 > Christian Couder wrote: > >> From the following email: >> >> https://public-inbox.org/git/20170804145113.5ceaf...@twelve2.svl.corp.google.com/ >> >> it looks like his work is fundamentally about changing the rules of >> connectivity checks. Objects are split between "homegrown" objects and >> "imported" objects which are in separate pack files. Then references >> to imported objects are not checked during connectivity check. >> >> I think changing connectivity rules is not necessary to make something >> like external odb work. For example when fetching a pack that refers >> to objects that are in an external odb, if access this external odb >> has been configured, then the connectivity check will pass as the >> missing objects in the pack will be seen as already part of the repo. > > There are still some nuances. For example, if an external ODB provides > both a tree and a blob that the tree references, do we fetch the tree in > order to call "have" on all its blobs, or do we trust the ODB that if it > has the tree, it has all the other objects? In my design, I do the > latter, but in the general case where we have multiple ODBs, we might > have to do the former. (And if we do the former, it seems to me that the > connectivity check must be performed "online" - that is, with the ODBs > being able to provide "get".) Yeah, I agree that the problem is more complex if there can be trees or all kind of objects in the external odb. But as I explain in the following email to Junio, I don't think storing other kind of objects is one of the most interesting use case: https://public-inbox.org/git/CAP8UFD3=nutrf24clsok4hsgm3nxgh8sbzvpmcg7cnchj2z...@mail.gmail.com/ > (Also, my work extends all the way to fetch/clone [1], but admittedly I > have been taking it a step at a time and recently have only been > discussing how the local repo should handle the missing object > situation.) > > [1] > https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/ Yeah, I think your work is interesting and could perhaps be useful for external odbs as there could be situations that would be handled better using your work or something similar.
Re: [PATCH v5 25/40] external-odb: add 'get_direct' support
On Thu, 14 Sep 2017 10:39:35 +0200 Christian Couder wrote: > From the following email: > > https://public-inbox.org/git/20170804145113.5ceaf...@twelve2.svl.corp.google.com/ > > it looks like his work is fundamentally about changing the rules of > connectivity checks. Objects are split between "homegrown" objects and > "imported" objects which are in separate pack files. Then references > to imported objects are not checked during connectivity check. > > I think changing connectivity rules is not necessary to make something > like external odb work. For example when fetching a pack that refers > to objects that are in an external odb, if access this external odb > has been configured, then the connectivity check will pass as the > missing objects in the pack will be seen as already part of the repo. There are still some nuances. For example, if an external ODB provides both a tree and a blob that the tree references, do we fetch the tree in order to call "have" on all its blobs, or do we trust the ODB that if it has the tree, it has all the other objects? In my design, I do the latter, but in the general case where we have multiple ODBs, we might have to do the former. (And if we do the former, it seems to me that the connectivity check must be performed "online" - that is, with the ODBs being able to provide "get".) (Also, my work extends all the way to fetch/clone [1], but admittedly I have been taking it a step at a time and recently have only been discussing how the local repo should handle the missing object situation.) [1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/ > Yeah, if some commands like fsck are used, then possibly all the > objects will have to be requested from the external odb, as it may not > be possible to fully check all the objects, especially the blobs, > without accessing all their data. But I think this is a problem that > could be dealt with in different ways. For example we could develop > specific options in fsck so that it doesn't check the sha1 of objects > that are marked with some specific attributes, or that are stored in > external odbs, or that are bigger than some size. The hard part is in dealing with missing commits and trees, I think, not blobs.
Re: [PATCH v5 25/40] external-odb: add 'get_direct' support
On Thu, Aug 3, 2017 at 11:40 PM, Junio C Hamano wrote: > Christian Couder writes: > >> This implements the 'get_direct' capability/instruction that makes >> it possible for external odb helper scripts to pass blobs to Git >> by directly writing them as loose objects files. > > I am not sure if the assumption is made clear in this series, but I > am (perhaps incorrectly) guessing that it is assumed that the > intended use of this feature is to offload access to large blobs > by not including them in the initial clone. Yeah, it could be used for that, but that's not the only interesting use case. It could also be used for example if the working tree contains a huge number of blobs and it is better to download only the blobs that are needed when they are needed. In fact the code for 'get_direct' was taken from Ben Peart's "read-object" patch series (actually from an earlier version of this patch series): https://public-inbox.org/git/20170714132651.170708-1-benpe...@microsoft.com/ > So from that point of > view, I think it makes tons of sense to let the external helper to > directly populate the database bypassing Git (i.e. instead of > feeding data stream and have Git store it) like this "direct" method > does. > > How does this compare with (and how well does this work with) what > Jonathan Tan is doing recently? >From the following email: https://public-inbox.org/git/20170804145113.5ceaf...@twelve2.svl.corp.google.com/ it looks like his work is fundamentally about changing the rules of connectivity checks. Objects are split between "homegrown" objects and "imported" objects which are in separate pack files. Then references to imported objects are not checked during connectivity check. I think changing connectivity rules is not necessary to make something like external odb work. For example when fetching a pack that refers to objects that are in an external odb, if access this external odb has been configured, then the connectivity check will pass as the missing objects in the pack will be seen as already part of the repo. Yeah, if some commands like fsck are used, then possibly all the objects will have to be requested from the external odb, as it may not be possible to fully check all the objects, especially the blobs, without accessing all their data. But I think this is a problem that could be dealt with in different ways. For example we could develop specific options in fsck so that it doesn't check the sha1 of objects that are marked with some specific attributes, or that are stored in external odbs, or that are bigger than some size.
Re: [PATCH v5 25/40] external-odb: add 'get_direct' support
Christian Couder writes: > This implements the 'get_direct' capability/instruction that makes > it possible for external odb helper scripts to pass blobs to Git > by directly writing them as loose objects files. I am not sure if the assumption is made clear in this series, but I am (perhaps incorrectly) guessing that it is assumed that the intended use of this feature is to offload access to large blobs by not including them in the initial clone. So from that point of view, I think it makes tons of sense to let the external helper to directly populate the database bypassing Git (i.e. instead of feeding data stream and have Git store it) like this "direct" method does. How does this compare with (and how well does this work with) what Jonathan Tan is doing recently?
[PATCH v5 25/40] external-odb: add 'get_direct' support
This implements the 'get_direct' capability/instruction that makes it possible for external odb helper scripts to pass blobs to Git by directly writing them as loose objects files. It is better to call this a "direct" mode rather than a "fault-in" mode as we could have the same kind of mechanism to "put" objects into an external odb, where the odb helper would access blobs it wants to send to an external odb directly from files, but it would be strange to call that a fault-in mode too. Signed-off-by: Christian Couder --- external-odb.c | 21 - external-odb.h | 1 + odb-helper.c | 27 +-- odb-helper.h | 1 + 4 files changed, 47 insertions(+), 3 deletions(-) diff --git a/external-odb.c b/external-odb.c index 52cb448d01..31d21bfe04 100644 --- a/external-odb.c +++ b/external-odb.c @@ -96,7 +96,8 @@ int external_odb_get_object(const unsigned char *sha1) int ret; int fd; - if (!odb_helper_has_object(o, sha1)) + if (!(o->supported_capabilities & ODB_HELPER_CAP_GET_RAW_OBJ) && + !(o->supported_capabilities & ODB_HELPER_CAP_GET_GIT_OBJ)) continue; fd = create_object_tmpfile(&tmpfile, path); @@ -122,6 +123,24 @@ int external_odb_get_object(const unsigned char *sha1) return -1; } +int external_odb_get_direct(const unsigned char *sha1) +{ + struct odb_helper *o; + + if (!external_odb_has_object(sha1)) + return -1; + + for (o = helpers; o; o = o->next) { + if (!(o->supported_capabilities & ODB_HELPER_CAP_GET_DIRECT)) + continue; + if (odb_helper_get_direct(o, sha1) < 0) + continue; + return 0; + } + + return -1; +} + int external_odb_put_object(const void *buf, size_t len, const char *type, unsigned char *sha1) { diff --git a/external-odb.h b/external-odb.h index 3e0e6d0165..247b131fd5 100644 --- a/external-odb.h +++ b/external-odb.h @@ -4,6 +4,7 @@ const char *external_odb_root(void); int external_odb_has_object(const unsigned char *sha1); int external_odb_get_object(const unsigned char *sha1); +int external_odb_get_direct(const unsigned char *sha1); int external_odb_put_object(const void *buf, size_t len, const char *type, unsigned char *sha1); diff --git a/odb-helper.c b/odb-helper.c index 0603993057..b1f5464214 100644 --- a/odb-helper.c +++ b/odb-helper.c @@ -405,14 +405,37 @@ static int odb_helper_get_git_object(struct odb_helper *o, return 0; } +int odb_helper_get_direct(struct odb_helper *o, + const unsigned char *sha1) +{ + struct odb_helper_object *obj; + struct odb_helper_cmd cmd; + + obj = odb_helper_lookup(o, sha1); + if (!obj) + return -1; + + if (odb_helper_start(o, &cmd, 0, "get_direct %s", sha1_to_hex(sha1)) < 0) + return -1; + + if (odb_helper_finish(o, &cmd)) + return -1; + + return 0; +} + int odb_helper_get_object(struct odb_helper *o, const unsigned char *sha1, int fd) { + if (o->supported_capabilities & ODB_HELPER_CAP_GET_GIT_OBJ) + return odb_helper_get_git_object(o, sha1, fd); if (o->supported_capabilities & ODB_HELPER_CAP_GET_RAW_OBJ) return odb_helper_get_raw_object(o, sha1, fd); - else - return odb_helper_get_git_object(o, sha1, fd); + if (o->supported_capabilities & ODB_HELPER_CAP_GET_DIRECT) + return 0; + + BUG("invalid get capability (capabilities: '%d')", o->supported_capabilities); } int odb_helper_put_object(struct odb_helper *o, diff --git a/odb-helper.h b/odb-helper.h index 318e0d48dc..f2fd2b7c9c 100644 --- a/odb-helper.h +++ b/odb-helper.h @@ -33,6 +33,7 @@ int odb_helper_init(struct odb_helper *o); int odb_helper_has_object(struct odb_helper *o, const unsigned char *sha1); int odb_helper_get_object(struct odb_helper *o, const unsigned char *sha1, int fd); +int odb_helper_get_direct(struct odb_helper *o, const unsigned char *sha1); int odb_helper_put_object(struct odb_helper *o, const void *buf, size_t len, const char *type, unsigned char *sha1); -- 2.14.0.rc1.52.gf02fb0ddac.dirty