Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-09 Thread Jeff King
On Thu, Jun 09, 2016 at 03:53:26PM +0700, Duy Nguyen wrote:

> > Yes. To me, this was always about punting large blobs from the clones.
> > Basically the way git-lfs and other tools work, but without munging your
> > history permanently.
> 
> Makes sense. If we keep all trees and commits locally, pack v4 still
> has a chance to rise!

Yeah, I don't think anything here precludes pack v4.

> > I don't know if Christian had other cases in mind (like the many-files
> > case, which I think is better served by something like narrow clones).
> 
> Although for git-gc or git-fsck, I guess we need special support
> anyway so they do not download large blobs unnecessarily. Not sure if
> git-gc can already do that now. All I remember is that git-repack can
> still be used to make a repo independent from odb alternates. We
> probably want to avoid that. git-fsck definitely should verify that
> large remote blobs are good without downloading them (a new "fsck"
> command to the external odb, maybe).

I think git-gc should work out of the box; you'd want to use "repack
-l", which git-gc passes already.

Fsck would be OK as long as you didn't actually load blobs. We have
--connectivity-only for that, but of course it isn't the default. You'd
probably want the default mode to fsck local blobs, but _not_ to fault
in external blobs (but have an option to fault them all in if you really
wanted to be sure you have everything).
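
For concreteness, both of those knobs exist today; a sketch of such a
local-only pass, where only the surrounding external-odb behavior is
hypothetical:

    # repack using only locally-available objects (git-gc passes -l)
    git repack -a -d -l
    # verify the object graph without loading blob contents
    git fsck --connectivity-only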

-Peff


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-09 Thread Duy Nguyen
On Wed, Jun 8, 2016 at 11:19 PM, Jeff King  wrote:
> On Wed, Jun 08, 2016 at 05:44:06PM +0700, Duy Nguyen wrote:
>
>> On Wed, Jun 8, 2016 at 3:23 AM, Jeff King  wrote:
>> > Because this "external odb" essentially acts as a git alternate, we
>> > would hit it only when we couldn't find an object through regular means.
>> > Git would then make the object available in the usual on-disk format
>> > (probably as a loose object).
>>
>> This means git-gc (and all things that do rev-list --objects --all)
>> would download at least all trees and commits? Or will we have special
>> treatment for those commands?
>
> Yes. To me, this was always about punting large blobs from the clones.
> Basically the way git-lfs and other tools work, but without munging your
> history permanently.

Makes sense. If we keep all trees and commits locally, pack v4 still
has a chance to rise!

> I don't know if Christian had other cases in mind (like the many-files
> case, which I think is better served by something like narrow clones).

Although for git-gc or git-fsck, I guess we need special support anyway
so they do not download large blobs unnecessarily. Not sure if git-gc
can already do that now. All I remember is that git-repack can still be
used to make a repo independent from odb alternates. We probably want
to avoid that. git-fsck definitely should verify that large remote
blobs are good without downloading them (a new "fsck" command to the
external odb, maybe).
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-08 Thread Jeff King
On Wed, Jun 08, 2016 at 11:05:20AM -0700, Junio C Hamano wrote:

> > Likewise, I'm not sure if "get" should be allowed to return contents
> > that don't match the sha1.
> 
> Yes, this is what I was getting at.  It would be ideal to come up
> with a way to do the large-blob offload without resorting to hacks
> (like LFS and annex where "the same object contents will always
> result in the same object name" is deliberately broken), and "object
> name must match what the data hashes down to" is a basic requirement
> for that.

I meant to elaborate here more, but it looks like I didn't.

One thing that an external odb command might want to be able to do is
say "I _do_ have that object, but it would be expensive or impossible to
get right now, so I will give you a placeholder" (e.g., you are just
trying to run "git log" while on an airplane, and you would not want to
die() because you cannot fetch some blob).

But the right way is not to have "get" send content that does not match
the requested sha1. It needs to make git aware that the object is a
placeholder, so git does not do stupid things like write the bogus
content into a loose object.

The right way may be as simple as the external odb returning a non-zero
exit code, and git fills in the placeholder data itself (or dies,
possibly, depending on what the user asks it to do).
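
A minimal sketch of that fallback, assuming a helper reachable as
$odb_cmd and a caller that prefers placeholder data over dying:

    # hypothetical: "get" exits non-zero for "known but unavailable"
    if ! "$odb_cmd" get "$sha1" >"$tmp"; then
        printf 'external object %s unavailable\n' "$sha1" >"$tmp"
    fi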

One of the reasons I worked up that initial external-odb patch was
because I knew that before we settled on a definite interface, we would
have to try it out and see what odd corner cases came up. E.g., when do
we fault in objects in a way that's annoying to the user? Which code
paths need to handle "we do have this object available, but you can't
see it right now, so what kind of fallback can we do?". Etc.

Unfortunately I never actually did any of that testing. ;)

-Peff


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-08 Thread Junio C Hamano
Jeff King  writes:

> This interface comes from my earlier patches, so I'll try to shed a
> little light on the decisions I made there.
>
> Because this "external odb" essentially acts as a git alternate, we
> would hit it only when we couldn't find an object through regular means.
> Git would then make the object available in the usual on-disk format
> (probably as a loose object).
>
> So in most processes, we would not need to consult the odb command at
> all. And when we do, the first thing would be to get its "have" list,
> which would at most run once per process.
>
> So the per-object cost is really calling "get", and my assumption there
> was that the cost of actually retrieving the object over the network
> would dwarf the fork/exec cost.

OK, presented that way, the design makes sense (I do not know if
Christian's (revised) design and implementation does or not, though,
as I haven't seen it).

As "check for non-existence" is important and costly, grabbing
"have" once is a good strategy, just like we open the .idx files of
available packfiles.

>> >   - " have": the command should output the sha1, size and
>> > type of all the objects the external ODB contains, one object per
>> > line.
>> 
>> Why are size and type needed by the clients at this point?  That is
>> more expensive to compute than just a bare list of object names.
>
> Yes, but it lets git avoid doing a lot of "get" operations.

OK, so it is more like having richer information in pack-v4 index ;-)

>> >   - " put   ": the command should then read
>> > from stdin an object and store it in the external ODB.
>> 
>> Is the ODB required to sanity check that <sha1> matches what the data
>> hashes down to?
>
> I think that would be up to the ODB, but it does seem like a good idea.
>
> Likewise, I'm not sure if "get" should be allowed to return contents
> that don't match the sha1.

Yes, this is what I was getting at.  It would be ideal to come up
with a way to do the large-blob offload without resorting to hacks
(like LFS and annex where "the same object contents will always
result in the same object name" is deliberately broken), and "object
name must match what the data hashes down to" is a basic requirement
for that.
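
That requirement is also cheap to verify, since an object's name is just
the SHA-1 of a type/size header followed by the content:

    printf 'hello\n' >blob.tmp
    { printf 'blob %d\0' "$(wc -c <blob.tmp)"; cat blob.tmp; } | sha1sum
    git hash-object blob.tmp   # same: ce013625030ba8dba906f756967f9e9ca394464a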

Thanks.


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-08 Thread Jeff King
On Wed, Jun 08, 2016 at 05:44:06PM +0700, Duy Nguyen wrote:

> On Wed, Jun 8, 2016 at 3:23 AM, Jeff King  wrote:
> > Because this "external odb" essentially acts as a git alternate, we
> > would hit it only when we couldn't find an object through regular means.
> > Git would then make the object available in the usual on-disk format
> > (probably as a loose object).
> 
> This means git-gc (and all things that do rev-list --objects --all)
> would download at least all trees and commits? Or will we have special
> treatment for those commands?

Yes. To me, this was always about punting large blobs from the clones.
Basically the way git-lfs and other tools work, but without munging your
history permanently.

I don't know if Christian had other cases in mind (like the many-files
case, which I think is better served by something like narrow clones).

-Peff


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-08 Thread Duy Nguyen
On Wed, Jun 8, 2016 at 3:23 AM, Jeff King  wrote:
> Because this "external odb" essentially acts as a git alternate, we
> would hit it only when we couldn't find an object through regular means.
> Git would then make the object available in the usual on-disk format
> (probably as a loose object).

This means git-gc (and all things that do rev-list --objects --all)
would download at least all trees and commits? Or will we have special
treatment for those commands?
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Jeff King
On Tue, Jun 07, 2016 at 03:19:46PM +0200, Christian Couder wrote:

> >  But there are lots of cases where the server might want to tell
> >  the client things that don't involve bundles at all.
> 
> The idea is also that anytime the server needs to send external ODB
> data to the client, it would ask its own external ODB to prepare a
> kind of bundle with that data and use the bundle v3 mechanism to send
> it.
> That may need the bundle v3 mechanism to be extended, but I don't see
> in which cases it would not work.

Ah, I see we do not have the same underlying mental model.

I think the external odb is purely the _client's_ business. The server
does not have to have an external odb at all, and does not need to know
about the client's. The client is responsible for telling the server
during the git protocol anything it would need to know (like "do not
bother sending objects over 50MB; I can get them elsewhere").

This makes the problem much more complicated, but it is more flexible
and decentralized.

> >a. The receiving side of a connection (e.g., a fetch client)
> >   somehow has out-of-band access to some objects. How does it
> >   tell the other side "do not bother sending me these objects; I
> >   can get them in another way"?
> 
> I don't see a difference with regular objects that the fetch client
> already has. If it already has some regular objects, a way to tell the
> server "don't bother sending me these objects" is useful already and
> it should be possible to use it to tell the server that there is no
> need to send some objects stored in the external ODB too.

The way to do that with normal objects is by finding shared commit tips,
and assuming the normal git repository property of "if you have X, you
have all of the objects reachable from X".

This whole idea is essentially creating "holes" in that property. You
can enumerate all of the holes, but I am not sure that scales well. We
get a lot of efficiency by communicating only ref tips during the
negotiation, and not individual object names.

> Also something like this is needed for shallow clones and narrow
> clones anyway.

Yes, and I don't think it scales well there, either. A single shallow
cutoff works OK. But if you repeatedly shallow-fetch into a repository,
you end up with a patchwork of disconnected "islands" of history. The
CPU required on the server side to serve those fetch requests is much
greater than what would normally be needed. You can't use things like
reachability bitmaps, and you have to open up the trees for each island
to see which objects the other side actually has.

> >b. The receiving side of a connection has out-of-band access to
> >   some objects. Some of these will be expensive to get (e.g.,
> >   requiring a large download), and some may be fast (e.g.,
> >   they've already been fetched to a local cache). How do we tell
> >   the sending side not to assume we have cheap access to these
> >   objects (e.g., for use as a delta base)?
> 
> I don't think we need to tell the sending side whether we have cheap
> access to some objects or not.
> If the objects are managed by the external ODB, it's the external ODB
> on the server and on the client that will manage these objects. They
> should not be used as delta bases.
> Perhaps there is no mechanism to say that some objects (basically all
> external ODB managed objects) should not be used as delta bases, but
> that could be added.

Yes, I agree that _if_ the server can access the list of objects
available in the external odb, this becomes much easier. I'm just not
convinced that level of coupling is a good idea.

Note that the server would also want to take this into account during
repacking, as otherwise you end up with fetches that are very expensive
to serve (you want to send X which is a delta based on Y, but you know
that Y is available via the external odb, and therefore should not be
used as a base. So you have to throw out the delta for X and either send
it whole or compute a new one. That's much more expensive than blitting
the delta from disk, which is what a normal clone would do).

-Peff


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Jeff King
On Tue, Jun 07, 2016 at 12:23:40PM -0700, Junio C Hamano wrote:

> Christian Couder  writes:
> 
> > Git can store its objects only in the form of loose objects in
> > separate files or packed objects in a pack file.
> > To be able to better handle some kinds of objects, for example big
> > blobs, it would be nice if Git could store its objects in other object
> > databases (ODB).
> >
> > To do that, this patch series makes it possible to register commands,
> > using "odb..command" config variables, to access external
> > ODBs. Each specified command will then be called the following ways:
> 
> Hopefully it is done via a cheap RPC instead of forking/execing the
> command for each and every object lookup.

This interface comes from my earlier patches, so I'll try to shed a
little light on the decisions I made there.

Because this "external odb" essentially acts as a git alternate, we
would hit it only when we couldn't find an object through regular means.
Git would then make the object available in the usual on-disk format
(probably as a loose object).

So in most processes, we would not need to consult the odb command at
all. And when we do, the first thing would be to get its "have" list,
which would at most run once per process.

So the per-object cost is really calling "get", and my assumption there
was that the cost of actually retrieving the object over the network
would dwarf the fork/exec cost.

I also waffled on having git cache the output of "<command> have" in
some fast-lookup format to save even the single fork/exec. But I figured
that was something that could be added later if needed.

You'll note that this is sort of a "fault-in" model. Another model would
be to treat external odb updates similar to fetches. I.e., we touch the
network only during a special update operation, and then try to work
locally with whatever the external odb has. IMHO this policy could
actually be up to the external odb itself (i.e., its "have" command
could serve from a local cache if it likes).
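
For example, the helper could implement that policy by refreshing a
cached manifest only occasionally; a sketch, in which ODB_URL and the
cache location are made up:

    # a "have" that touches the network at most once a day
    manifest=$HOME/.cache/my-odb/manifest
    if [ -z "$(find "$manifest" -mmin -1440 2>/dev/null)" ]; then
        mkdir -p "${manifest%/*}"
        curl -fsS "$ODB_URL/manifest" >"$manifest"
    fi
    cat "$manifest"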

> >   - " have": the command should output the sha1, size and
> > type of all the objects the external ODB contains, one object per
> > line.
> 
> Why are size and type needed by the clients at this point?  That is
> more expensive to compute than just a bare list of object names.

Yes, but it lets git avoid doing a lot of "get" operations. For example,
in a regular diff without binary-diffs enabled, we can automatically
determine that a diff will be considered binary based purely on the size
of the objects (related to core.bigfilethreshold). So if we know the
sizes, we can run "git log -p" without faulting-in each of the objects
just to say "woah, that looks binary".

One can accomplish this with .gitattributes, too, of course, but the
size thing just works out of the box.
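
The threshold itself is an existing knob; only the skipped fault-in is
the hypothetical external-odb part:

    git config core.bigFileThreshold 512m   # 512m is the default
    git log -p   # oversized blobs show as "Binary files ... differ"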

There are other places where it will come in handy, too. E.g., fscking a
tree object you have, you want to make sure that the object referred to
with mode 100644 is actually a blob.

I also don't think the cost to compute size and type on the server is
all that important. Yes, if you're backing your external odb with a git
repository that runs "git cat-file" on the fly, it is more expensive.
But in practice, I'd expect the server side to create a static manifest
and serve it over HTTP (this also gives the benefit of things like
ETags).
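
For instance, the manifest could simply be the "have" output, one
"<sha1> <size> <type>" per line, served as a static file; a client could
then revalidate it cheaply (curl's ETag options shown as one
possibility, with ODB_URL made up):

    curl -fsS --etag-save odb.etag --etag-compare odb.etag \
         "$ODB_URL/manifest" -o manifest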

> >   - " get ": the command should then read from the
> > external ODB the content of the object corresponding to  and
> > output it on stdout.
> 
> The type and size should be given at this point.

I don't think there's a reason not to; I didn't here because it would be
redundant with what Git already knows from the "have" manifest it
receives above.

> >   - " put   ": the command should then read
> > from stdin an object and store it in the external ODB.
> 
> Is the ODB required to sanity check that <sha1> matches what the data
> hashes down to?

I think that would be up to the ODB, but it does seem like a good idea.

Likewise, I'm not sure if "get" should be allowed to return contents
that don't match the sha1. That would be fine for things like "diff",
but would probably make "fsck" unhappy.

> If this thing is primarily to offload large blobs, you might also
> want not "get" but "checkout  " to bypass Git entirely,
> but I haven't thought it through.

My mental model is that the external odb gets the object into the local
odb, and then you can use the regular streaming-checkout code path. And
the local odb serves as your cache.

That does mean you might have two copies of each object (one in the odb,
and one in the working tree), as opposed to a true cacheless system,
which can get away with one.

I think you could do that cacheless thing with the interface here,
though; the "get" operation can stream, and you can stream directly to
the working tree.
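
In shell terms that cacheless variant is just a redirection; nothing
here buffers the whole object in memory (helper name and path assumed):

    my-odb get "$sha1" >"path/in/worktree"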

-Peff

Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Junio C Hamano
Christian Couder  writes:

> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).
>
> To do that, this patch series makes it possible to register commands,
> using "odb..command" config variables, to access external
> ODBs. Each specified command will then be called the following ways:

Hopefully it is done via a cheap RPC instead of forking/execing the
command for each and every object lookup.

>   - " have": the command should output the sha1, size and
> type of all the objects the external ODB contains, one object per
> line.

Why are size and type needed by the clients at this point?  That is
more expensive to compute than just a bare list of object names.

>   - " get ": the command should then read from the
> external ODB the content of the object corresponding to  and
> output it on stdout.

The type and size should be given at this point.

>   - " put   ": the command should then read
> from stdin an object and store it in the external ODB.

Is the ODB required to sanity check that <sha1> matches what the data
hashes down to?
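
One cheap way an ODB could implement that sanity check, reusing git
itself (a sketch; $type, $sha1 and $tmp would come from the "put"
arguments and the buffered input):

    test "$(git hash-object -t "$type" --stdin <"$tmp")" = "$sha1" || exit 1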

If this thing is primarily to offload large blobs, you might also
want not "get" but "checkout  " to bypass Git entirely,
but I haven't thought it through.


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Christian Couder
On Wed, Jun 1, 2016 at 3:37 PM, Duy Nguyen  wrote:
> On Tue, May 31, 2016 at 8:18 PM, Christian Couder
>  wrote:
>>>> I wonder if this mechanism could also be used or extended to clone and
>>>> fetch an alternate object database.
>>>>
>>>> In [1], [2] and [3], and this was also discussed during the
>>>> Contributor Summit last month, Peff says that he started working on
>>>> alternate object database support a long time ago, and that the hard
>>>> part is a protocol extension to tell remotes that you can access some
>>>> objects in a different way.
>>>>
>>>> If a Git client would download a "$name.bndl" v3 bundle file that
>>>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>>>> client would just need to download "$URL/alt-odb-$name.odb" and use
>>>> the alternate object database support on this file.
>>>
>>> What does this file contain exactly? A list of SHA-1 that can be
>>> retrieved from this remote/alternate odb?
>>
>> It would depend on the external odb. Git could support different
>> external odbs that have different trade-offs.
>>
>>> I wonder if we could just use
>>> git-replace for this marking. The replaced content could contain the
>>> uri pointing to the alt odb.
>>
>> Yeah, interesting!
>> That's indeed another possibility that might not need the transfer of
>> any external odb.
>>
>> But in this case it might be cleaner to just have a separate ref hierarchy 
>> like:
>>
>> refs/external-odbs/my-ext-odb/
>>
>> instead of using the replace one.
>>
>> Or maybe:
>>
>> refs/replace/external-odbs/my-ext-odb/
>>
>> if we really want to use the replace hierarchy.
>
> Yep, the replace hierarchy crossed my mind. But then I thought about
> performance degradation when there is more than one pack (we have to
> search through them all for every SHA-1) and discarded it because we
> would need to do the same linear search here. I guess we will most
> likely have one or two name spaces so it probably won't matter.

Yeah.

>>> We could optionally contact alt odb to
>>> retrieve real content, or just show the replaced/fake data when alt
>>> odb is out of reach.
>>
>> Yeah, I wonder if that really needs the replace mechanism.
>
> The replace mechanism provides a good hook point. But it really depends
> on how invasive this remote odb is. If fake content is enough to avoid
> breakages up high, git-replace is enough. If you really need to pass
> remote odb info up so higher levels can do something more fancy, then
> it's insufficient.
>
>> By the way this makes me wonder if we could implement resumable clone
>> using some kind of replace ref.
>>
>> The client, while cloning nearly as usual, would download one or more
>> special replace refs that would point to objects with links to
>> download bundles using standard protocols.
>> Just after the clone, the client would read these objects and download
>> the bundles from these objects.
>> And then it would clone from these bundles.
>
> I thought we had settled on resumable clone, just waiting for an
> implementation :) Doing it your way, you would need to download these
> special objects too (in a pack?) and come back to download some more
> bundles. It would be more efficient to show the bundle uri early and
> go download the bundle on the side while you go on to get the
> additional/smaller pack that contains the rest.

Yeah, something like the bundle v3 mechanism is probably more efficient.

Thanks,
Christian.


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Christian Couder
On Wed, Jun 1, 2016 at 12:31 AM, Jeff King  wrote:
> On Fri, May 20, 2016 at 02:39:06PM +0200, Christian Couder wrote:
>
>> I wonder if this mechanism could also be used or extended to clone and
>> fetch an alternate object database.
>>
>> In [1], [2] and [3], and this was also discussed during the
>> Contributor Summit last month, Peff says that he started working on
>> alternate object database support a long time ago, and that the hard
>> part is a protocol extension to tell remotes that you can access some
>> objects in a different way.
>>
>> If a Git client would download a "$name.bndl" v3 bundle file that
>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>> client would just need to download "$URL/alt-odb-$name.odb" and use
>> the alternate object database support on this file.
>>
>> This way it would know all it has to know to access the objects in the
>> alternate database. The alternate object database may not contain the
>> real objects, if they are too big for example, but just files that
>> describe how to get the real objects.
>
> I'm not sure about this strategy.

I am also not sure that this is the best strategy, but I think it's
worth discussing.

> I see two complications:
>
>   1. I don't think bundles need to be a part of this "external odb"
>  strategy at all. If I understand correctly, I think you want to use
>  it as a place to stuff metadata that the server tells the client,
>  like "by the way, go here if you want another way to access some
>  objects".

Yeah, basically I think it might be possible to use the bundle
mechanism to transfer what an external ODB on the client would need to
be initialized or updated.

>  But there are lots of cases where the server might want to tell
>  the client things that don't involve bundles at all.

The idea is also that anytime the server needs to send external ODB
data to the client, it would ask its own external ODB to prepare a
kind of bundle with that data and use the bundle v3 mechanism to send
it.
That may need the bundle v3 mechanism to be extended, but I don't see
in which cases it would not work.

>   2. A server pointing the client to another object store is actually
>  the least interesting bit of the protocol.
>
>  The more interesting cases (to me) are:
>
>a. The receiving side of a connection (e.g., a fetch client)
>   somehow has out-of-band access to some objects. How does it
>   tell the other side "do not bother sending me these objects; I
>   can get them in another way"?

I don't see a difference with regular objects that the fetch client
already has. If it already has some regular objects, a way to tell the
server "don't bother sending me these objects" is useful already and
it should be possible to use it to tell the server that there is no
need to send some objects stored in the external ODB too.

Also something like this is needed for shallow clones and narrow clones anyway.

>b. The receiving side of a connection has out-of-band access to
>   some objects. Some of these will be expensive to get (e.g.,
>   requiring a large download), and some may be fast (e.g.,
>   they've already been fetched to a local cache). How do we tell
>   the sending side not to assume we have cheap access to these
>   objects (e.g., for use as a delta base)?

I don't think we need to tell the sending side whether we have cheap
access to some objects or not.
If the objects are managed by the external ODB, it's the external ODB
on the server and on the client that will manage these objects. They
should not be used as delta bases.
Perhaps there is no mechanism to say that some objects (basically all
external ODB managed objects) should not be used as delta bases, but
that could be added.

Thanks,
Christian.


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Duy Nguyen
On Tue, Jun 7, 2016 at 3:46 PM, Christian Couder
 wrote:
>> Any thought on object streaming support?
>
> No, I didn't think about this. In fact I am not sure what this means.
>
>> It could be a big deal (might
>> affect some design decisions).
>
> Could you elaborate on this?

The object streaming API is in streaming.h. Normally objects are small
and we can inflate the whole thing in memory before doing anything with
them. For really large objects (which I guess is one of the reasons for
a remote odb) we don't want to do that. It takes lots of memory, and you
could have objects larger than your physical memory. In some cases we
can ignore those objects (e.g. mark them binary and choose not to
diff). In other cases (e.g. checkout), we use the streaming interface
to process an object while we're inflating it, to keep memory usage
down. It's easy to add a new streaming backend, once you settle on how
the remote odb streams stuff.

>> I would also think about how pack v4
>> fits in this (e.g. how a tree walker can still walk fast, a big
>> promise of pack v4; I suppose if you still maintain "pack" concept
>> over external odb then it might work). Not that it really matters.
>> Pack v4 is the future, but the future can never be "today" :)
>
> Sorry I haven't really followed pack v4 and I forgot what it is about.

It's a new pack format (and practically vaporware at this point) that
promises much faster access when you need to walk through trees and
commits (think rev-list --objects --all, or git-blame). Because we are
(or I am) still not sure if pack v4 will ever get to the state where
it can be merged to git.git, I think it's ok for you to ignore it too
if you want. You can read more about the format here [1] and go even
further back to [2] when Nicolas teased us with the pack size
(smaller, which is a nice side effect). The potential issue with pack
v4 is that the tree walker (struct tree_desc and related funcs in
tree-walk.h) needs to know about pack v4 in order to walk fast. The
current tree walker does not care if (or in what format) an object is
packed at all. A remote odb for pack v4 must have some way that allows
reading pack data directly, something close to "mmap"; it's not just
about an API to "get me the canonical content of this object".

[1] http://article.gmane.org/gmane.comp.version-control.git/234012
[2] http://article.gmane.org/gmane.comp.version-control.git/233038
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Mike Hommey
On Tue, Jun 07, 2016 at 10:46:07AM +0200, Christian Couder wrote:
> The high level overview of the patch series I would like to send
> really soon now could go like this:
> 
> ---
> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
> To be able to better handle some kinds of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).
> 
> To do that, this patch series makes it possible to register commands,
> using "odb..command" config variables, to access external
> ODBs. Each specified command will then be called the following ways:
> 
>   - " have": the command should output the sha1, size and
> type of all the objects the external ODB contains, one object per
> line.
>   - " get ": the command should then read from the
> external ODB the content of the object corresponding to  and
> output it on stdout.
>   - " put   ": the command should then read
> from stdin an object and store it in the external ODB.

(disclaimer: I didn't look at the patch series)

Does this mean you're going to fork/exec() a new <command> for each of
these? It would probably be better if it was "batched", where the
executable is invoked once and the commands are passed to its stdin.
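
A batched variant would be a small change; a sketch of the loop, with
ODB_DIR and the line-oriented request format made up (response framing,
e.g. length-prefixing so the reader knows where a "get" payload ends, is
hand-waved here):

    #!/bin/sh
    # hypothetical batched mode: one request per line on stdin
    while read -r cmd sha1 rest; do
        case "$cmd" in
        have) cat "$ODB_DIR/manifest" ;;
        get)  cat "$ODB_DIR/$sha1" ;;
        *)    echo "unknown command: $cmd" >&2 ;;
        esac
    done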

Mike


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-07 Thread Christian Couder
On Wed, Jun 1, 2016 at 4:00 PM, Duy Nguyen  wrote:
> On Tue, May 31, 2016 at 8:18 PM, Christian Couder
>  wrote:
>>>> [3]
>>>> http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>>>
>>> This points to  https://github.com/peff/git/commits/jk/external-odb
>>> which is dead. Jeff, do you still have it somewhere, or is it not
>>> worth looking at anymore?
>>
>> I have rebased, fixed and improved it a bit. I added write support for
>> blobs. But the result is not very clean right now.
>> I was going to send a RFC patch series after cleaning the result, but
>> as you ask, here are some links to some branches:
>>
>> - https://github.com/chriscool/git/commits/gl-external-odb3 (the
>> updated patches from Peff, plus 2 small patches from me)
>> - https://github.com/chriscool/git/commits/gl-external-odb7 (the same
>> as above, plus a number of WIP patches to add blob write support)
>
> Thanks. I had a super quick look. It would be nice if you could give a
> high level overview on this (if you're going to spend a lot more time on it).

Sorry about the late answer.

Here is a new series after some cleanup:

https://github.com/chriscool/git/commits/gl-external-odb12

The high level overview of the patch series I would like to send
really soon now could go like this:

---
Git can store its objects only in the form of loose objects in
separate files or packed objects in a pack file.
To be able to better handle some kinds of objects, for example big
blobs, it would be nice if Git could store its objects in other object
databases (ODB).

To do that, this patch series makes it possible to register commands,
using "odb..command" config variables, to access external
ODBs. Each specified command will then be called the following ways:

  - " have": the command should output the sha1, size and
type of all the objects the external ODB contains, one object per
line.
  - " get ": the command should then read from the
external ODB the content of the object corresponding to  and
output it on stdout.
  - " put   ": the command should then read
from stdin an object and store it in the external ODB.

This RFC patch series does not address the following important parts
of a complete solution:

  - There is no way to transfer external ODB content using Git.
  - No real external ODB has been interfaced with Git. The tests use
another git repo in a separate directory for this purpose, which is
probably useless in the real world.
---
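
To make the interface concrete, here is a toy, directory-backed helper
speaking the three verbs above; everything about it (its name, layout
and manifest file) is invented for illustration:

    #!/bin/sh
    # toy-odb: trivial external ODB backed by a plain directory
    ODB_DIR=${ODB_DIR:-/var/tmp/toy-odb}
    mkdir -p "$ODB_DIR"
    case "$1" in
    have)  # one "<sha1> <size> <type>" line per stored object
        cat "$ODB_DIR/manifest" 2>/dev/null ;;
    get)   # stream the content of <sha1> to stdout
        cat "$ODB_DIR/$2" ;;
    put)   # read content from stdin, record it under <sha1>
        cat >"$ODB_DIR/$2"
        printf '%s %s %s\n' "$2" "$3" "$4" >>"$ODB_DIR/manifest" ;;
    *)
        echo "usage: toy-odb have|get <sha1>|put <sha1> <size> <type>" >&2
        exit 1 ;;
    esac

Registering it would then presumably be a one-liner like
"git config odb.toy.command /path/to/toy-odb".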

> One random thought, maybe it's better to have a daemon for external
> odb right from the start (one for all odbs, or one per odb, I don't
> know). It could do fancy stuff like object caching if necessary, and
> it can avoid a high-cost handshake (e.g. via TLS) every time a git
> process runs and gets one object. Reducing process spawns would
> definitely receive a big cheer from the Windows crowd.

The caching could be done inside Git and I am not sure it's worth
optimizing this now.
It could also make it more difficult to write support for an external
ODB if we required a daemon.
Maybe later we can add support for "odb.<name>.daemon" if we think
that this is worth it.

> Any thought on object streaming support?

No, I didn't think about this. In fact I am not sure what this means.

> It could be a big deal (might
> affect some design decisions).

Could you elaborate on this?

> I would also think about how pack v4
> fits in this (e.g. how a tree walker can still walk fast, a big
> promise of pack v4; I suppose if you still maintain "pack" concept
> over external odb then it might work). Not that it really matters.
> Pack v4 is the future, but the future can never be "today" :)

Sorry I haven't really followed pack v4 and I forgot what it is about.

Thanks,
Christian.


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-01 Thread Duy Nguyen
On Tue, May 31, 2016 at 8:18 PM, Christian Couder
 wrote:
>>> [3] 
>>> http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>>
>> This points to  https://github.com/peff/git/commits/jk/external-odb
>> which is dead. Jeff, do you still have it somewhere, or is it not
>> worth looking at anymore?
>
> I have rebased, fixed and improved it a bit. I added write support for
> blobs. But the result is not very clean right now.
> I was going to send an RFC patch series after cleaning the result, but
> as you ask, here are some links to some branches:
>
> - https://github.com/chriscool/git/commits/gl-external-odb3 (the
> updated patches from Peff, plus 2 small patches from me)
> - https://github.com/chriscool/git/commits/gl-external-odb7 (the same
> as above, plus a number of WIP patches to add blob write support)

Thanks. I had a super quick look. It would be nice if you could give a
high level overview on this (if you're going to spend a lot more time on it).

One random thought, maybe it's better to have a daemon for external
odb right from the start (one for all odbs, or one per odb, I don't
know). It could do fancy stuff like object caching if necessary, and
it can avoid a high-cost handshake (e.g. via TLS) every time a git
process runs and gets one object. Reducing process spawns would
definitely receive a big cheer from the Windows crowd.

Any thought on object streaming support? It could be a big deal (might
affect some design decisions). I would also think about how pack v4
fits in this (e.g. how a tree walker can still walk fast, a big
promise of pack v4; I suppose if you still maintain "pack" concept
over external odb then it might work). Not that it really matters.
Pack v4 is the future, but the future can never be "today" :)
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-06-01 Thread Duy Nguyen
On Tue, May 31, 2016 at 8:18 PM, Christian Couder
 wrote:
>>> I wonder if this mechanism could also be used or extended to clone and
>>> fetch an alternate object database.
>>>
>>> In [1], [2] and [3], and this was also discussed during the
>>> Contributor Summit last month, Peff says that he started working on
>>> alternate object database support a long time ago, and that the hard
>>> part is a protocol extension to tell remotes that you can access some
>>> objects in a different way.
>>>
>>> If a Git client would download a "$name.bndl" v3 bundle file that
>>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>>> client would just need to download "$URL/alt-odb-$name.odb" and use
>>> the alternate object database support on this file.
>>
>> What does this file contain exactly? A list of SHA-1 that can be
>> retrieved from this remote/alternate odb?
>
> It would depend on the external odb. Git could support different
> external odbs that have different trade-offs.
>
>> I wonder if we could just use
>> git-replace for this marking. The replaced content could contain the
>> uri pointing to the alt odb.
>
> Yeah, interesting!
> That's indeed another possibility that might not need the transfer of
> any external odb.
>
> But in this case it might be cleaner to just have a separate ref hierarchy 
> like:
>
> refs/external-odbs/my-ext-odb/
>
> instead of using the replace one.
>
> Or maybe:
>
> refs/replace/external-odbs/my-ext-odb/
>
> if we really want to use the replace hierarchy.

Yep, the replace hierarchy crossed my mind. But then I thought about
performance degradation when there is more than one pack (we have to
search through them all for every SHA-1) and discarded it because we
would need to do the same linear search here. I guess we will most
likely have one or two name spaces so it probably won't matter.

>> We could optionally contact alt odb to
>> retrieve real content, or just show the replaced/fake data when alt
>> odb is out of reach.
>
> Yeah, I wonder if that really needs the replace mechanism.

The replace mechanism provides a good hook point. But it really depends
on how invasive this remote odb is. If fake content is enough to avoid
breakages up high, git-replace is enough. If you really need to pass
remote odb info up so higher levels can do something more fancy, then
it's insufficient.

> By the way this makes me wonder if we could implement resumable clone
> using some kind of replace ref.
>
> The client, while cloning nearly as usual, would download one or more
> special replace refs that would point to objects with links to
> download bundles using standard protocols.
> Just after the clone, the client would read these objects and download
> the bundles from these objects.
> And then it would clone from these bundles.

I thought we had settled on resumable clone, just waiting for an
implementation :) Doing it your way, you would need to download these
special objects too (in a pack?) and come back to download some more
bundles. It would be more efficient to show the bundle uri early and
go download the bundle on the side while you go on to get the
additional/smaller pack that contains the rest.
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-05-31 Thread Jeff King
On Fri, May 20, 2016 at 02:39:06PM +0200, Christian Couder wrote:

> I wonder if this mechanism could also be used or extended to clone and
> fetch an alternate object database.
> 
> In [1], [2] and [3], and this was also discussed during the
> Contributor Summit last month, Peff says that he started working on
> alternate object database support a long time ago, and that the hard
> part is a protocol extension to tell remotes that you can access some
> objects in a different way.
> 
> If a Git client would download a "$name.bndl" v3 bundle file that
> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
> client would just need to download "$URL/alt-odb-$name.odb" and use
> the alternate object database support on this file.
> 
> This way it would know all it has to know to access the objects in the
> alternate database. The alternate object database may not contain the
> real objects, if they are too big for example, but just files that
> describe how to get the real objects.

I'm not sure about this strategy. I see two complications:

  1. I don't think bundles need to be a part of this "external odb"
 strategy at all. If I understand correctly, I think you want to use
 it as a place to stuff metadata that the server tells the client,
 like "by the way, go here if you want another way to access some
 objects".

 But there are lots of cases where the server might want to tell
  the client things that don't involve bundles at all.

  2. A server pointing the client to another object store is actually
 the least interesting bit of the protocol.

 The more interesting cases (to me) are:

   a. The receiving side of a connection (e.g., a fetch client)
  somehow has out-of-band access to some objects. How does it
  tell the other side "do not bother sending me these objects; I
  can get them in another way"?

   b. The receiving side of a connection has out-of-band access to
  some objects. Some of these will be expensive to get (e.g.,
  requiring a large download), and some may be fast (e.g.,
  they've already been fetched to a local cache). How do we tell
  the sending side not to assume we have cheap access to these
  objects (e.g., for use as a delta base)?

So I don't think you want to tie this into bundles due to (1), and I
think that bundles would be insufficient anyway because of (2).

Or maybe I'm misunderstanding what you propose.

-Peff


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-05-31 Thread Jeff King
On Tue, May 31, 2016 at 07:43:27PM +0700, Duy Nguyen wrote:

> > [3] 
> > http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
> 
> This points to  https://github.com/peff/git/commits/jk/external-odb
> which is dead. Jeff, do you still have it somewhere, or is it not
> worth looking at anymore?

It's now "jk/external-odb-wip" at the same repo. I wouldn't be surprised
if it doesn't even compile, though. I basically rebase my topics daily
against Junio's "master", so it may be carried forward, but things
marked "-wip" aren't part of my daily git build, and generally don't
even get compile-tested (usually if the rebase looks too hairy or awful,
I'll drop it completely, though, and I haven't done that here).

You're probably better off looking at whatever Christian produces. :)

-Peff


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-05-31 Thread Christian Couder
On Tue, May 31, 2016 at 2:43 PM, Duy Nguyen  wrote:
> On Fri, May 20, 2016 at 7:39 PM, Christian Couder
>  wrote:
>> I am responding to this 2+ month old email because I am investigating
>> adding an alternate object store at the same level as loose and packed
>> objects. This alternate object store could be used for large files. I
> am working on this for GitLab. (Yeah, I am working, as a freelancer,
>> for both Booking.com and GitLab these days.)
>
> I'm also interested in this from a different angle, narrow clone that
> potentially allows skipping the download of some large blobs (likely
> old ones from the past that nobody will bother with).

Interesting!

[...]

>> I wonder if this mechanism could also be used or extended to clone and
>> fetch an alternate object database.
>>
>> In [1], [2] and [3], and this was also discussed during the
>> Contributor Summit last month, Peff says that he started working on
>> alternate object database support a long time ago, and that the hard
>> part is a protocol extension to tell remotes that you can access some
>> objects in a different way.
>>
>> If a Git client would download a "$name.bndl" v3 bundle file that
>> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
>> client would just need to download "$URL/alt-odb-$name.odb" and use
>> the alternate object database support on this file.
>
> What does this file contain exactly? A list of SHA-1 that can be
> retrieved from this remote/alternate odb?

It would depend on the external odb. Git could support different
external odbs that have different trade-offs.

> I wonder if we could just use
> git-replace for this marking. The replaced content could contain the
> uri pointing to the alt odb.

Yeah, interesting!
That's indeed another possibility that might not need the transfer of
any external odb.

But in this case it might be cleaner to just have a separate ref hierarchy like:

refs/external-odbs/my-ext-odb/

instead of using the replace one.

Or maybe:

refs/replace/external-odbs/my-ext-odb/

if we really want to use the replace hierarchy.

> We could optionally contact alt odb to
> retrieve real content, or just show the replaced/fake data when alt
> odb is out of reach.

Yeah, I wonder if that really needs the replace mechanism.

> Transferring git-replace is basically ref
> exchange, which may be fine if you don't have a lot of objects in this
> alt odb.

Yeah sure, great idea!

By the way this makes me wonder if we could implement resumable clone
using some kind of replace ref.

The client, while cloning nearly as usual, would download one or more
special replace refs that would point to objects with links to
download bundles using standard protocols.
Just after the clone, the client would read these objects and download
the bundles from these objects.
And then it would clone from these bundles.

> If you do, well, we need to deal with lots of refs anyway.
> This may benefit from it too.
>
>> [3] 
>> http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020
>
> This points to  https://github.com/peff/git/commits/jk/external-odb
> which is dead. Jeff, do you still have it somewhere, or is it not
> worth looking at anymore?

I have rebased, fixed and improved it a bit. I added write support for
blobs. But the result is not very clean right now.
I was going to send an RFC patch series after cleaning the result, but
as you ask, here are some links to some branches:

- https://github.com/chriscool/git/commits/gl-external-odb3 (the
updated patches from Peff, plus 2 small patches from me)
- https://github.com/chriscool/git/commits/gl-external-odb7 (the same
as above, plus a number of WIP patches to add blob write support)

Thanks,
Christian.


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-05-31 Thread Duy Nguyen
On Fri, May 20, 2016 at 7:39 PM, Christian Couder
 wrote:
> I am responding to this 2+ month old email because I am investigating
> adding an alternate object store at the same level as loose and packed
> objects. This alternate object store could be used for large files. I
> am working on this for GitLab. (Yeah, I am working, as a freelance,
> for both Booking.com and GitLab these days.)

I'm also interested in this from a different angle, narrow clone that
potentially allows skipping the download of some large blobs (likely old
ones from the past that nobody will bother with).

> On Wed, Mar 2, 2016 at 9:32 PM, Junio C Hamano  wrote:
>> The bundle v3 format introduces an ability to have the bundle header
>> (which describes what references in the bundled history can be
>> fetched, and what objects the receiving repository must have in
>> order to unbundle it successfully) in one file, and the bundled pack
>> stream data in a separate file.
>>
>> A v3 bundle file begins with a line with "# v3 git bundle", followed
>> by zero or more "extended header" lines, and an empty line, finally
>> followed by the list of prerequisites and references in the same
>> format as v2 bundle.  If it uses the "split bundle" feature, there
>> is a "data: $URL" extended header line, and nothing follows the list
>> of prerequisites and references.  Also, "sha1: " extended header
>> line may exist to help validating that the pack stream data matches
>> the bundle header.
>>
>> A typical expected use of a split bundle is to help initial clone
>> that involves a huge data transfer, and would go like this:
>>
>>  - Any repository people would clone and fetch from would regularly
>>    be repacked, and it is expected that there would be a packfile
>>    without prerequisites that holds all (or at least most) of the
>>    history of it (call it pack-$name.pack).
>>
>>  - After arranging that packfile to be downloadable over popular
>>    transfer methods used for serving static files (such as HTTP or
>>    HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
>>    bundle file (call it $name.bndl) can be prepared with an extended
>>    header "data: $URL/pack-$name.pack" to point at the download
>>    location for the packfile, and be served at "$URL/$name.bndl".
>>
>>  - An updated Git client, when trying to "git clone" from such a
>>    repository, may be redirected to "$URL/$name.bndl", which would be
>>    a tiny text file (when the split bundle feature is used).
>>
>>  - The client would then inspect the downloaded $name.bndl, learn
>>    that the corresponding packfile exists at $URL/pack-$name.pack,
>>    and download it as pack-$name.pack, until the download succeeds.
>>    This can easily be done with a "wget --continue" equivalent over an
>>    unreliable link.  The checksum recorded on the "sha1: " header
>>    line is expected to be used by this downloader (not written yet).
>
> I wonder if this mechanism could also be used or extended to clone and
> fetch an alternate object database.
>
> In [1], [2] and [3], and this was also discussed during the
> Contributor Summit last month, Peff says that he started working on
> alternate object database support a long time ago, and that the hard
> part is a protocol extension to tell remotes that you can access some
> objects in a different way.
>
> If a Git client would download a "$name.bndl" v3 bundle file that
> would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
> client would just need to download "$URL/alt-odb-$name.odb" and use
> the alternate object database support on this file.

What does this file contain exactly? A list of SHA-1 that can be
retrieved from this remote/alternate odb? I wonder if we could just use
git-replace for this marking. The replaced content could contain the
uri pointing to the alt odb. We could optionally contact alt odb to
retrieve real content, or just show the replaced/fake data when alt
odb is out of reach. Transferring git-replace is basically ref
exchange, which may be fine if you don't have a lot of objects in this
alt odb. If you do, well, we need to deal with lots of refs anyway.
This may benefit from it too.

> [3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020

This points to  https://github.com/peff/git/commits/jk/external-odb
which is dead. Jeff, do you still have it somewhere, or is it not
worth looking at anymore?
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-05-20 Thread Christian Couder
I am responding to this 2+ month old email because I am investigating
adding an alternate object store at the same level as loose and packed
objects. This alternate object store could be used for large files. I
am working on this for GitLab. (Yeah, I am working, as a freelancer,
for both Booking.com and GitLab these days.)

On Wed, Mar 2, 2016 at 9:32 PM, Junio C Hamano  wrote:
> The bundle v3 format introduces an ability to have the bundle header
> (which describes what references in the bundled history can be
> fetched, and what objects the receiving repository must have in
> order to unbundle it successfully) in one file, and the bundled pack
> stream data in a separate file.
>
> A v3 bundle file begins with a line with "# v3 git bundle", followed
> by zero or more "extended header" lines, and an empty line, finally
> followed by the list of prerequisites and references in the same
> format as v2 bundle.  If it uses the "split bundle" feature, there
> is a "data: $URL" extended header line, and nothing follows the list
> of prerequisites and references.  Also, "sha1: " extended header
> line may exist to help validating that the pack stream data matches
> the bundle header.
>
> A typical expected use of a split bundle is to help initial clone
> that involves a huge data transfer, and would go like this:
>
>  - Any repository people would clone and fetch from would regularly
>    be repacked, and it is expected that there would be a packfile
>    without prerequisites that holds all (or at least most) of the
>    history of it (call it pack-$name.pack).
>
>  - After arranging that packfile to be downloadable over popular
>    transfer methods used for serving static files (such as HTTP or
>    HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
>    bundle file (call it $name.bndl) can be prepared with an extended
>    header "data: $URL/pack-$name.pack" to point at the download
>    location for the packfile, and be served at "$URL/$name.bndl".
>
>  - An updated Git client, when trying to "git clone" from such a
>    repository, may be redirected to "$URL/$name.bndl", which would be
>    a tiny text file (when the split bundle feature is used).
>
>  - The client would then inspect the downloaded $name.bndl, learn
>    that the corresponding packfile exists at $URL/pack-$name.pack,
>    and download it as pack-$name.pack, until the download succeeds.
>    This can easily be done with a "wget --continue" equivalent over an
>    unreliable link.  The checksum recorded on the "sha1: " header
>    line is expected to be used by this downloader (not written yet).

I wonder if this mechanism could also be used or extended to clone and
fetch an alternate object database.

In [1], [2] and [3] (and this also came up during the Contributor
Summit last month), Peff says that he started working on alternate
object database support a long time ago, and that the hard part is a
protocol extension to tell remotes that you can access some objects
in a different way.

If a Git client would download a "$name.bndl" v3 bundle file that
would have a "data: $URL/alt-odb-$name.odb" extended header, the Git
client would just need to download "$URL/alt-odb-$name.odb" and use
the alternate object database support on this file.

This way the client would learn everything it needs to access the
objects in the alternate database. The alternate object database may
not contain the real objects (if they are too big, for example) but
just files that describe how to get the real objects.
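
For example (a purely hypothetical layout), an entry in such an
alternate object database could be a small text file keyed by object
name:

    # entry for one punted blob in alt-odb-$name.odb
    type blob
    size 104857600
    url https://odb.example.com/objects/1234abcd

Git would then fetch the real content over HTTP only when the blob is
actually needed.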

>  - After fully downloading $name.bndl and pack-$name.pack and
>    storing them next to each other, the client would clone from the
>    $name.bndl; this would populate the newly created repository with
>    reasonably recent history.
>
>  - Then the client can issue "git fetch" against the original
>    repository to obtain the most recent part of the history created
>    since the bundle was made.
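
Seen from the client side, the whole flow would then be something
like this (names are invented, and it assumes a split-bundle-aware
Git as described above):

    wget https://example.com/frotz.bndl
    wget --continue https://example.com/pack-frotz.pack  # resumable
    git clone frotz.bndl frotz   # header and pack stored side by side
    cd frotz
    git remote set-url origin https://example.com/frotz.git
    git fetch origin             # history created since the bundle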

[1] http://thread.gmane.org/gmane.comp.version-control.git/206886/focus=207040
[2] http://thread.gmane.org/gmane.comp.version-control.git/247171
[3] http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-03-02 Thread Duy Nguyen
On Thu, Mar 3, 2016 at 9:57 AM, Junio C Hamano  wrote:
> Duy Nguyen  writes:
>
>> would it be
>> ok if we introduced a minimal resumable download service via
>> git-daemon to enable this feature with very little setup? Like
>> git-shell, you can only download certain packfiles for this use case
>> and nothing else with this service.
>
> I think it is a matter of priorities.
>
> A minimalistic site that offers only git-daemon traffic without a
> working HTTP server would certainly benefit from such a thing, but
> serving static files efficiently over the web is a commodity service
> these days.  Wouldn't it be sufficient to just recommend having a
> normal HTTP server serving static files, which should be "very
> little setup" in today's world?
>
> Such a "minimal resumable download service" over the git-daemon
> transport still has to reinvent what is already done well by the
> HTTP servers and clients (e.g. support of ETag equivalent to make
> sure that the client can notice that the underlying data has changed
> for a given resource, headers to communicate the total length,
> making a range request and responding to it, etc. etc.).
>
> In addition, by going the custom protocol route, you wouldn't
> benefit from caching HTTP proxies available to the clients.
>
> So I am not sure if the benefit outweighs the cost.

What I had in mind was individuals who just want to publish their work
over git://. Right now it's just a matter of running git-daemon and
configuring it a bit. If it were me, I wouldn't expect all the bells
and whistles that come with HTTP. But I agree that this is a low
priority, "scratch your own itch" kind of thing. Let's have resumable
clone with standard download protocols first, then we'll see.
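
(Where "configuring it a bit" today is essentially just something
like:

    git daemon --base-path=/srv/git --export-all --reuseaddr

with repositories under /srv/git; the path and options are only an
example.)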
-- 
Duy


Re: [PATCH v2 4/4] bundle v3: the beginning

2016-03-02 Thread Junio C Hamano
Duy Nguyen  writes:

> would it be
> ok if we introduced a minimal resumable download service via
> git-daemon to enable this feature with very little setup? Like
> git-shell, you can only download certain packfiles for this use case
> and nothing else with this service.

I think it is a matter of priorities.

A minimalistic site that offers only git-daemon traffic without a
working HTTP server would certainly benefit from such a thing, but
serving static files efficiently over the web is a commodity service
these days.  Wouldn't it be sufficient to just recommend having a
normal HTTP server serving static files, which should be "very
little setup" in today's world?

Such a "minimal resumable download service" over the git-daemon
transport still has to reinvent what is already done well by the
HTTP servers and clients (e.g. support of ETag equivalent to make
sure that the client can notice that the underlying data has changed
for a given resource, headers to communicate the total length,
making a range request and responding to it, etc. etc.).
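
For example, resuming against any stock HTTP server already just
works (the URL is invented):

    curl -O https://example.com/pack-frotz.pack        # link drops midway
    curl -C - -O https://example.com/pack-frotz.pack   # resume; curl sends
                                                       # "Range:" and the
                                                       # server answers 206

and an ETag/If-Range pair lets the client notice that the pack was
replaced and restart from scratch.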

In addition, by going the custom protocol route, you wouldn't
benefit from caching HTTP proxies available to the clients.

So I am not sure if the benefit outweighs the cost.

I wouldn't stop you if you really want to do it, but again, it is a
matter of priorities.  I personally feel that it would be a waste of
engineering talent, and it certainly would be a waste of review
bandwidth, if you gave priority to this over other more widely
useful parts of the system.  The procedure to repack should be
updated to produce such a base pack with the separate bundle header
on the server side, the protocol needs to be updated to allow
redirection for "clone" traffic, the logic to decide when to
redirect must be designed (e.g. "single branch" clone should not
choose a pack/bundle that represents the full repository, but a pack
for the branch that was asked for), etc.  There are still tons of
things that need to be done, and it would be a distraction to invent
a custom download service that nobody other than git-daemon speaks
before all of the above is done.
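
Until then, a server operator could approximate the repack side by
hand, roughly like this (the paths, the URL and the exact meaning of
the "sha1:" line are all assumptions):

    cd /srv/git/frotz.git
    git repack -a -d                      # one pack, no prerequisites
    pack=$(ls objects/pack/pack-*.pack)   # assumes a single pack
    name=$(basename "$pack" .pack)
    {
        echo '# v3 git bundle'
        echo "data: https://example.com/$name.pack"
        echo "sha1: $(sha1sum "$pack" | cut -d' ' -f1)"
        echo
        git for-each-ref --format='%(objectname) %(refname)' refs/heads refs/tags
    } >"$name.bndl"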



Re: [PATCH v2 4/4] bundle v3: the beginning

2016-03-02 Thread Duy Nguyen
On Thu, Mar 3, 2016 at 3:32 AM, Junio C Hamano  wrote:
>  - After arranging that packfile to be downloadable over popular
>    transfer methods used for serving static files (such as HTTP or
>    HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
>    bundle file (call it $name.bndl) can be prepared with an extended
>    header "data: $URL/pack-$name.pack" to point at the download
>    location for the packfile, and be served at "$URL/$name.bndl".

Extra setup to offload things to a CDN is great and all. But would it be
ok if we introduced a minimal resumable download service via
git-daemon to enable this feature with very little setup? Like
git-shell, you can only download certain packfiles for this use case
and nothing else with this service.
-- 
Duy


[PATCH v2 4/4] bundle v3: the beginning

2016-03-02 Thread Junio C Hamano
The bundle v3 format introduces an ability to have the bundle header
(which describes what references in the bundled history can be
fetched, and what objects the receiving repository must have in
order to unbundle it successfully) in one file, and the bundled pack
stream data in a separate file.

A v3 bundle file begins with a line with "# v3 git bundle", followed
by zero or more "extended header" lines, and an empty line, finally
followed by the list of prerequisites and references in the same
format as a v2 bundle.  If it uses the "split bundle" feature, there
is a "data: $URL" extended header line, and nothing follows the list
of prerequisites and references.  Also, a "sha1: " extended header
line may exist to help validate that the pack stream data matches
the bundle header.

A typical expected use of a split bundle is to help an initial clone
that involves a huge data transfer, and would go like this:

 - Any repository people would clone and fetch from would regularly
   be repacked, and it is expected that there would be a packfile
   without prerequisites that holds all (or at least most) of the
   history of it (call it pack-$name.pack).

 - After arranging that packfile to be downloadable over popular
   transfer methods used for serving static files (such as HTTP or
   HTTPS) that are easily resumable as $URL/pack-$name.pack, a v3
   bundle file (call it $name.bndl) can be prepared with an extended
   header "data: $URL/pack-$name.pack" to point at the download
   location for the packfile, and be served at "$URL/$name.bndl".

 - An updated Git client, when trying to "git clone" from such a
   repository, may be redirected to "$URL/$name.bndl", which would be
   a tiny text file (when split bundle feature is used).

 - The client would then inspect the downloaded $name.bndl, learn
   that the corresponding packfile exists at $URL/pack-$name.pack,
   and downloads it as pack-$name.pack, until the download succeeds.
   This can easily be done with "wget --continue" equivalent over an
   unreliable link.  The checksum recorded on the "sha1: " header
   line is expected to be used by this downloader (not written yet).

 - After fully downloading $name.bndl and pack-$name.pack and
   storing them next to each other, the client would clone from the
   $name.bndl; this would populate the newly created repository with
   reasonably recent history.

 - Then the client can issue "git fetch" against the original
   repository to obtain the most recent part of the history created
   since the bundle was made.

Signed-off-by: Junio C Hamano 
---
 bundle.c          | 103 +-
 bundle.h          |   3 ++
 t/t5704-bundle.sh |  64 +
 3 files changed, 161 insertions(+), 9 deletions(-)

diff --git a/bundle.c b/bundle.c
index 32bdb01..480630d 100644
--- a/bundle.c
+++ b/bundle.c
@@ -10,7 +10,8 @@
 #include "refs.h"
 #include "argv-array.h"
 
-static const char bundle_signature[] = "# v2 git bundle\n";
+static const char bundle_signature_v2[] = "# v2 git bundle\n";
+static const char bundle_signature_v3[] = "# v3 git bundle\n";
 
 static void add_to_ref_list(const unsigned char *sha1, const char *name,
                             struct ref_list *list)
@@ -33,16 +34,55 @@ static int parse_bundle_header(int fd, struct bundle_header *header, int quiet)
         int status = 0;
 
         /* The bundle header begins with the signature */
-        if (strbuf_getwholeline_fd(&buf, fd, '\n') ||
-            strcmp(buf.buf, bundle_signature)) {
+        if (strbuf_getwholeline_fd(&buf, fd, '\n')) {
+        bad_bundle:
                 if (!quiet)
-                        error(_("'%s' does not look like a v2 bundle file"),
+                        error(_("'%s' does not look like a supported bundle file"),
                               header->filename);
                 status = -1;
                 goto abort;
         }
 
-        /* The bundle header ends with an empty line */
+        if (!strcmp(buf.buf, bundle_signature_v2))
+                header->bundle_version = 2;
+        else if (!strcmp(buf.buf, bundle_signature_v3))
+                header->bundle_version = 3;
+        else
+                goto bad_bundle;
+
+        if (header->bundle_version == 3) {
+                /*
+                 * bundle version v3 has extended headers before the
+                 * list of prerequisites and references.  The extended
+                 * headers end with an empty line.
+                 */
+                while (!strbuf_getwholeline_fd(&buf, fd, '\n')) {
+                        const char *cp;
+                        if (buf.len && buf.buf[buf.len - 1] == '\n')
+                                buf.buf[--buf.len] = '\0';
+                        if (!buf.len)
+                                break;
+                        if (skip_prefix(buf.buf, "data: ", &cp)) {
+                                header->datafile = xstrdup(cp);
+                                continue;