Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-15 Thread Jonathan Tan
On Thu, 15 Jun 2017 16:28:24 -0400
Jeff Hostetler  wrote:

> I agree with Peff here.  I've been working on my partial/narrow/sparse
> clone/fetch ideas since my original RFC and have come to the conclusion
> that the server can do the size limiting efficiently, but we should
> leave the pathname filtering to the client.  That is, let the client
> get the commits and trees and then locally apply pattern matching,
> whether that be a sparse-checkout definition or simple ".git*"
> matching and then make a later request for the "blobs of interest".

This means that the client would need to inflate all the trees it
received, but that might not be that bad.
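
Roughly, that client-side step could look like the sketch below: walk the
already-fetched trees with git ls-tree (which only reads tree objects, so
missing blobs are not a problem) and apply fnmatch-style patterns. The
pattern list here is only an illustration, not a proposed interface.

  import subprocess
  from fnmatch import fnmatch

  def blobs_of_interest(commit, patterns):
      # Walk the (already fetched) trees of `commit` and collect the ids of
      # blobs whose paths match a sparse pattern or are ".git*" specials.
      out = subprocess.run(
          ["git", "ls-tree", "-r", "-z", commit],
          capture_output=True, check=True, text=True,
      ).stdout
      wanted = set()
      for entry in filter(None, out.split("\0")):
          meta, path = entry.split("\t", 1)
          mode, otype, oid = meta.split()
          if otype != "blob":
              continue
          basename = path.rsplit("/", 1)[-1]
          if basename.startswith(".git") or any(fnmatch(path, p) for p in patterns):
              wanted.add(oid)
      return wanted

  # e.g. a sparse-checkout definition reduced to glob patterns
  blobs = blobs_of_interest("HEAD", ["Documentation/*", "t/*"])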

> Whether we "fault-in" the missing blobs or have a "fetch-blobs"
> command (like fetch-pack) to get them is open to debate.
> 
> There are concerns about the size of the requested blob-id list in a
> fetch-blobs approach, but I think there are ways to say I need all
> of the blobs referenced by the directory /foo in commit <commit1> (and
> optionally, that aren't present in directory /foo in commit <commit2>
> or in the range <commit1>..<commit2>).  (handwave)

Unless the server no longer has the relevant commit (e.g. because a
branch was rewound), but even then, even if we only sent blob hashes,
the list would probably be much smaller than the downloaded pack anyway,
so things might still be OK.


Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-15 Thread Jeff Hostetler



On 6/2/2017 6:26 PM, Jeff King wrote:

On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

...

We have a name-hash cache extension in the bitmap file, but it doesn't
carry enough information to deduce the .git-ness of a file. I don't
think it would be too hard to add a "flags" extension, and give a single
bit to "this is a .git file".

I do also wonder if the two features would need to be separated for a
GVFS-style solution. If you're not just avoiding large blobs but trying
to get a narrow clone, you don't want the .git files from the
uninteresting parts of the tree. You want to get no blobs at all, and
then fault them in as they become relevant due to user action.

-Peff



I agree with Peff here.  I've been working on my partial/narrow/sparse
clone/fetch ideas since my original RFC and have come to the conclusion
that the server can do the size limiting efficiently, but we should
leave the pathname filtering to the client.  That is, let the client
get the commits and trees and then locally apply pattern matching,
whether that be a sparse-checkout definition or simple ".git*"
matching and then make a later request for the "blobs of interest".

Whether we "fault-in" the missing blobs or have a "fetch-blobs"
command (like fetch-pack) to get them is open to debate.

There are concerns about the size of the requested blob-id list in a
fetch-blobs approach, but I think there are ways to say I need all
of the blobs referenced by the directory /foo in commit <commit1> (and
optionally, that aren't present in directory /foo in commit <commit2>
or in the range <commit1>..<commit2>).  (handwave)
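
A very rough sketch of how a client could compute such a set locally from
the trees it already has (the directory and the commit names below are
placeholders, standing in for whatever request format we end up with):

  import subprocess

  def blobs_under(commit, path):
      # Blob ids reachable from `commit` under `path`; only trees are read.
      out = subprocess.run(
          ["git", "ls-tree", "-r", commit, path],
          capture_output=True, check=True, text=True,
      ).stdout
      return {line.split()[2] for line in out.splitlines() if line}

  # blobs referenced by /foo in <commit1> that aren't present in /foo in <commit2>
  need = blobs_under("commit1", "foo") - blobs_under("commit2", "foo")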

Jeff


Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-07 Thread Jeff King
On Fri, Jun 02, 2017 at 04:25:08PM -0700, Jonathan Nieder wrote:

> > We have a name-hash cache extension in the bitmap file, but it doesn't
> > carry enough information to deduce the .git-ness of a file. I don't
> > think it would be too hard to add a "flags" extension, and give a single
> > bit to "this is a .git file".
> 
> A nicer approach IMHO is to include an extra bitmap, like the existing
> object-type bitmaps (see the dump_bitmap calls in
> bitmap_writer_finish).  This would represent the set of all
> .git* blobs in the pack.

Yeah, it could be stored as a bitmap, which would be slightly smaller
(since it would be mostly 0's). I think either way it would need an
extension flag in the header to signal its presence.

Older versions of Git are OK with having flags they don't understand. I
know JGit used to complain about seeing a bitmap with unknown flags, but
I'm not sure if that is still the case.

> >  If you're not just avoiding large blobs but trying
> > to get a narrow clone, you don't want the .git files from the
> > uninteresting parts of the tree.
> 
> This is something I've wondered about, too.  Part of the story is that
> we haven't started omitting trees, so there are already O(number of
> trees) objects being sent, and some additional small blobs for .git*
> specials don't make it much worse.  Sending all .git* blobs keeps
> things simple since the server doesn't have to infer which .git* blobs
> are relevant to this client.
> 
> Longer term, we will likely want to allow clients to request omission
> of some trees, too.  Omitting the corresponding .git* files becomes
> more straightforward at that point.
> 
> Does that make sense?

Yeah, I agree we'd want to avoid the trees, too, in that case.

-Peff


Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-03 Thread Junio C Hamano
Jeff King  writes:

> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.

Thanks; I didn't think of this one while reviewing the overall design,
but I agree that this is something important to think about.


Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-02 Thread Jonathan Nieder
Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

>> +--blob-max-bytes=<n>::
>> +This option can only be used with --stdout. If specified, a blob
>> +larger than this will not be packed unless a to-be-packed tree
>> +has that blob with a filename beginning with ".git".  The size
>> +can be suffixed with "k", "m", or "g", and may be "0".
>> ++
>> +If specified, after printing the packfile, pack-objects will print the
>> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
>> +in hash order, pack-objects will print its hash (20 bytes) and its size
>> +(8 bytes). All numbers are in network byte order.
>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>
> Hmm. So this feature can't be used with bitmaps at all?  That seems like
> a big downside, as we still have to walk the whole graph to come up with
> the list of blobs (even if we're sending them). That's 30-40 seconds of
> CPU per clone on torvalds/linux. I imagine it's much worse on
> repositories big enough to need a full GVFS-style "don't send any blobs"
> approach.
>
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".

A nicer approach IMHO is to include an extra bitmap, like the existing
object-type bitmaps (see the dump_bitmap calls in
bitmap_writer_finish).  This would represent the set of all
.git* blobs in the pack.
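
Conceptually that is just one bit per object in pack order; a sketch of the
idea (ignoring the EWAH compression the real on-disk bitmaps use):

  def dot_git_bitmap(pack_entries):
      # pack_entries: (type, name) per object, in the same order as the
      # existing object-type bitmaps.  Set bit i when entry i is a blob
      # whose in-tree name starts with ".git".  Illustrative only.
      bits = bytearray((len(pack_entries) + 7) // 8)
      for i, (otype, name) in enumerate(pack_entries):
          if otype == "blob" and name and name.startswith(".git"):
              bits[i // 8] |= 1 << (i % 8)
      return bytes(bits)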

[...]
>  If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree.

This is something I've wondered about, too.  Part of the story is that
we haven't started omitting trees, so there are already O(number of
trees) objects being sent, and some additional small blobs for .git*
specials don't make it much worse.  Sending all .git* blobs keeps
things simple since the server doesn't have to infer which .git* blobs
are relevant to this client.

Longer term, we will likely want to allow clients to request omission
of some trees, too.  Omitting the corresponding .git* files becomes
more straightforward at that point.

Does that make sense?

Jonathan


Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-02 Thread Jeff King
On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

> +--blob-max-bytes=<n>::
> + This option can only be used with --stdout. If specified, a blob
> + larger than this will not be packed unless a to-be-packed tree
> + has that blob with a filename beginning with ".git".  The size
> + can be suffixed with "k", "m", or "g", and may be "0".
> ++
> +If specified, after printing the packfile, pack-objects will print the
> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
> +in hash order, pack-objects will print its hash (20 bytes) and its size
> +(8 bytes). All numbers are in network byte order.
> ++
> +If pack-objects cannot efficiently determine, for an arbitrary blob,
> +that no to-be-packed tree has that blob with a filename beginning with
> +".git" (for example, if bitmaps are used to calculate the objects to be
> +packed), it will pack all blobs regardless of size.
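
For concreteness, a client would consume that trailer roughly like this (a
sketch assuming exactly the layout described above):

  import struct

  def read_excluded_blobs(stream):
      # After the packfile: an 8-byte count, then per excluded blob a
      # 20-byte hash and an 8-byte size, all in network byte order.
      (count,) = struct.unpack(">Q", stream.read(8))
      blobs = []
      for _ in range(count):
          sha = stream.read(20).hex()
          (size,) = struct.unpack(">Q", stream.read(8))
          blobs.append((sha, size))
      return blobs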

Hmm. So this feature can't be used with bitmaps at all?  That seems like
a big downside, as we still have to walk the whole graph to come up with
the list of blobs (even if we're sending them). That's 30-40 seconds of
CPU per clone on torvalds/linux. I imagine it's much worse on
repositories big enough to need a full GVFS-style "don't send any blobs"
approach.

We have a name-hash cache extension in the bitmap file, but it doesn't
carry enough information to deduce the .git-ness of a file. I don't
think it would be too hard to add a "flags" extension, and give a single
bit to "this is a .git file".

I do also wonder if the two features would need to be separated for a
GVFS-style solution. If you're not just avoiding large blobs but trying
to get a narrow clone, you don't want the .git files from the
uninteresting parts of the tree. You want to get no blobs at all, and
then fault them in as they become relevant due to user action.

-Peff