Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Thu, 15 Jun 2017 16:28:24 -0400, Jeff Hostetler wrote:
> I agree with Peff here. I've been working on my partial/narrow/sparse
> clone/fetch ideas since my original RFC and have come to the conclusion
> that the server can do the size limiting efficiently, but we should
> leave the pathname filtering to the client. That is, let the client
> get the commits and trees and then locally apply pattern matching,
> whether that be a sparse-checkout definition or simple ".git*"
> matching and then make a later request for the "blobs of interest".

This means that the client would need to inflate all the trees it
received, but that might not be that bad.

> Whether we "fault-in" the missing blobs or have a "fetch-blobs"
> command (like fetch-pack) to get them is open to debate.
>
> There are concerns about the size of the requested blob-id list in a
> fetch-blobs approach, but I think there are ways to say I need all
> of the blobs referenced by the directory /foo in commit (and
> optionally, that aren't present in directory /foo in commit
> or in the range ..). (handwave)

Unless the server no longer has the relevant commit (e.g. because a
branch was rewound), but even then, even if we only sent blob hashes,
the list would probably be much smaller than the downloaded pack anyway,
so things might still be OK.
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On 6/2/2017 6:26 PM, Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
> ...
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".
>
> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.
>
> -Peff

I agree with Peff here. I've been working on my partial/narrow/sparse
clone/fetch ideas since my original RFC and have come to the conclusion
that the server can do the size limiting efficiently, but we should
leave the pathname filtering to the client. That is, let the client
get the commits and trees and then locally apply pattern matching,
whether that be a sparse-checkout definition or simple ".git*"
matching and then make a later request for the "blobs of interest".

Whether we "fault-in" the missing blobs or have a "fetch-blobs"
command (like fetch-pack) to get them is open to debate.

There are concerns about the size of the requested blob-id list in a
fetch-blobs approach, but I think there are ways to say I need all
of the blobs referenced by the directory /foo in commit (and
optionally, that aren't present in directory /foo in commit
or in the range ..). (handwave)

Jeff
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Fri, Jun 02, 2017 at 04:25:08PM -0700, Jonathan Nieder wrote:

> > We have a name-hash cache extension in the bitmap file, but it doesn't
> > carry enough information to deduce the .git-ness of a file. I don't
> > think it would be too hard to add a "flags" extension, and give a single
> > bit to "this is a .git file".
>
> A nicer approach IMHO is to include an extra bitmap, like the existing
> object-type bitmaps (see the dump_bitmap calls in
> bitmap_writer_finish). This would represent the set of all
> .git* blobs in the pack.

Yeah, it could be stored as a bitmap, which would be slightly smaller
(since it would be mostly 0's). I think either way it would need an
extension flag in the header to signal its presence. Older versions of
Git are OK with having flags they don't understand. I know JGit used to
complain about seeing a bitmap with unknown flags, but I'm not sure if
that is still the case.

> > If you're not just avoiding large blobs but trying
> > to get a narrow clone, you don't want the .git files from the
> > uninteresting parts of the tree.
>
> This is something I've wondered about, too. Part of the story is that
> we haven't started omitting trees, so there is already O(number of
> trees) objects being sent and some additional small blobs for .git*
> specials doesn't make it much worse. Sending all .git* blobs keeps
> things simple since the server doesn't have to infer which .git* blobs
> are relevant to this client.
>
> Longer term, we will likely want to allow clients to request omission
> of some trees, too. Omitting the corresponding .git* files becomes
> more straightforward at that point.
>
> Does that make sense?

Yeah, I agree we'd want to avoid the trees, too, in that case.

-Peff
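The extra-bitmap idea discussed above can be pictured with a plain bit
array indexed by pack position: bit i is set when object i appears in
some tree under a name beginning with ".git". This is only an
illustrative sketch with names of my own invention; Git's real bitmaps
are EWAH-compressed and written in pack-bitmap-write.c.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the proposed ".git* blobs" bitmap. */
struct gitfile_bitmap {
	uint64_t *words;
	size_t nr_objects;
};

static struct gitfile_bitmap *gitfile_bitmap_new(size_t nr_objects)
{
	struct gitfile_bitmap *b = malloc(sizeof(*b));
	b->nr_objects = nr_objects;
	/* one bit per object in the pack, initially all 0 */
	b->words = calloc((nr_objects + 63) / 64, sizeof(uint64_t));
	return b;
}

/* Called once per tree entry while writing the bitmap index:
 * mark pack position `pos` if the entry's basename starts with ".git". */
static void gitfile_bitmap_note_entry(struct gitfile_bitmap *b, size_t pos,
				      const char *basename)
{
	if (!strncmp(basename, ".git", 4))
		b->words[pos / 64] |= (uint64_t)1 << (pos % 64);
}

static int gitfile_bitmap_test(const struct gitfile_bitmap *b, size_t pos)
{
	return (b->words[pos / 64] >> (pos % 64)) & 1;
}
```

Because most objects are not .git* files, such a bitmap is mostly
zeroes, which is exactly the case EWAH run-length compression handles
well; that is the "slightly smaller" point made above.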
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
Jeff King writes:

> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.

Thanks; I didn't think of this one while reviewing the overall design,
but I agree that this is something important to think about.
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
>> +--blob-max-bytes=<n>::
>> +	This option can only be used with --stdout. If specified, a blob
>> +	larger than this will not be packed unless a to-be-packed tree
>> +	has that blob with a filename beginning with ".git". The size
>> +	can be suffixed with "k", "m", or "g", and may be "0".
>> ++
>> +If specified, after printing the packfile, pack-objects will print the
>> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
>> +in hash order, pack-objects will print its hash (20 bytes) and its size
>> +(8 bytes). All numbers are in network byte order.
>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>
> Hmm. So this feature can't be used with bitmaps at all? That seems like
> a big downside, as we still have to walk the whole graph to come up with
> the list of blobs (even if we're sending them). That's 30-40 seconds of
> CPU per clone on torvalds/linux. I imagine it's much worse on
> repositories big enough to need a full GVFS-style "don't send any blobs"
> approach.
>
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".

A nicer approach IMHO is to include an extra bitmap, like the existing
object-type bitmaps (see the dump_bitmap calls in
bitmap_writer_finish). This would represent the set of all
.git* blobs in the pack.

[...]

> If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree.

This is something I've wondered about, too. Part of the story is that
we haven't started omitting trees, so there is already O(number of
trees) objects being sent and some additional small blobs for .git*
specials doesn't make it much worse. Sending all .git* blobs keeps
things simple since the server doesn't have to infer which .git* blobs
are relevant to this client.

Longer term, we will likely want to allow clients to request omission
of some trees, too. Omitting the corresponding .git* files becomes
more straightforward at that point.

Does that make sense?

Jonathan
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

> +--blob-max-bytes=<n>::
> +	This option can only be used with --stdout. If specified, a blob
> +	larger than this will not be packed unless a to-be-packed tree
> +	has that blob with a filename beginning with ".git". The size
> +	can be suffixed with "k", "m", or "g", and may be "0".
> ++
> +If specified, after printing the packfile, pack-objects will print the
> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
> +in hash order, pack-objects will print its hash (20 bytes) and its size
> +(8 bytes). All numbers are in network byte order.
> ++
> +If pack-objects cannot efficiently determine, for an arbitrary blob,
> +that no to-be-packed tree has that blob with a filename beginning with
> +".git" (for example, if bitmaps are used to calculate the objects to be
> +packed), it will pack all blobs regardless of size.

Hmm. So this feature can't be used with bitmaps at all? That seems like
a big downside, as we still have to walk the whole graph to come up with
the list of blobs (even if we're sending them). That's 30-40 seconds of
CPU per clone on torvalds/linux. I imagine it's much worse on
repositories big enough to need a full GVFS-style "don't send any blobs"
approach.

We have a name-hash cache extension in the bitmap file, but it doesn't
carry enough information to deduce the .git-ness of a file. I don't
think it would be too hard to add a "flags" extension, and give a single
bit to "this is a .git file".

I do also wonder if the two features would need to be separated for a
GVFS-style solution. If you're not just avoiding large blobs but trying
to get a narrow clone, you don't want the .git files from the
uninteresting parts of the tree. You want to get no blobs at all, and
then fault them in as they become relevant due to user action.

-Peff
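The trailer format the documentation describes (an 8-byte big-endian
count, then per excluded blob a 20-byte hash followed by an 8-byte
big-endian size) can be sketched like this; the helper names are my
own, not from the patch:

```c
#include <stdint.h>
#include <string.h>

#define HASH_SZ 20  /* SHA-1 */

/* Write a 64-bit value in network (big-endian) byte order. */
static void put_be64(unsigned char *out, uint64_t v)
{
	int i;
	for (i = 7; i >= 0; i--) {
		out[i] = v & 0xff;
		v >>= 8;
	}
}

static uint64_t get_be64(const unsigned char *in)
{
	uint64_t v = 0;
	int i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/*
 * Serialize the excluded-blob trailer into `out`; the hashes must
 * already be in hash order. Returns the number of bytes written,
 * i.e. 8 + nr * (HASH_SZ + 8).
 */
static size_t write_trailer(unsigned char *out,
			    const unsigned char (*hashes)[HASH_SZ],
			    const uint64_t *sizes, uint64_t nr)
{
	unsigned char *p = out;
	uint64_t i;

	put_be64(p, nr);
	p += 8;
	for (i = 0; i < nr; i++) {
		memcpy(p, hashes[i], HASH_SZ);
		p += HASH_SZ;
		put_be64(p, sizes[i]);
		p += 8;
	}
	return p - out;
}
```

Because every record is fixed-size and the entries are in hash order, a
client can store the trailer directly to disk and binary-search it,
which is presumably the "convenient for a client" property the commit
message mentions.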
[WIP v2 2/2] pack-objects: support --blob-max-bytes
As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, teach pack-objects to support --blob-max-bytes, packing only
blobs not exceeding that size unless the blob corresponds to a file
whose name starts with ".git". upload-pack will eventually be taught to
use this new parameter if needed to exclude certain blobs during a fetch
or clone, potentially drastically reducing network consumption when
serving these very large repositories.

Since any excluded blob should not act as a delta base, I did this by
preventing such oversized blobs from ever entering the "to_pack" data
structure, which contains both preferred bases (which do not go into the
final packfile but can act as delta bases) and the objects to be packed
(which go into the final packfile and also can act as delta bases). To
that end, add_object_entry() has been modified to exclude the
appropriate non-preferred-base objects. (Preferred bases, regardless of
size, still enter "to_pack" - they are something that the client
indicates that it has, not something that the server needs to serve, so
no exclusion occurs.)

In addition, any such exclusions are recorded and sent before the
packfile. These are recorded in binary format and in hash order, in a
format convenient for a client to use if stored directly to disk.

The other method in which objects are added to "to_pack",
add_object_entry_from_bitmap(), has not been modified - thus packing in
the presence of bitmaps still packs all blobs regardless of size. See
the documentation update in this commit for the rationale.
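The exclusion rule the commit message describes could be sketched as
below. This is my own simplification, not the patch's code: the real
check lives inside add_object_entry() in builtin/pack-objects.c, and
the function name here is hypothetical.

```c
#include <string.h>

static long blob_max_bytes = -1;  /* -1 means no limit configured */

/*
 * Decide whether a candidate blob stays out of to_pack.
 * A blob is excluded only if a limit is set, the blob exceeds it,
 * it is not a preferred base, and its tree-entry basename does not
 * begin with ".git".
 */
static int exclude_blob(unsigned long size, const char *basename,
			int preferred_base)
{
	if (preferred_base)
		return 0;  /* client has it; keep it as a delta base */
	if (blob_max_bytes < 0)
		return 0;  /* feature disabled */
	if (size <= (unsigned long)blob_max_bytes)
		return 0;  /* within the limit */
	if (basename && !strncmp(basename, ".git", 4))
		return 0;  /* .git* files are always packed */
	return 1;
}
```

Keeping excluded blobs out of to_pack entirely, rather than filtering
them at write time, is what guarantees they cannot be chosen as delta
bases for objects that are sent.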
Signed-off-by: Jonathan Tan
---
 Documentation/git-pack-objects.txt |  19 ++-
 builtin/pack-objects.c             | 105 -
 t/t5300-pack-object.sh             |  71 +
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 8973510a4..23ff8b5be 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--stdout | base-name]
+	[--revs [--unpacked | --all]]
+	[--stdout [--blob-max-bytes=<n>] | base-name]
 	[--shallow] [--keep-true-parents] < object-list
@@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 With this option, parents that are hidden by grafts are packed
 nevertheless.
 
+--blob-max-bytes=<n>::
+	This option can only be used with --stdout. If specified, a blob
+	larger than this will not be packed unless a to-be-packed tree
+	has that blob with a filename beginning with ".git". The size
+	can be suffixed with "k", "m", or "g", and may be "0".
++
+If specified, after printing the packfile, pack-objects will print the
+count of excluded blobs (8 bytes). Subsequently, for each excluded blob
+in hash order, pack-objects will print its hash (20 bytes) and its size
+(8 bytes). All numbers are in network byte order.
++
+If pack-objects cannot efficiently determine, for an arbitrary blob,
+that no to-be-packed tree has that blob with a filename beginning with
+".git" (for example, if bitmaps are used to calculate the objects to be
+packed), it will pack all blobs regardless of size.
+
 
 SEE ALSO
 --------
 linkgit:git-rev-list[1]

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 730fa3d85..46578c05b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static long blob_max_bytes = -1;
+
 /*
  * stats
  */
@@ -1069,17 +1071,73 @@ static const char no_closure_warning[] = N_(
 "disabling bitmap writing, as some objects are not being packed"
 );
 
+struct oversized_blob {
+	struct hashmap_entry entry;
+	struct object_id oid;
+	unsigned long size;
+};
+struct hashmap oversized_blobs;
+
+static int oversized_blob_cmp_fn(const void *b1, const void *b2,
+				 const void *keydata)
+{
+	const struct object_id *oid2 = keydata ? keydata :
+		&((const struct oversized_blob *) b2)->oid;
+	return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2);
+}
+
+/*
+ * Returns 1 and records this blob in oversized_blobs if the given blob does
+ * not meet any defined blob size limits. Blobs that appear as a tree entry
+ * whose basename begins with ".git" are never considered oversized.
+ *
+ * If the tree entry corresponding to the given blob is unknown, pass NULL as
+ * name. In this case, this function will always return