Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Thu, 15 Jun 2017 16:28:24 -0400, Jeff Hostetler wrote:
> I agree with Peff here. I've been working on my partial/narrow/sparse
> clone/fetch ideas since my original RFC and have come to the conclusion
> that the server can do the size limiting efficiently, but we should
> leave the pathname filtering to the client. That is, let the client
> get the commits and trees and then locally apply pattern matching,
> whether that be a sparse-checkout definition or simple ".git*"
> matching and then make a later request for the "blobs of interest".

This means that the client would need to inflate all the trees it
received, but that might not be that bad.

> Whether we "fault-in" the missing blobs or have a "fetch-blobs"
> command (like fetch-pack) to get them is open to debate.
>
> There are concerns about the size of the requested blob-id list in a
> fetch-blobs approach, but I think there are ways to say I need all
> of the blobs referenced by the directory /foo in commit (and
> optionally, that aren't present in directory /foo in commit
> or in the range ..). (handwave)

Unless the server no longer has the relevant commit (e.g. because a
branch was rewound), but even then, even if we only sent blob hashes,
the list would probably be much smaller than the downloaded pack anyway,
so things might still be OK.
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On 6/2/2017 6:26 PM, Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
> ...
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".
>
> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.
>
> -Peff

I agree with Peff here. I've been working on my partial/narrow/sparse
clone/fetch ideas since my original RFC and have come to the conclusion
that the server can do the size limiting efficiently, but we should
leave the pathname filtering to the client. That is, let the client
get the commits and trees and then locally apply pattern matching,
whether that be a sparse-checkout definition or simple ".git*"
matching and then make a later request for the "blobs of interest".

Whether we "fault-in" the missing blobs or have a "fetch-blobs"
command (like fetch-pack) to get them is open to debate.

There are concerns about the size of the requested blob-id list in a
fetch-blobs approach, but I think there are ways to say I need all
of the blobs referenced by the directory /foo in commit (and
optionally, that aren't present in directory /foo in commit
or in the range ..). (handwave)

Jeff
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Fri, Jun 02, 2017 at 04:25:08PM -0700, Jonathan Nieder wrote:

> > We have a name-hash cache extension in the bitmap file, but it doesn't
> > carry enough information to deduce the .git-ness of a file. I don't
> > think it would be too hard to add a "flags" extension, and give a single
> > bit to "this is a .git file".
>
> A nicer approach IMHO is to include an extra bitmap, like the existing
> object-type bitmaps (see the dump_bitmap calls in
> bitmap_writer_finish). This would represent the set of all
> .git* blobs in the pack.

Yeah, it could be stored as a bitmap, which would be slightly smaller
(since it would be mostly 0's). I think either way it would need an
extension flag in the header to signal its presence. Older versions of
Git are OK with having flags they don't understand. I know JGit used to
complain about seeing a bitmap with unknown flags, but I'm not sure if
that is still the case.

> > If you're not just avoiding large blobs but trying
> > to get a narrow clone, you don't want the .git files from the
> > uninteresting parts of the tree.
>
> This is something I've wondered about, too. Part of the story is that
> we haven't started omitting trees, so there is already O(number of
> trees) objects being sent and some additional small blobs for .git*
> specials doesn't make it much worse. Sending all .git* blobs keeps
> things simple since the server doesn't have to infer which .git* blobs
> are relevant to this client.
>
> Longer term, we will likely want to allow clients to request omission
> of some trees, too. Omitting the corresponding .git* files becomes
> more straightforward at that point.
>
> Does that make sense?

Yeah, I agree we'd want to avoid the trees, too, in that case.

-Peff
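The extra-bitmap idea discussed above can be pictured with a plain bit
array indexed by pack position: bit i is set when object i appears in
some tree under a name beginning with ".git". This is only an
illustrative sketch with names of my own invention; Git's real bitmaps
are EWAH-compressed and written in pack-bitmap-write.c.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the proposed ".git* blobs" bitmap. */
struct gitfile_bitmap {
	uint64_t *words;
	size_t nr_objects;
};

static struct gitfile_bitmap *gitfile_bitmap_new(size_t nr_objects)
{
	struct gitfile_bitmap *b = malloc(sizeof(*b));
	b->nr_objects = nr_objects;
	/* one bit per object in the pack, initially all 0 */
	b->words = calloc((nr_objects + 63) / 64, sizeof(uint64_t));
	return b;
}

/* Called once per tree entry while writing the bitmap index:
 * mark pack position `pos` if the entry's basename starts with ".git". */
static void gitfile_bitmap_note_entry(struct gitfile_bitmap *b, size_t pos,
				      const char *basename)
{
	if (!strncmp(basename, ".git", 4))
		b->words[pos / 64] |= (uint64_t)1 << (pos % 64);
}

static int gitfile_bitmap_test(const struct gitfile_bitmap *b, size_t pos)
{
	return (b->words[pos / 64] >> (pos % 64)) & 1;
}
```

Because most objects are not .git* files, such a bitmap is mostly
zeroes, which is exactly the case EWAH run-length compression handles
well; that is the "slightly smaller" point made above.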
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
Jeff King writes:

> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.

Thanks; I didn't think of this one while reviewing the overall design,
but I agree that this is something important to think about.
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
>> +--blob-max-bytes=<n>::
>> +	This option can only be used with --stdout. If specified, a blob
>> +	larger than this will not be packed unless a to-be-packed tree
>> +	has that blob with a filename beginning with ".git". The size
>> +	can be suffixed with "k", "m", or "g", and may be "0".
>> ++
>> +If specified, after printing the packfile, pack-objects will print the
>> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
>> +in hash order, pack-objects will print its hash (20 bytes) and its size
>> +(8 bytes). All numbers are in network byte order.
>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>
> Hmm. So this feature can't be used with bitmaps at all? That seems like
> a big downside, as we still have to walk the whole graph to come up with
> the list of blobs (even if we're sending them). That's 30-40 seconds of
> CPU per clone on torvalds/linux. I imagine it's much worse on
> repositories big enough to need a full GVFS-style "don't send any blobs"
> approach.
>
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".

A nicer approach IMHO is to include an extra bitmap, like the existing
object-type bitmaps (see the dump_bitmap calls in
bitmap_writer_finish). This would represent the set of all
.git* blobs in the pack.

[...]

> If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree.

This is something I've wondered about, too. Part of the story is that
we haven't started omitting trees, so there is already O(number of
trees) objects being sent and some additional small blobs for .git*
specials doesn't make it much worse. Sending all .git* blobs keeps
things simple since the server doesn't have to infer which .git* blobs
are relevant to this client.

Longer term, we will likely want to allow clients to request omission
of some trees, too. Omitting the corresponding .git* files becomes
more straightforward at that point.

Does that make sense?

Jonathan
Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

> +--blob-max-bytes=<n>::
> +	This option can only be used with --stdout. If specified, a blob
> +	larger than this will not be packed unless a to-be-packed tree
> +	has that blob with a filename beginning with ".git". The size
> +	can be suffixed with "k", "m", or "g", and may be "0".
> ++
> +If specified, after printing the packfile, pack-objects will print the
> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
> +in hash order, pack-objects will print its hash (20 bytes) and its size
> +(8 bytes). All numbers are in network byte order.
> ++
> +If pack-objects cannot efficiently determine, for an arbitrary blob,
> +that no to-be-packed tree has that blob with a filename beginning with
> +".git" (for example, if bitmaps are used to calculate the objects to be
> +packed), it will pack all blobs regardless of size.

Hmm. So this feature can't be used with bitmaps at all? That seems like
a big downside, as we still have to walk the whole graph to come up with
the list of blobs (even if we're sending them). That's 30-40 seconds of
CPU per clone on torvalds/linux. I imagine it's much worse on
repositories big enough to need a full GVFS-style "don't send any blobs"
approach.

We have a name-hash cache extension in the bitmap file, but it doesn't
carry enough information to deduce the .git-ness of a file. I don't
think it would be too hard to add a "flags" extension, and give a single
bit to "this is a .git file".

I do also wonder if the two features would need to be separated for a
GVFS-style solution. If you're not just avoiding large blobs but trying
to get a narrow clone, you don't want the .git files from the
uninteresting parts of the tree. You want to get no blobs at all, and
then fault them in as they become relevant due to user action.

-Peff
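The trailer format the documentation describes (an 8-byte big-endian
count, then per excluded blob a 20-byte hash followed by an 8-byte
big-endian size) can be sketched like this; the helper names are my
own, not from the patch:

```c
#include <stdint.h>
#include <string.h>

#define HASH_SZ 20  /* SHA-1 */

/* Write a 64-bit value in network (big-endian) byte order. */
static void put_be64(unsigned char *out, uint64_t v)
{
	int i;
	for (i = 7; i >= 0; i--) {
		out[i] = v & 0xff;
		v >>= 8;
	}
}

static uint64_t get_be64(const unsigned char *in)
{
	uint64_t v = 0;
	int i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/*
 * Serialize the excluded-blob trailer into `out`; the hashes must
 * already be in hash order. Returns the number of bytes written,
 * i.e. 8 + nr * (HASH_SZ + 8).
 */
static size_t write_trailer(unsigned char *out,
			    const unsigned char (*hashes)[HASH_SZ],
			    const uint64_t *sizes, uint64_t nr)
{
	unsigned char *p = out;
	uint64_t i;

	put_be64(p, nr);
	p += 8;
	for (i = 0; i < nr; i++) {
		memcpy(p, hashes[i], HASH_SZ);
		p += HASH_SZ;
		put_be64(p, sizes[i]);
		p += 8;
	}
	return p - out;
}
```

Because every record is fixed-size and the entries are in hash order, a
client can store the trailer directly to disk and binary-search it,
which is presumably the "convenient for a client" property the commit
message mentions.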
[WIP v2 2/2] pack-objects: support --blob-max-bytes
As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, teach pack-objects to support --blob-max-bytes, packing only
blobs not exceeding that size unless the blob corresponds to a file
whose name starts with ".git". upload-pack will eventually be taught to
use this new parameter if needed to exclude certain blobs during a fetch
or clone, potentially drastically reducing network consumption when
serving these very large repositories.

Since any excluded blob should not act as a delta base, I did this by
preventing such oversized blobs from ever entering the "to_pack" data
structure, which contains both preferred bases (which do not go into the
final packfile but can act as delta bases) and the objects to be packed
(which go into the final packfile and also can act as delta bases). To
that end, add_object_entry() has been modified to exclude the
appropriate non-preferred-base objects. (Preferred bases, regardless of
size, still enter "to_pack" - they are something that the client
indicates that it has, not something that the server needs to serve, so
no exclusion occurs.)

In addition, any such exclusions are recorded and sent before the
packfile. These are recorded in binary format and in hash order, in a
format convenient for a client to use if stored directly to disk.

The other method in which objects are added to "to_pack",
add_object_entry_from_bitmap(), has not been modified - thus packing in
the presence of bitmaps still packs all blobs regardless of size. See
the documentation update in this commit for the rationale.
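The exclusion rule the commit message describes could be sketched as
below. This is my own simplification, not the patch's code: the real
check lives inside add_object_entry() in builtin/pack-objects.c, and
the function name here is hypothetical.

```c
#include <string.h>

static long blob_max_bytes = -1;  /* -1 means no limit configured */

/*
 * Decide whether a candidate blob stays out of to_pack.
 * A blob is excluded only if a limit is set, the blob exceeds it,
 * it is not a preferred base, and its tree-entry basename does not
 * begin with ".git".
 */
static int exclude_blob(unsigned long size, const char *basename,
			int preferred_base)
{
	if (preferred_base)
		return 0;  /* client has it; keep it as a delta base */
	if (blob_max_bytes < 0)
		return 0;  /* feature disabled */
	if (size <= (unsigned long)blob_max_bytes)
		return 0;  /* within the limit */
	if (basename && !strncmp(basename, ".git", 4))
		return 0;  /* .git* files are always packed */
	return 1;
}
```

Keeping excluded blobs out of to_pack entirely, rather than filtering
them at write time, is what guarantees they cannot be chosen as delta
bases for objects that are sent.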
Signed-off-by: Jonathan Tan
---
 Documentation/git-pack-objects.txt |  19 ++-
 builtin/pack-objects.c             | 105 -
 t/t5300-pack-object.sh             |  71 +
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 8973510a4..23ff8b5be 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--stdout | base-name]
+	[--revs [--unpacked | --all]]
+	[--stdout [--blob-max-bytes=<n>] | base-name]
 	[--shallow] [--keep-true-parents] < object-list
@@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 With this option, parents that are hidden by grafts are packed
 nevertheless.
 
+--blob-max-bytes=<n>::
+	This option can only be used with --stdout. If specified, a blob
+	larger than this will not be packed unless a to-be-packed tree
+	has that blob with a filename beginning with ".git". The size
+	can be suffixed with "k", "m", or "g", and may be "0".
++
+If specified, after printing the packfile, pack-objects will print the
+count of excluded blobs (8 bytes). Subsequently, for each excluded blob
+in hash order, pack-objects will print its hash (20 bytes) and its size
+(8 bytes). All numbers are in network byte order.
++
+If pack-objects cannot efficiently determine, for an arbitrary blob,
+that no to-be-packed tree has that blob with a filename beginning with
+".git" (for example, if bitmaps are used to calculate the objects to be
+packed), it will pack all blobs regardless of size.
+
 
 SEE ALSO
 --------
 linkgit:git-rev-list[1]

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 730fa3d85..46578c05b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static long blob_max_bytes = -1;
+
 /*
  * stats
  */
@@ -1069,17 +1071,73 @@ static const char no_closure_warning[] = N_(
 "disabling bitmap writing, as some objects are not being packed"
 );
 
+struct oversized_blob {
+	struct hashmap_entry entry;
+	struct object_id oid;
+	unsigned long size;
+};
+struct hashmap oversized_blobs;
+
+static int oversized_blob_cmp_fn(const void *b1, const void *b2,
+				 const void *keydata)
+{
+	const struct object_id *oid2 = keydata ? keydata :
+		&((const struct oversized_blob *) b2)->oid;
+	return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2);
+}
+
+/*
+ * Returns 1 and records this blob in oversized_blobs if the given blob does
+ * not meet any defined blob size limits. Blobs that appear as a tree entry
+ * whose basename begins with ".git" are never considered oversized.
+ *
+ * If the tree entry corresponding to the given blob is unknown, pass NULL as
+ * name. In this case, this function will always return