Re: [WIP 2/2] pack-objects: support --blob-size-limit
Jonathan Tanwrites: > diff --git a/Documentation/git-pack-objects.txt > b/Documentation/git-pack-objects.txt > index 8973510a4..c4257cacc 100644 > --- a/Documentation/git-pack-objects.txt > +++ b/Documentation/git-pack-objects.txt > @@ -12,7 +12,8 @@ SYNOPSIS > 'git pack-objects' [-q | --progress | --all-progress] > [--all-progress-implied] > [--no-reuse-delta] [--delta-base-offset] [--non-empty] > [--local] [--incremental] [--window=] [--depth=] > - [--revs [--unpacked | --all]] [--stdout | base-name] > + [--revs [--unpacked | --all]] > + [--stdout [--blob-size-limit=] | base-name] > [--shallow] [--keep-true-parents] < object-list > > > @@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it > creates a bundle. > With this option, parents that are hidden by grafts are packed > nevertheless. > > +--blob-size-limit=:: > + This option can only be used with --stdout. If specified, a blob > + of at least this size will not be packed unless a to-be-packed > + tree has that blob with a filename beginning with ".git". > + The size can be suffixed with "k", "m", or "g", and may be "0". It would be nice if the name of the parameter hints that this is counted in bytes, e.g. "--blob-max-bytes"; Or at least s///. > ++ > +If specified, after printing the packfile, pack-objects will print the > +count of excluded blobs (8 bytes) . Subsequently, for each excluded blob > +in hash order, pack-objects will print its hash (20 bytes) and its size > +(8 bytes). All numbers are in network byte order. Do we need to future-proof the output format so that we can later use 32-byte hash? The input to pack-objects (i.e. rev-list --objects) is hexadecimal text, and it may not be so bad to make this also text, e.g. " SP LF". That way, you do not have to worry about byte order, hash size, or length limited to uint64. Right now, the readers of a pack stream like unpack-objects and index-pack know to copy any "garbage" that follow a pack stream to its standard output, so older readers piped downstream of this new pack-objects will simply accept the (somewhat incomplete) pack and ignore this trailer, which is probably what we want to happen. > ++ > +If pack-objects cannot efficiently determine, for an arbitrary blob, > +that no to-be-packed tree has that blob with a filename beginning with > +".git" (for example, if bitmaps are used to calculate the objects to be > +packed), it will pack all blobs regardless of size. > + This is a bit hard to understand, at least to me. Let me step-wise deconstruct what I think you are saying. - A blob can appear in multiple trees, and each tree has name (i.e. pathname component) for the blob. - Sometimes we may know _all_ trees that contain a particular blob and hence _all_ names these trees call this blob. In such a case, we can _prove_ that the blob never is used as ".git" in any tree. - We exclude a blob that is larger than the specified limit, but only when we can prove the above. If it is possible that the blob might appear as ".git" in some tree, the blob is not omitted no matter its size. And this is a very conservative way to avoid omitting something that could be one of those control files like ".gitignore". Am I following your thought correctly? > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 1062d8fe2..aaf7e92b0 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000; > > static unsigned long window_memory_limit = 0; > > +static long blob_size_limit = -1; > + Which is the largest, long, ssize_t, or offset_t? Perhaps intmax_t but it may be overkill? In any case, not within the scope of this series; we assume that an individual object's size fits in ulong around here. > @@ -776,6 +778,50 @@ static const char no_split_warning[] = N_( > "disabling bitmap writing, packs are split due to pack.packSizeLimit" > ); > > +struct oversized_blob { > + struct hashmap_entry entry; > + struct object_id oid; > + unsigned long size; > +}; > +struct hashmap oversized_blobs; > + > +static int oversized_blob_cmp_fn(const void *b1, const void *b2, > + const void *keydata) > +{ > + const struct object_id *oid2 = keydata ? keydata : > + &((const struct oversized_blob *) b2)->oid; > + return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2); > +} > + > +static int oversized_blob_cmp(const void *b1, const void *b2) > +{ > + return oidcmp(&(*(const struct oversized_blob **) b1)->oid, > + &(*(const struct oversized_blob **) b2)->oid); > +} > + > +static void write_oversized_info(void) > +{ > + struct oversized_blob **obs = xmalloc(oversized_blobs.size * > + sizeof(*obs)); Can this multiplication
[WIP 2/2] pack-objects: support --blob-size-limit
As part of an effort to improve Git support for very large repositories in which clients typically have only a subset of all version-controlled blobs, teach pack-objects to support --blob-size-limit, packing only blobs below that size limit unless the blob corresponds to a file whose name starts with ".git". upload-pack will eventually be taught to use this new parameter if needed to exclude certain blobs during a fetch or clone, potentially drastically reducing network consumption when serving these very large repositories. Since any excluded blob should not act as a delta base, I did this by avoiding such oversized blobs from ever going into the "to_pack" data structure, which contains both preferred bases (which do not go into the final packfile but can act as delta bases) and the objects to be packed (which go into the final packfile and also can act as delta bases). To that end, add_object_entry() has been modified to exclude the appropriate non-preferred-base objects. (Preferred bases, regardless of size, still enter "to_pack" - they are something that the client indicates that it has, not something that the server needs to serve, so no exclusion occurs.) Any such exclusions are recorded and sent after the packfile. I have taken the opportunity to rename the "exclude" parameter to "preferred_base" to avoid confusion between the concept of "excluding an object from to_pack" and "including an object in to_pack which will be excluded in the final pack generation". The other method in which objects are added to "to_pack", add_object_entry_from_bitmap(), has not been modified - thus packing in the presence of bitmaps still packs all blobs regardless of size. See the documentation update in this commit for the rationale. Signed-off-by: Jonathan Tan--- Documentation/git-pack-objects.txt | 19 +- builtin/pack-objects.c | 115 +++-- t/t5300-pack-object.sh | 71 +++ 3 files changed, 199 insertions(+), 6 deletions(-) diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 8973510a4..c4257cacc 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -12,7 +12,8 @@ SYNOPSIS 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied] [--no-reuse-delta] [--delta-base-offset] [--non-empty] [--local] [--incremental] [--window=] [--depth=] - [--revs [--unpacked | --all]] [--stdout | base-name] + [--revs [--unpacked | --all]] + [--stdout [--blob-size-limit=] | base-name] [--shallow] [--keep-true-parents] < object-list @@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle. With this option, parents that are hidden by grafts are packed nevertheless. +--blob-size-limit=:: + This option can only be used with --stdout. If specified, a blob + of at least this size will not be packed unless a to-be-packed + tree has that blob with a filename beginning with ".git". + The size can be suffixed with "k", "m", or "g", and may be "0". ++ +If specified, after printing the packfile, pack-objects will print the +count of excluded blobs (8 bytes) . Subsequently, for each excluded blob +in hash order, pack-objects will print its hash (20 bytes) and its size +(8 bytes). All numbers are in network byte order. ++ +If pack-objects cannot efficiently determine, for an arbitrary blob, +that no to-be-packed tree has that blob with a filename beginning with +".git" (for example, if bitmaps are used to calculate the objects to be +packed), it will pack all blobs regardless of size. + SEE ALSO linkgit:git-rev-list[1] diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 1062d8fe2..aaf7e92b0 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000; static unsigned long window_memory_limit = 0; +static long blob_size_limit = -1; + /* * stats */ @@ -776,6 +778,50 @@ static const char no_split_warning[] = N_( "disabling bitmap writing, packs are split due to pack.packSizeLimit" ); +struct oversized_blob { + struct hashmap_entry entry; + struct object_id oid; + unsigned long size; +}; +struct hashmap oversized_blobs; + +static int oversized_blob_cmp_fn(const void *b1, const void *b2, +const void *keydata) +{ + const struct object_id *oid2 = keydata ? keydata : + &((const struct oversized_blob *) b2)->oid; + return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2); +} + +static int oversized_blob_cmp(const void *b1, const void *b2) +{ + return oidcmp(&(*(const struct oversized_blob **) b1)->oid, + &(*(const struct oversized_blob **) b2)->oid); +} + +static void write_oversized_info(void) +{ +