Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects
Vicent Marti wrote: $ time ../git/git pack-objects --all --stdout Counting objects: 3053537, done. Compressing objects: 100% (495706/495706), done. Total 3053537 (delta 2529614), reused 3053537 (delta 2529614) real0m36.686s user0m34.440s sys 0m2.184s $ time ../git/git pack-objects --all --stdout Counting objects: 3053537, done. Compressing objects: 100% (495706/495706), done. Total 3053537 (delta 2529614), reused 3053537 (delta 2529614) real0m7.255s user0m6.892s sys 0m0.444s Awesome work! Can you put up this series on gh:vmg so I can try it out for myself? diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index b7cab18..469b8da 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c + if (!strcmp(k, pack.usebitmaps)) { + bitmap_support = git_config_bool(k, v); + return 0; + } Not using config_error_nonbool() to indicate an error? + if (use_bitmap_index) { + uint32_t size_hint; + + if (!prepare_bitmap_walk(revs, size_hint)) { + khint_t new_hash_size = (size_hint * (1.0 / __ac_HASH_UPPER)) + 0.5; How does this work? You've taken the inverse of __ac_HASH_UPPER, multiplied it by the size_hint you get from prepare_bitmap_walk(), and add 0.5? + kh_resize_sha1(packed_objects, new_hash_size); So packed_objects is a hashtable of type kh_sha1_t * (you introduced in [03/16]) that you're now resizing to new_hash_size. To find out what the significance of this new_hash_size is, it looks like I have to read prepare_bitmap_walk(). + nr_alloc = (size_hint + 63) ~63; + objects = xrealloc(objects, nr_alloc * sizeof(struct object_entry *)); Interesting. The only other place where we realloc the objects in this file is in pack-objects.c:949, and we do that because nr_object = nr_alloc. What is this 63 magic? if (prepare_revision_walk(revs)) die(revision walk setup failed); + Stray newline. + if (bitmap_support) { + if (use_internal_rev_list pack_to_stdout) + use_bitmap_index = 1; + } + Wait, what does pack_to_stdout have to do with deciding whether or not to walk the bitmap? diff --git a/pack-bitmap.c b/pack-bitmap.c new file mode 100644 index 000..090db15 --- /dev/null +++ b/pack-bitmap.c +struct stored_bitmap { + unsigned char sha1[20]; + struct ewah_bitmap *root; + struct stored_bitmap *xor; + int flags; +}; What exactly is this? What is stored_bitmap *xor? It looks like some sort of next-pointer, but why is it named xor? +struct bitmap_index { Okay, the bitmap index. + struct ewah_bitmap *commits; + struct ewah_bitmap *trees; + struct ewah_bitmap *blobs; + struct ewah_bitmap *tags; I might be asking a really stupid question here, but why do you have different bitmaps for different object types? Unless I'm mistaken, the packfile index doesn't make this differentiation: it sorts and stores the SHA-1s of the various objects; you request a SHA-1, it does a binary search and returns the object. + khash_sha1 *bitmaps; A hashmap keyed with the SHA-1, I presume. + struct packed_git *pack; You're defining which pack this bitmap index is for, right? + struct { + struct object_array entries; + khash_sha1 *map; + } fake_index; What is this? + struct bitmap *result; No clue what this is about. + int entry_count; No clue what this is, but I'm assuming it can't be important because it's an int. + char pack_checksum[20]; + + int version; Use something invariant like uint32_t? Also, there is no clear indication about where this information is going to go (header, presumably?). Look at pack.h: struct pack_idx_header { uint32_t idx_signature; uint32_t idx_version; }; + unsigned loaded : 1, +native_bitmaps : 1, +has_hash_cache : 1; Booleans, but I don't know what they're doing here even after reading your bitmap-format.txt. + struct ewah_bitmap *(*read_bitmap)(struct bitmap_index *index); I'm very confused now. Each bitmap_index has a specialized read_bitmap()? + void *map; + size_t map_size, map_pos; + + uint32_t *delta_hashes; I'll give up on the rest. +static struct bitmap_index bitmap_git; You could have made the struct static to begin with and ended it with a } bitmap_git; +static struct ewah_bitmap * +lookup_stored_bitmap(struct stored_bitmap *st) Please conform to Linux style and make it easier for us to grep by putting this in one line? +{ + struct ewah_bitmap *parent; + struct
Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects
Vicent Marti tan...@gmail.com writes: diff --git a/Makefile b/Makefile index e03c773..0f2e72b 100644 --- a/Makefile +++ b/Makefile @@ -703,6 +703,7 @@ LIB_H += notes.h LIB_H += object.h LIB_H += pack-revindex.h LIB_H += pack.h +LIB_H += pack-bitmap.h LIB_H += parse-options.h LIB_H += patch-ids.h LIB_H += pathspec.h @@ -838,6 +839,7 @@ LIB_OBJS += notes.o LIB_OBJS += notes-cache.o LIB_OBJS += notes-merge.o LIB_OBJS += object.o +LIB_OBJS += pack-bitmap.o LIB_OBJS += pack-check.o LIB_OBJS += pack-revindex.o LIB_OBJS += pack-write.o What does this apply on? When starting with the series from origin/master, git-am fails, and 'git am -3' tells me I don't have the necessary blobs (from the 'index' line above). Not that it's super hard to fix this up as long as it's in the Makefile only, but still. -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects
Vicent Marti tan...@gmail.com writes: @@ -83,6 +84,9 @@ static struct progress *progress_state; static int pack_compression_level = Z_DEFAULT_COMPRESSION; static int pack_compression_seen; +static int bitmap_support; +static int use_bitmap_index; OK. @@ -2131,6 +2135,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) cache_max_small_delta_size = git_config_int(k, v); return 0; } + if (!strcmp(k, pack.usebitmaps)) { + bitmap_support = git_config_bool(k, v); + return 0; + } Hmph, so bitmap_support, not use_bitmap_index, keeps track of the user request? Somewhat confusing. if (!strcmp(k, pack.threads)) { delta_search_threads = git_config_int(k, v); if (delta_search_threads 0) @@ -2366,8 +2374,24 @@ static void get_object_list(int ac, const char **av) die(bad revision '%s', line); } + if (use_bitmap_index) { + uint32_t size_hint; + + if (!prepare_bitmap_walk(revs, size_hint)) { + khint_t new_hash_size = (size_hint * (1.0 / __ac_HASH_UPPER)) + 0.5; What is __ac_HASH_UPPER? That is a very unusual name for a variable or a constant. Also it is mildly annoying to see unnecessary use of float like this. + kh_resize_sha1(packed_objects, new_hash_size); + + nr_alloc = (size_hint + 63) ~63; + objects = xrealloc(objects, nr_alloc * sizeof(struct object_entry *)); + + traverse_bitmap_commit_list(add_object_entry_1); + return; + } + } + if (prepare_revision_walk(revs)) die(revision walk setup failed); + mark_edges_uninteresting(revs.commits, revs, show_edge); traverse_commit_list(revs, show_commit, show_object, NULL); @@ -2495,6 +2519,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) N_(pack compression level)), OPT_SET_INT(0, keep-true-parents, grafts_replace_parents, N_(do not hide commits by grafts), 0), + OPT_BOOL(0, bitmaps, bitmap_support, + N_(enable support for bitmap optimizations)), Please match this with the name of configuration variable, i.e. --use-bitmaps OPT_END(), }; @@ -2561,6 +2587,11 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (keep_unreachable unpack_unreachable) die(--keep-unreachable and --unpack-unreachable are incompatible.); + if (bitmap_support) { + if (use_internal_rev_list pack_to_stdout) + use_bitmap_index = 1; OK, so only when some internal condition is met, the user request to use bitmap is honored and the deision is kept in use_bitmap_index. It may be easier to read if you get rid of bitmap_support, set user_bitmap_index directly from the command line and config, and did this here instead: if (!(use_internal_rev_list pack_to_stdout)) use_bitmap_index = 0; -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects
On Wed, Jun 26, 2013 at 1:06 AM, Junio C Hamano gits...@pobox.com wrote: @@ -83,6 +84,9 @@ static struct progress *progress_state; static int pack_compression_level = Z_DEFAULT_COMPRESSION; static int pack_compression_seen; +static int bitmap_support; +static int use_bitmap_index; OK. @@ -2131,6 +2135,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) cache_max_small_delta_size = git_config_int(k, v); return 0; } + if (!strcmp(k, pack.usebitmaps)) { + bitmap_support = git_config_bool(k, v); + return 0; + } Hmph, so bitmap_support, not use_bitmap_index, keeps track of the user request? Somewhat confusing. if (!strcmp(k, pack.threads)) { delta_search_threads = git_config_int(k, v); if (delta_search_threads 0) @@ -2366,8 +2374,24 @@ static void get_object_list(int ac, const char **av) die(bad revision '%s', line); } + if (use_bitmap_index) { + uint32_t size_hint; + + if (!prepare_bitmap_walk(revs, size_hint)) { + khint_t new_hash_size = (size_hint * (1.0 / __ac_HASH_UPPER)) + 0.5; What is __ac_HASH_UPPER? That is a very unusual name for a variable or a constant. Also it is mildly annoying to see unnecessary use of float like this. See the updated patch at: https://github.com/vmg/git/blob/vmg/bitmaps-master/builtin/pack-objects.c#L2422 + kh_resize_sha1(packed_objects, new_hash_size); + + nr_alloc = (size_hint + 63) ~63; + objects = xrealloc(objects, nr_alloc * sizeof(struct object_entry *)); + + traverse_bitmap_commit_list(add_object_entry_1); + return; + } + } + if (prepare_revision_walk(revs)) die(revision walk setup failed); + mark_edges_uninteresting(revs.commits, revs, show_edge); traverse_commit_list(revs, show_commit, show_object, NULL); @@ -2495,6 +2519,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) N_(pack compression level)), OPT_SET_INT(0, keep-true-parents, grafts_replace_parents, N_(do not hide commits by grafts), 0), + OPT_BOOL(0, bitmaps, bitmap_support, + N_(enable support for bitmap optimizations)), Please match this with the name of configuration variable, i.e. --use-bitmaps OPT_END(), }; @@ -2561,6 +2587,11 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (keep_unreachable unpack_unreachable) die(--keep-unreachable and --unpack-unreachable are incompatible.); + if (bitmap_support) { + if (use_internal_rev_list pack_to_stdout) + use_bitmap_index = 1; OK, so only when some internal condition is met, the user request to use bitmap is honored and the deision is kept in use_bitmap_index. It may be easier to read if you get rid of bitmap_support, set user_bitmap_index directly from the command line and config, and did this here instead: if (!(use_internal_rev_list pack_to_stdout)) use_bitmap_index = 0; Yeah, I'm not particularly happy with the way these flags are implemented. I'll update this. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/16] pack-objects: use bitmaps when packing objects
A bitmap index is used, if available, to speed up the Counting Objects phase during `pack-objects`. The bitmap index is a `.bitmap` file that can be found inside `$GIT_DIR/objects/pack/`, next to its corresponding packfile, and contains precalculated reachability information for selected commits. The full specification of the format for these bitmap indexes can be found in `Documentation/technical/bitmap-format.txt`. For a given commit SHA1, if it happens to be available in the bitmap index, its bitmap will represent every single object that is reachable from the commit itself. The nth bit in the bitmap is the nth object in the index of the packfile; if it's set to 1, the object is reachable. By using the bitmaps available in the index, this commit implements a new pair of functions: - `prepare_bitmap_walk` - `traverse_bitmap_commit_list` This first function tries to build a bitmap of all the objects that can be reached from the commit roots of a given `rev_info` struct by using the following algorithm: - If all the interesting commits for a revision walk are available in the index, the resulting reachability bitmap is the bitwise OR of all the individual bitmaps. - When the full set of WANTs is not available in the index, we perform a partial revision walk using the commits that don't have bitmaps as roots, and limiting the revision walk as soon as we reach a commit that has a corresponding bitmap. The earlier OR'ed bitmap with all the indexed commits can now be completed as this walk progresses, so the end result is the full reachability list. - For revision walks with a HAVEs set (a set of commits that are deemed uninteresting), first we perform the same method as for the WANTs, but using our HAVEs as roots, in order to obtain a full reachability bitmap of all the uninteresting commits. This bitmap then can be used to: a) limit the subsequent walk when building the WANTs bitmap b) finding the final set of interesting commits by performing an AND-NOT of the WANTs and the HAVEs. If `prepare_bitmap_walk` runs successfully, the resulting bitmap is stored and the equivalent of a `traverse_commit_list` call can be performed by using `traverse_bitmap_commit_list`; the bitmap version of this call yields the objects straight from the packfile index (without having to look them up or parse them) and hence is several orders of magnitude faster. If the `prepare_bitmap_walk` call fails (e.g. because no bitmap files are available), the `rev_info` struct is left untouched, and can be used to perform a manual rev-walk using `traverse_commit_list`. Hence, this new pair of functions are a generic API that allows to perform the equivalent of git rev-list --objects [roots...] [^uninteresting...] for any set of commits, even if they don't have specific bitmaps generated for them. In this specific commit, we use the API to perform the `Counting Objects` phase in `builtin/pack-objects.c`, although it could be used to speed up other parts of Git that use the same mechanism. If the pack-objects invocation is being piped to `stdout` (like a normal `pack-objects` from `upload-pack` would be used) and bitmaps are enabled, the new `bitmap_walk` API will be used instead of `traverse_commit_list`. There are two ways to enable bitmaps for pack-objecs: - Pass the `--use-bitmaps` flag when calling `pack-objects` - Set `pack.usebitmaps` to `true` in the git config for the repository. Of course, simply enabling the bitmaps is not enought to perform the optimization: a bitmap index must be available on disk. If no bitmap index can be found, we'll silently fall back to the slow counting objects phase. The point of speeding up the Counting Objects phase of `pack-objects` is to reduce fetch and clone times for big repositories, which right now are definitely dominated by the rev-walk algorithm during the Counting Objects phase. Here are some sample timings from a full pack of `torvalds/linux` (i.e. something very similar to what would be generated for a clone of the repository): $ time ../git/git pack-objects --all --stdout Counting objects: 3053537, done. Compressing objects: 100% (495706/495706), done. Total 3053537 (delta 2529614), reused 3053537 (delta 2529614) real0m36.686s user0m34.440s sys 0m2.184s $ time ../git/git pack-objects --all --stdout Counting objects: 3053537, done. Compressing objects: 100% (495706/495706), done. Total 3053537 (delta 2529614), reused 3053537 (delta 2529614) real0m7.255s user0m6.892s sys 0m0.444s From a hotspot profiling run, we can see how the counting objects phase has been reduced to about 400ms (down from 28s). The remaining time is spent finding deltas and writing the packfile, the optimization of which is out of the scope of this topic. --- Makefile |