Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-25 Thread Ramkumar Ramachandra
Vicent Marti wrote:
 $ time ../git/git pack-objects --all --stdout
 Counting objects: 3053537, done.
 Compressing objects: 100% (495706/495706), done.
 Total 3053537 (delta 2529614), reused 3053537 (delta 2529614)

 real0m36.686s
 user0m34.440s
 sys 0m2.184s

 $ time ../git/git pack-objects --all --stdout
 Counting objects: 3053537, done.
 Compressing objects: 100% (495706/495706), done.
 Total 3053537 (delta 2529614), reused 3053537 (delta 2529614)

 real0m7.255s
 user0m6.892s
 sys 0m0.444s

Awesome work!  Can you put up this series on gh:vmg so I can try it
out for myself?

 diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
 index b7cab18..469b8da 100644
 --- a/builtin/pack-objects.c
 +++ b/builtin/pack-objects.c
 +   if (!strcmp(k, pack.usebitmaps)) {
 +   bitmap_support = git_config_bool(k, v);
 +   return 0;
 +   }

Not using config_error_nonbool() to indicate an error?

 +   if (use_bitmap_index) {
 +   uint32_t size_hint;
 +
 +   if (!prepare_bitmap_walk(revs, size_hint)) {
 +   khint_t new_hash_size = (size_hint * (1.0 / 
 __ac_HASH_UPPER)) + 0.5;

How does this work?  You've taken the inverse of __ac_HASH_UPPER,
multiplied it by the size_hint you get from prepare_bitmap_walk(), and
add 0.5?

 +   kh_resize_sha1(packed_objects, new_hash_size);

So packed_objects is a hashtable of type kh_sha1_t * (you introduced
in [03/16]) that you're now resizing to new_hash_size.  To find out
what the significance of this new_hash_size is, it looks like I have
to read prepare_bitmap_walk().

 +   nr_alloc = (size_hint + 63)  ~63;
 +   objects = xrealloc(objects, nr_alloc * sizeof(struct 
 object_entry *));

Interesting.  The only other place where we realloc the objects in
this file is in pack-objects.c:949, and we do that because nr_object
= nr_alloc.  What is this 63 magic?

 if (prepare_revision_walk(revs))
 die(revision walk setup failed);
 +

Stray newline.

 +   if (bitmap_support) {
 +   if (use_internal_rev_list  pack_to_stdout)
 +   use_bitmap_index = 1;
 +   }
 +

Wait, what does pack_to_stdout have to do with deciding whether or not
to walk the bitmap?

 diff --git a/pack-bitmap.c b/pack-bitmap.c
 new file mode 100644
 index 000..090db15
 --- /dev/null
 +++ b/pack-bitmap.c
 +struct stored_bitmap {
 +   unsigned char sha1[20];
 +   struct ewah_bitmap *root;
 +   struct stored_bitmap *xor;
 +   int flags;
 +};

What exactly is this?  What is stored_bitmap *xor?  It looks like some
sort of next-pointer, but why is it named xor?

 +struct bitmap_index {

Okay, the bitmap index.

 +   struct ewah_bitmap *commits;
 +   struct ewah_bitmap *trees;
 +   struct ewah_bitmap *blobs;
 +   struct ewah_bitmap *tags;

I might be asking a really stupid question here, but why do you have
different bitmaps for different object types?  Unless I'm mistaken,
the packfile index doesn't make this differentiation: it sorts and
stores the SHA-1s of the various objects; you request a SHA-1, it does
a binary search and returns the object.

 +   khash_sha1 *bitmaps;

A hashmap keyed with the SHA-1, I presume.

 +   struct packed_git *pack;

You're defining which pack this bitmap index is for, right?

 +   struct {
 +   struct object_array entries;
 +   khash_sha1 *map;
 +   } fake_index;

What is this?

 +   struct bitmap *result;

No clue what this is about.

 +   int entry_count;

No clue what this is, but I'm assuming it can't be important because
it's an int.

 +   char pack_checksum[20];
 +
 +   int version;

Use something invariant like uint32_t?  Also, there is no clear
indication about where this information is going to go (header,
presumably?).  Look at pack.h:

struct pack_idx_header {
uint32_t idx_signature;
uint32_t idx_version;
};

 +   unsigned loaded : 1,
 +native_bitmaps : 1,
 +has_hash_cache : 1;

Booleans, but I don't know what they're doing here even after reading
your bitmap-format.txt.

 +   struct ewah_bitmap *(*read_bitmap)(struct bitmap_index *index);

I'm very confused now.  Each bitmap_index has a specialized read_bitmap()?

 +   void *map;
 +   size_t map_size, map_pos;
 +
 +   uint32_t *delta_hashes;

I'll give up on the rest.

 +static struct bitmap_index bitmap_git;

You could have made the struct static to begin with and ended it with
a } bitmap_git;

 +static struct ewah_bitmap *
 +lookup_stored_bitmap(struct stored_bitmap *st)

Please conform to Linux style and make it easier for us to grep by
putting this in one line?

 +{
 +   struct ewah_bitmap *parent;
 +   struct 

Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-25 Thread Thomas Rast
Vicent Marti tan...@gmail.com writes:

 diff --git a/Makefile b/Makefile
 index e03c773..0f2e72b 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -703,6 +703,7 @@ LIB_H += notes.h
  LIB_H += object.h
  LIB_H += pack-revindex.h
  LIB_H += pack.h
 +LIB_H += pack-bitmap.h
  LIB_H += parse-options.h
  LIB_H += patch-ids.h
  LIB_H += pathspec.h
 @@ -838,6 +839,7 @@ LIB_OBJS += notes.o
  LIB_OBJS += notes-cache.o
  LIB_OBJS += notes-merge.o
  LIB_OBJS += object.o
 +LIB_OBJS += pack-bitmap.o
  LIB_OBJS += pack-check.o
  LIB_OBJS += pack-revindex.o
  LIB_OBJS += pack-write.o

What does this apply on?  When starting with the series from
origin/master, git-am fails, and 'git am -3' tells me I don't have the
necessary blobs (from the 'index' line above).

Not that it's super hard to fix this up as long as it's in the Makefile
only, but still.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-25 Thread Junio C Hamano
Vicent Marti tan...@gmail.com writes:

 @@ -83,6 +84,9 @@ static struct progress *progress_state;
  static int pack_compression_level = Z_DEFAULT_COMPRESSION;
  static int pack_compression_seen;
  
 +static int bitmap_support;
 +static int use_bitmap_index;

OK.

 @@ -2131,6 +2135,10 @@ static int git_pack_config(const char *k, const char 
 *v, void *cb)
   cache_max_small_delta_size = git_config_int(k, v);
   return 0;
   }
 + if (!strcmp(k, pack.usebitmaps)) {
 + bitmap_support = git_config_bool(k, v);
 + return 0;
 + }

Hmph, so bitmap_support, not use_bitmap_index, keeps track of the
user request?  Somewhat confusing.

   if (!strcmp(k, pack.threads)) {
   delta_search_threads = git_config_int(k, v);
   if (delta_search_threads  0)
 @@ -2366,8 +2374,24 @@ static void get_object_list(int ac, const char **av)
   die(bad revision '%s', line);
   }
  
 + if (use_bitmap_index) {
 + uint32_t size_hint;
 +
 + if (!prepare_bitmap_walk(revs, size_hint)) {
 + khint_t new_hash_size = (size_hint * (1.0 / 
 __ac_HASH_UPPER)) + 0.5;

What is __ac_HASH_UPPER?  That is a very unusual name for a variable
or a constant.  Also it is mildly annoying to see unnecessary use of
float like this.

 + kh_resize_sha1(packed_objects, new_hash_size);
 +
 + nr_alloc = (size_hint + 63)  ~63;
 + objects = xrealloc(objects, nr_alloc * sizeof(struct 
 object_entry *));
 +
 + traverse_bitmap_commit_list(add_object_entry_1);
 + return;
 + }
 + }
 +
   if (prepare_revision_walk(revs))
   die(revision walk setup failed);
 +
   mark_edges_uninteresting(revs.commits, revs, show_edge);
   traverse_commit_list(revs, show_commit, show_object, NULL);
  
 @@ -2495,6 +2519,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
 char *prefix)
   N_(pack compression level)),
   OPT_SET_INT(0, keep-true-parents, grafts_replace_parents,
   N_(do not hide commits by grafts), 0),
 + OPT_BOOL(0, bitmaps, bitmap_support,
 +  N_(enable support for bitmap optimizations)),

Please match this with the name of configuration variable, i.e. --use-bitmaps

   OPT_END(),
   };
  
 @@ -2561,6 +2587,11 @@ int cmd_pack_objects(int argc, const char **argv, 
 const char *prefix)
   if (keep_unreachable  unpack_unreachable)
   die(--keep-unreachable and --unpack-unreachable are 
 incompatible.);
  
 + if (bitmap_support) {
 + if (use_internal_rev_list  pack_to_stdout)
 + use_bitmap_index = 1;

OK, so only when some internal condition is met, the user request to
use bitmap is honored and the deision is kept in use_bitmap_index.

It may be easier to read if you get rid of bitmap_support, set
user_bitmap_index directly from the command line and config, and did
this here instead:

if (!(use_internal_rev_list  pack_to_stdout))
use_bitmap_index = 0;

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-25 Thread Vicent Martí
On Wed, Jun 26, 2013 at 1:06 AM, Junio C Hamano gits...@pobox.com wrote:
 @@ -83,6 +84,9 @@ static struct progress *progress_state;
  static int pack_compression_level = Z_DEFAULT_COMPRESSION;
  static int pack_compression_seen;

 +static int bitmap_support;
 +static int use_bitmap_index;

 OK.

 @@ -2131,6 +2135,10 @@ static int git_pack_config(const char *k, const char 
 *v, void *cb)
   cache_max_small_delta_size = git_config_int(k, v);
   return 0;
   }
 + if (!strcmp(k, pack.usebitmaps)) {
 + bitmap_support = git_config_bool(k, v);
 + return 0;
 + }

 Hmph, so bitmap_support, not use_bitmap_index, keeps track of the
 user request?  Somewhat confusing.

   if (!strcmp(k, pack.threads)) {
   delta_search_threads = git_config_int(k, v);
   if (delta_search_threads  0)
 @@ -2366,8 +2374,24 @@ static void get_object_list(int ac, const char **av)
   die(bad revision '%s', line);
   }

 + if (use_bitmap_index) {
 + uint32_t size_hint;
 +
 + if (!prepare_bitmap_walk(revs, size_hint)) {
 + khint_t new_hash_size = (size_hint * (1.0 / 
 __ac_HASH_UPPER)) + 0.5;

 What is __ac_HASH_UPPER?  That is a very unusual name for a variable
 or a constant.  Also it is mildly annoying to see unnecessary use of
 float like this.

See the updated patch at:

https://github.com/vmg/git/blob/vmg/bitmaps-master/builtin/pack-objects.c#L2422


 + kh_resize_sha1(packed_objects, new_hash_size);
 +
 + nr_alloc = (size_hint + 63)  ~63;
 + objects = xrealloc(objects, nr_alloc * sizeof(struct 
 object_entry *));
 +
 + traverse_bitmap_commit_list(add_object_entry_1);
 + return;
 + }
 + }
 +
   if (prepare_revision_walk(revs))
   die(revision walk setup failed);
 +
   mark_edges_uninteresting(revs.commits, revs, show_edge);
   traverse_commit_list(revs, show_commit, show_object, NULL);

 @@ -2495,6 +2519,8 @@ int cmd_pack_objects(int argc, const char **argv, 
 const char *prefix)
   N_(pack compression level)),
   OPT_SET_INT(0, keep-true-parents, grafts_replace_parents,
   N_(do not hide commits by grafts), 0),
 + OPT_BOOL(0, bitmaps, bitmap_support,
 +  N_(enable support for bitmap optimizations)),

 Please match this with the name of configuration variable, i.e. --use-bitmaps

   OPT_END(),
   };

 @@ -2561,6 +2587,11 @@ int cmd_pack_objects(int argc, const char **argv, 
 const char *prefix)
   if (keep_unreachable  unpack_unreachable)
   die(--keep-unreachable and --unpack-unreachable are 
 incompatible.);

 + if (bitmap_support) {
 + if (use_internal_rev_list  pack_to_stdout)
 + use_bitmap_index = 1;

 OK, so only when some internal condition is met, the user request to
 use bitmap is honored and the deision is kept in use_bitmap_index.

 It may be easier to read if you get rid of bitmap_support, set
 user_bitmap_index directly from the command line and config, and did
 this here instead:

 if (!(use_internal_rev_list  pack_to_stdout))
 use_bitmap_index = 0;

Yeah, I'm not particularly happy with the way these flags are
implemented. I'll update this.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-24 Thread Vicent Marti
A bitmap index is used, if available, to speed up the Counting Objects
phase during `pack-objects`.

The bitmap index is a `.bitmap` file that can be found inside
`$GIT_DIR/objects/pack/`, next to its corresponding packfile, and
contains precalculated reachability information for selected commits.
The full specification of the format for these bitmap indexes can be found
in `Documentation/technical/bitmap-format.txt`.

For a given commit SHA1, if it happens to be available in the bitmap
index, its bitmap will represent every single object that is reachable
from the commit itself. The nth bit in the bitmap is the nth object in
the index of the packfile; if it's set to 1, the object is reachable.

By using the bitmaps available in the index, this commit implements a new
pair of functions:

- `prepare_bitmap_walk`
- `traverse_bitmap_commit_list`

This first function tries to build a bitmap of all the objects that can be
reached from the commit roots of a given `rev_info` struct by using
the following algorithm:

- If all the interesting commits for a revision walk are available in
the index, the resulting reachability bitmap is the bitwise OR of all
the individual bitmaps.

- When the full set of WANTs is not available in the index, we perform a
partial revision walk using the commits that don't have bitmaps as
roots, and limiting the revision walk as soon as we reach a commit that
has a corresponding bitmap. The earlier OR'ed bitmap with all the
indexed commits can now be completed as this walk progresses, so the end
result is the full reachability list.

- For revision walks with a HAVEs set (a set of commits that are deemed
uninteresting), first we perform the same method as for the WANTs, but
using our HAVEs as roots, in order to obtain a full reachability bitmap
of all the uninteresting commits. This bitmap then can be used to:

a) limit the subsequent walk when building the WANTs bitmap
b) finding the final set of interesting commits by performing an
   AND-NOT of the WANTs and the HAVEs.

If `prepare_bitmap_walk` runs successfully, the resulting bitmap is
stored and the equivalent of a `traverse_commit_list` call can be
performed by using `traverse_bitmap_commit_list`; the bitmap version
of this call yields the objects straight from the packfile index
(without having to look them up or parse them) and hence is several
orders of magnitude faster.

If the `prepare_bitmap_walk` call fails (e.g. because no bitmap files
are available), the `rev_info` struct is left untouched, and can be used
to perform a manual rev-walk using `traverse_commit_list`.

Hence, this new pair of functions are a generic API that allows to
perform the equivalent of

git rev-list --objects [roots...] [^uninteresting...]

for any set of commits, even if they don't have specific bitmaps
generated for them.

In this specific commit, we use the API to perform the
`Counting Objects` phase in `builtin/pack-objects.c`, although it could
be used to speed up other parts of Git that use the same mechanism.

If the pack-objects invocation is being piped to `stdout` (like a normal
`pack-objects` from `upload-pack` would be used) and bitmaps are
enabled, the new `bitmap_walk` API will be used instead of
`traverse_commit_list`.

There are two ways to enable bitmaps for pack-objecs:

- Pass the `--use-bitmaps` flag when calling `pack-objects`
- Set `pack.usebitmaps` to `true` in the git config for the
repository.

Of course, simply enabling the bitmaps is not enought to perform the
optimization: a bitmap index must be available on disk. If no bitmap
index can be found, we'll silently fall back to the slow counting
objects phase.

The point of speeding up the Counting Objects phase of `pack-objects` is
to reduce fetch and clone times for big repositories, which right now
are definitely dominated by the rev-walk algorithm during the Counting
Objects phase.

Here are some sample timings from a full pack of `torvalds/linux` (i.e.
something very similar to what would be generated for a clone of the
repository):

$ time ../git/git pack-objects --all --stdout
Counting objects: 3053537, done.
Compressing objects: 100% (495706/495706), done.
Total 3053537 (delta 2529614), reused 3053537 (delta 2529614)

real0m36.686s
user0m34.440s
sys 0m2.184s

$ time ../git/git pack-objects --all --stdout
Counting objects: 3053537, done.
Compressing objects: 100% (495706/495706), done.
Total 3053537 (delta 2529614), reused 3053537 (delta 2529614)

real0m7.255s
user0m6.892s
sys 0m0.444s

From a hotspot profiling run, we can see how the counting
objects phase has been reduced to about 400ms (down from 28s).
The remaining time is spent finding deltas and writing the packfile, the
optimization of which is out of the scope of this topic.
---
 Makefile   |