[PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-11 Thread Christian Couder
From: Jeff King 

Hosting providers that allow users to "fork" existing
repos want those forks to share as much disk space as
possible.

Alternates are an existing solution to keep all the
objects from all the forks into a unique central repo,
but this can have some drawbacks. Especially when
packing the central repo, deltas will be created
between objects from different forks.

This can make cloning or fetching a fork much slower
and much more CPU intensive as Git might have to
compute new deltas for many objects to avoid sending
objects from a different fork.

Because the inefficiency primarily arises when an
object is delitified against another object that does
not exist in the same fork, we partition objects into
sets that appear in the same fork, and define
"delta islands". When finding delta base, we do not
allow an object outside the same island to be
considered as its base.

So "delta islands" is a way to store objects from
different forks in the same repo and packfile without
having deltas between objects from different forks.

This patch implements the delta islands mechanism in
"delta-islands.{c,h}", but does not yet make use of it.

A few new fields are added in 'struct object_entry'
in "pack-objects.h" though.

The documentation will follow in a patch that actually
uses delta islands in "builtin/pack-objects.c".

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 Makefile|   1 +
 delta-islands.c | 493 
 delta-islands.h |  11 ++
 pack-objects.h  |   4 +
 4 files changed, 509 insertions(+)
 create mode 100644 delta-islands.c
 create mode 100644 delta-islands.h

diff --git a/Makefile b/Makefile
index bc4fc8eeab..e7994888e8 100644
--- a/Makefile
+++ b/Makefile
@@ -841,6 +841,7 @@ LIB_OBJS += csum-file.o
 LIB_OBJS += ctype.o
 LIB_OBJS += date.o
 LIB_OBJS += decorate.o
+LIB_OBJS += delta-islands.o
 LIB_OBJS += diffcore-break.o
 LIB_OBJS += diffcore-delta.o
 LIB_OBJS += diffcore-order.o
diff --git a/delta-islands.c b/delta-islands.c
new file mode 100644
index 00..c0b0da6837
--- /dev/null
+++ b/delta-islands.c
@@ -0,0 +1,493 @@
+#include "cache.h"
+#include "attr.h"
+#include "object.h"
+#include "blob.h"
+#include "commit.h"
+#include "tag.h"
+#include "tree.h"
+#include "delta.h"
+#include "pack.h"
+#include "tree-walk.h"
+#include "diff.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "progress.h"
+#include "refs.h"
+#include "khash.h"
+#include "pack-bitmap.h"
+#include "pack-objects.h"
+#include "delta-islands.h"
+#include "sha1-array.h"
+#include "config.h"
+
+KHASH_INIT(str, const char *, void *, 1, kh_str_hash_func, kh_str_hash_equal)
+
+static khash_sha1 *island_marks;
+static unsigned island_counter;
+static unsigned island_counter_core;
+
+static kh_str_t *remote_islands;
+
+struct remote_island {
+   uint64_t hash;
+   struct oid_array oids;
+};
+
+struct island_bitmap {
+   uint32_t refcount;
+   uint32_t bits[];
+};
+
+static uint32_t island_bitmap_size;
+
+/*
+ * Allocate a new bitmap; if "old" is not NULL, the new bitmap will be a copy
+ * of "old". Otherwise, the new bitmap is empty.
+ */
+static struct island_bitmap *island_bitmap_new(const struct island_bitmap *old)
+{
+   size_t size = sizeof(struct island_bitmap) + (island_bitmap_size * 4);
+   struct island_bitmap *b = xcalloc(1, size);
+
+   if (old)
+   memcpy(b, old, size);
+
+   b->refcount = 1;
+   return b;
+}
+
+static void island_bitmap_or(struct island_bitmap *a, const struct 
island_bitmap *b)
+{
+   uint32_t i;
+
+   for (i = 0; i < island_bitmap_size; ++i)
+   a->bits[i] |= b->bits[i];
+}
+
+static int island_bitmap_is_subset(struct island_bitmap *self,
+   struct island_bitmap *super)
+{
+   uint32_t i;
+
+   if (self == super)
+   return 1;
+
+   for (i = 0; i < island_bitmap_size; ++i) {
+   if ((self->bits[i] & super->bits[i]) != self->bits[i])
+   return 0;
+   }
+
+   return 1;
+}
+
+#define ISLAND_BITMAP_BLOCK(x) (x / 32)
+#define ISLAND_BITMAP_MASK(x) (1 << (x % 32))
+
+static void island_bitmap_set(struct island_bitmap *self, uint32_t i)
+{
+   self->bits[ISLAND_BITMAP_BLOCK(i)] |= ISLAND_BITMAP_MASK(i);
+}
+
+static int island_bitmap_get(struct island_bitmap *self, uint32_t i)
+{
+   return (self->bits[ISLAND_BITMAP_BLOCK(i)] & ISLAND_BITMAP_MASK(i)) != 
0;
+}
+
+int in_same_island(const struct object_id *trg_oid, const struct object_id 
*src_oid)
+{
+   khiter_t trg_pos, src_pos;
+
+   /* If we aren't using islands, assume everything goes together. */
+   if (!island_marks)
+   return 1;
+
+   /*
+* If we don't have a bitmap for the target, we can delta it
+* against anything -- it's not an important object
+*/
+   trg_pos = kh_get_sha1(island_marks, trg_oid->hash);
+   if (trg_pos >= kh_end(island_marks))
+  

Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-12 Thread Ramsay Jones



On 12/08/18 06:11, Christian Couder wrote:
> From: Jeff King 
> 
> Hosting providers that allow users to "fork" existing
> repos want those forks to share as much disk space as
> possible.
> 
> Alternates are an existing solution to keep all the
> objects from all the forks into a unique central repo,
> but this can have some drawbacks. Especially when
> packing the central repo, deltas will be created
> between objects from different forks.
> 
> This can make cloning or fetching a fork much slower
> and much more CPU intensive as Git might have to
> compute new deltas for many objects to avoid sending
> objects from a different fork.
> 
> Because the inefficiency primarily arises when an
> object is delitified against another object that does

s/delitified/deltified/ ?

> not exist in the same fork, we partition objects into
> sets that appear in the same fork, and define
> "delta islands". When finding delta base, we do not
> allow an object outside the same island to be
> considered as its base.
> 
> So "delta islands" is a way to store objects from
> different forks in the same repo and packfile without
> having deltas between objects from different forks.
> 
> This patch implements the delta islands mechanism in
> "delta-islands.{c,h}", but does not yet make use of it.
> 
> A few new fields are added in 'struct object_entry'
> in "pack-objects.h" though.
> 
> The documentation will follow in a patch that actually
> uses delta islands in "builtin/pack-objects.c".
> 
> Signed-off-by: Jeff King 
> Signed-off-by: Christian Couder 
> ---
>  Makefile|   1 +
>  delta-islands.c | 493 
>  delta-islands.h |  11 ++
>  pack-objects.h  |   4 +
>  4 files changed, 509 insertions(+)
>  create mode 100644 delta-islands.c
>  create mode 100644 delta-islands.h
> 
> diff --git a/Makefile b/Makefile
> index bc4fc8eeab..e7994888e8 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -841,6 +841,7 @@ LIB_OBJS += csum-file.o
>  LIB_OBJS += ctype.o
>  LIB_OBJS += date.o
>  LIB_OBJS += decorate.o
> +LIB_OBJS += delta-islands.o
>  LIB_OBJS += diffcore-break.o
>  LIB_OBJS += diffcore-delta.o
>  LIB_OBJS += diffcore-order.o
> diff --git a/delta-islands.c b/delta-islands.c
> new file mode 100644
> index 00..c0b0da6837
> --- /dev/null
> +++ b/delta-islands.c
> @@ -0,0 +1,493 @@
> +#include "cache.h"
> +#include "attr.h"
> +#include "object.h"
> +#include "blob.h"
> +#include "commit.h"
> +#include "tag.h"
> +#include "tree.h"
> +#include "delta.h"
> +#include "pack.h"
> +#include "tree-walk.h"
> +#include "diff.h"
> +#include "revision.h"
> +#include "list-objects.h"
> +#include "progress.h"
> +#include "refs.h"
> +#include "khash.h"

I was wondering how many copies of the inline functions
introduced by this header we had, so:

  $ nm git | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | sort -rn | grep kh_
3 kh_resize_sha1
3 kh_put_sha1
3 kh_init_sha1
3 kh_get_sha1
1 kh_resize_str
1 kh_resize_sha1_pos
1 kh_put_str
1 kh_put_sha1_pos
1 kh_init_str
1 kh_init_sha1_pos
1 kh_get_str
1 kh_get_sha1_pos
1 kh_destroy_sha1
  $ 

Looking at the individual object files, we see:

  $ nm pack-bitmap-write.o | grep ' t ' | grep kh_
  01cc t kh_get_sha1
  01b7 t kh_init_sha1
  085e t kh_put_sha1
  0310 t kh_resize_sha1
  $ 

So, the two instances of the sha1 hash-map are never
destroyed (kh_destroy_sha1 is not present in the object
file).

  $ nm pack-bitmap.o | grep ' t ' | grep kh_
  02d9 t kh_destroy_sha1
  032b t kh_get_sha1
  0daa t kh_get_sha1_pos
  02c4 t kh_init_sha1
  0d95 t kh_init_sha1_pos
  09bd t kh_put_sha1
  1432 t kh_put_sha1_pos
  046f t kh_resize_sha1
  0eee t kh_resize_sha1_pos
  $ 

The sha1_pos hash-map is not destroyed here.

  $ nm delta-islands.o | grep ' t ' | grep kh_
  02be t kh_get_sha1
  0e52 t kh_get_str
  02a9 t kh_init_sha1
  0e3d t kh_init_str
  0950 t kh_put_sha1
  14e4 t kh_put_str
  0402 t kh_resize_sha1
  0f96 t kh_resize_str
  $ 

And neither the sha1 or str hash-maps are destroyed here.
(That is not necessarily a problem, of course! ;-) )
   
> +#include "pack-bitmap.h"
> +#include "pack-objects.h"
> +#include "delta-islands.h"
> +#include "sha1-array.h"
> +#include "config.h"
> +
> +KHASH_INIT(str, const char *, void *, 1, kh_str_hash_func, kh_str_hash_equal)
> +
> +static khash_sha1 *island_marks;
> +static unsigned island_counter;
> +static unsigned island_counter_core;
> +
> +static kh_str_t *remote_islands;
> +
> +struct remote_island {
> + uint64_t hash;
> + struct oid_array oids;
> +};
> +
> +struct island_bitmap {
> + uint32_t refcount;
> + uint32_t bits[];

Use FLEX_ARRAY here?

Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-12 Thread Christian Couder
On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones
 wrote:
> On 12/08/18 06:11, Christian Couder wrote:

>> Because the inefficiency primarily arises when an
>> object is delitified against another object that does
>
> s/delitified/deltified/ ?

Ok, this will be in the next reroll if any.

>> not exist in the same fork, we partition objects into
>> sets that appear in the same fork, and define
>> "delta islands". When finding delta base, we do not
>> allow an object outside the same island to be
>> considered as its base.

>> --- /dev/null
>> +++ b/delta-islands.c
>> @@ -0,0 +1,493 @@
>> +#include "cache.h"
>> +#include "attr.h"
>> +#include "object.h"
>> +#include "blob.h"
>> +#include "commit.h"
>> +#include "tag.h"
>> +#include "tree.h"
>> +#include "delta.h"
>> +#include "pack.h"
>> +#include "tree-walk.h"
>> +#include "diff.h"
>> +#include "revision.h"
>> +#include "list-objects.h"
>> +#include "progress.h"
>> +#include "refs.h"
>> +#include "khash.h"
>
> I was wondering how many copies of the inline functions
> introduced by this header we had, so:
>
>   $ nm git | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | sort -rn | grep kh_
> 3 kh_resize_sha1
> 3 kh_put_sha1
> 3 kh_init_sha1
> 3 kh_get_sha1
> 1 kh_resize_str
> 1 kh_resize_sha1_pos
> 1 kh_put_str
> 1 kh_put_sha1_pos
> 1 kh_init_str
> 1 kh_init_sha1_pos
> 1 kh_get_str
> 1 kh_get_sha1_pos
> 1 kh_destroy_sha1
>   $
>
> Looking at the individual object files, we see:
>
>   $ nm pack-bitmap-write.o | grep ' t ' | grep kh_
>   01cc t kh_get_sha1
>   01b7 t kh_init_sha1
>   085e t kh_put_sha1
>   0310 t kh_resize_sha1
>   $
>
> So, the two instances of the sha1 hash-map are never
> destroyed (kh_destroy_sha1 is not present in the object
> file).

This is interesting (even though it seems related to more code than
the current patch series).

As those hash maps are in 'struct bitmap_writer' and a static instance is used:

  static struct bitmap_writer writer;

it maybe ok.

>   $ nm pack-bitmap.o | grep ' t ' | grep kh_
>   02d9 t kh_destroy_sha1
>   032b t kh_get_sha1
>   0daa t kh_get_sha1_pos
>   02c4 t kh_init_sha1
>   0d95 t kh_init_sha1_pos
>   09bd t kh_put_sha1
>   1432 t kh_put_sha1_pos
>   046f t kh_resize_sha1
>   0eee t kh_resize_sha1_pos
>   $
>
> The sha1_pos hash-map is not destroyed here.

Yeah, maybe a line like:

kh_destroy_pos(b->ext_index.positions);

is missing from free_bitmap_index()?

Adding that should be in a separate patch from this series though.

>   $ nm delta-islands.o | grep ' t ' | grep kh_
>   02be t kh_get_sha1
>   0e52 t kh_get_str
>   02a9 t kh_init_sha1
>   0e3d t kh_init_str
>   0950 t kh_put_sha1
>   14e4 t kh_put_str
>   0402 t kh_resize_sha1
>   0f96 t kh_resize_str
>   $
>
> And neither the sha1 or str hash-maps are destroyed here.
> (That is not necessarily a problem, of course! ;-) )

The instances are declared as static:

  static khash_sha1 *island_marks;

  static kh_str_t *remote_islands;

so it maybe ok.

>> +struct island_bitmap {
>> + uint32_t refcount;
>> + uint32_t bits[];
>
> Use FLEX_ARRAY here? We are slowly moving toward requiring
> certain C99 features, but I can't remember a flex array
> weather-balloon patch.

This was already discussed by Junio and Peff there:

https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/

>> +};

>> +int in_same_island(const struct object_id *trg_oid, const struct object_id 
>> *src_oid)
>
> Hmm, what does the trg_ prefix stand for?
>
>> +{
>> + khiter_t trg_pos, src_pos;
>> +
>> + /* If we aren't using islands, assume everything goes together. */
>> + if (!island_marks)
>> + return 1;
>> +
>> + /*
>> +  * If we don't have a bitmap for the target, we can delta it
>
> ... Ah, OK, trg_ => target.

I am ok to replace "trg" with "target" (or maybe "dst"? or something
else) and "src" with "source" if you think it would make things
clearer.

>> +static void add_ref_to_island(const char *island_name, const struct 
>> object_id *oid)
>> +{
>> + uint64_t sha_core;
>> + struct remote_island *rl = NULL;
>> +
>> + int hash_ret;
>> + khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret);
>> +
>> + if (hash_ret) {
>> + kh_key(remote_islands, pos) = xstrdup(island_name);
>> + kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct 
>> remote_island));
>> + }
>> +
>> + rl = kh_value(remote_islands, pos);
>> + oid_array_append(&rl->oids, oid);
>> +
>> + memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>> + rl->hash += sha_core;
>
> Hmm, so the first 64-bits of the oid of each ref that is part of
> this is

Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-13 Thread Ramsay Jones



On 13/08/18 04:33, Christian Couder wrote:
> On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones
[snip]
>> And neither the sha1 or str hash-maps are destroyed here.
>> (That is not necessarily a problem, of course! ;-) )
> 
> The instances are declared as static:
> 
>   static khash_sha1 *island_marks;
> 
>   static kh_str_t *remote_islands;
> 
> so it maybe ok.

Yes, that's fine.

> 
>>> +struct island_bitmap {
>>> + uint32_t refcount;
>>> + uint32_t bits[];
>>
>> Use FLEX_ARRAY here? We are slowly moving toward requiring
>> certain C99 features, but I can't remember a flex array
>> weather-balloon patch.
> 
> This was already discussed by Junio and Peff there:
> 
> https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/
> 

That is a fine discussion, without a firm conclusion, but I don't
think you can simply do nothing here:

  $ cat -n junk.c
   1#include 
   2
   3struct island_bitmap {
   4uint32_t refcount;
   5uint32_t bits[];
   6};
   7
  $ gcc --std=c89 --pedantic -c junk.c
  junk.c:5:11: warning: ISO C90 does not support flexible array members 
[-Wpedantic]
uint32_t bits[];
 ^~~~
  $ gcc --std=c99 --pedantic -c junk.c
  $ 
  
>>> +};
> 
>>> +int in_same_island(const struct object_id *trg_oid, const struct object_id 
>>> *src_oid)
>>
>> Hmm, what does the trg_ prefix stand for?
>>
>>> +{
>>> + khiter_t trg_pos, src_pos;
>>> +
>>> + /* If we aren't using islands, assume everything goes together. */
>>> + if (!island_marks)
>>> + return 1;
>>> +
>>> + /*
>>> +  * If we don't have a bitmap for the target, we can delta it
>>
>> ... Ah, OK, trg_ => target.
> 
> I am ok to replace "trg" with "target" (or maybe "dst"? or something
> else) and "src" with "source" if you think it would make things
> clearer.

If it had been dst_ (or target), I would not have had a 'huh?'
moment, but it is not all that important.

> 
>>> +static void add_ref_to_island(const char *island_name, const struct 
>>> object_id *oid)
>>> +{
>>> + uint64_t sha_core;
>>> + struct remote_island *rl = NULL;
>>> +
>>> + int hash_ret;
>>> + khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret);
>>> +
>>> + if (hash_ret) {
>>> + kh_key(remote_islands, pos) = xstrdup(island_name);
>>> + kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct 
>>> remote_island));
>>> + }
>>> +
>>> + rl = kh_value(remote_islands, pos);
>>> + oid_array_append(&rl->oids, oid);
>>> +
>>> + memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>>> + rl->hash += sha_core;
>>
>> Hmm, so the first 64-bits of the oid of each ref that is part of
>> this island is added together as a 'hash' for the island. And this
>> is used to de-duplicate the islands? Any false positives? (does it
>> matter - it would only affect performance, not correctness, right?)
> 
> I would think that a false positive from pure chance is very unlikely.
> We would need to approach billions of delta islands (as 2 to the power
> 64/2 is in the order of billions) for the probability to be
> significant. GitHub has less than 50 millions users and it is very
> unlikely that a significant proportion of these users will fork the
> same repo.
> 
> Now if there is a false positive because two forks have exactly the
> same refs, then it is not a problem if they are considered the same,
> because they are actually the same.

Yep, good point.

ATB,
Ramsay Jones


Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-13 Thread Jeff King
On Mon, Aug 13, 2018 at 01:17:18PM +0100, Ramsay Jones wrote:

> >>> +struct island_bitmap {
> >>> + uint32_t refcount;
> >>> + uint32_t bits[];
> >>
> >> Use FLEX_ARRAY here? We are slowly moving toward requiring
> >> certain C99 features, but I can't remember a flex array
> >> weather-balloon patch.
> > 
> > This was already discussed by Junio and Peff there:
> > 
> > https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/
> > 
> 
> That is a fine discussion, without a firm conclusion, but I don't
> think you can simply do nothing here:
> 
>   $ cat -n junk.c
>1  #include 
>2  
>3  struct island_bitmap {
>4  uint32_t refcount;
>5  uint32_t bits[];
>6  };
>7  
>   $ gcc --std=c89 --pedantic -c junk.c
>   junk.c:5:11: warning: ISO C90 does not support flexible array members 
> [-Wpedantic]
> uint32_t bits[];
>  ^~~~
>   $ gcc --std=c99 --pedantic -c junk.c

Right, whether we use the FLEX_ALLOC macros or not, this needs to be
declared with FLEX_ARRAY, not an empty "[]".

I'm fine either way on using the FLEX_ALLOC macros.

> >> ... Ah, OK, trg_ => target.
> > 
> > I am ok to replace "trg" with "target" (or maybe "dst"? or something
> > else) and "src" with "source" if you think it would make things
> > clearer.
> 
> If it had been dst_ (or target), I would not have had a 'huh?'
> moment, but it is not all that important.

FWIW, these are all inherited from try_delta(), etc, in the existing
code. I'm fine with using another term, but we should probably do it
universally. And if we do, probably "base" is a better name than "src",
since the direction depends on which part of the relationship you are
considering. I'm not sure what that makes "dst".

-Peff


Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-13 Thread Jeff King
On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote:

> >> + memcpy(&sha_core, oid->hash, sizeof(uint64_t));
> >> + rl->hash += sha_core;
> >
> > Hmm, so the first 64-bits of the oid of each ref that is part of
> > this island is added together as a 'hash' for the island. And this
> > is used to de-duplicate the islands? Any false positives? (does it
> > matter - it would only affect performance, not correctness, right?)
> 
> I would think that a false positive from pure chance is very unlikely.
> We would need to approach billions of delta islands (as 2 to the power
> 64/2 is in the order of billions) for the probability to be
> significant. GitHub has less than 50 millions users and it is very
> unlikely that a significant proportion of these users will fork the
> same repo.
> 
> Now if there is a false positive because two forks have exactly the
> same refs, then it is not a problem if they are considered the same,
> because they are actually the same.

Right, the idea is to find such same-ref setups to avoid spending a
pointless bit in the per-object bitmap. In the GitHub setup, it would be
an indication that two people forked at exactly the same time, so they
have the same refs and the same delta requirements. If one of them later
updates, that relationship would change at the next repack.

I don't know that we ever collected numbers for how often this happens.
So let me see if I can dig some up.

On our git/git repository network, it looks like we have ~14k forks, and
~4k are unique by this hashing scheme. So it really is saving us
10k-bits per bitmap. That's over 1k-byte per object in the worst case.
There are ~24M objects (many times what is in git.git, but people push
lots of random things to their forks), so that's saving us up to 24GB in
RAM. Of course it almost certainly isn't that helpful in practice, since
we copy-on-write the bitmaps to avoid the full cost per object. But I
think it's fair to say it is helping (more numbers below).

The distribution of buckets itself looks pretty power-law-ish from
eyeballing it. The biggest one has 92 forks, and then it trails off
quickly to smaller numbers, and then eventually a long tail of
singletons.

The numbers for torvalds/linux are similar. There are ~23k forks, only
~8k of which are unique. rails/rails has 12k/17k unique. So there's some
fluctuation, but the notion that there will be a bunch of identical
forks seems solid.

If you're curious what the actual memory usage is for a repack of
git/git, here are some rough numbers from eyeballing RSS via "top" while
watching the repack progress meter (I didn't want to do a full valgrind
run because it's S-L-O-W):

  - after loading the islands, we're at ~3GB RSS. Some of that is shared
(the mmap'd packed-refs file is 1GB by itself), but probably 1-1.5GB
is real heap, likely from the gigantic fork list.

  - we grow to around 6GB RSS before starting "counting objects". I
think this is largely prepare_revision_walk(), since we're making
"struct commit" objects for all the ref tips (and this is mmap-ing
packfiles, too, which counts against our RSS)

  - by the time "counting objects" is done, we're at ~10GB. We're not
yet running with Duy's memory-slimming for pack-objects, so it's
100+ bytes per object_entry. Which means we're probably only
spending a GB or so on island bitmaps, but at this point we've only
marked commits and root-trees.

  - By the time "propagating island marks" is done, we'll have marked
all the trees and blobs, and RSS is up to ~11GB.

So all in all (and I'd emphasize this is extremely rough) I think it
probably costs about 2GB for the feature in this particular case. But
you need much more to repack at this size sanely anyway.

If I were doing this patch series from scratch, I might implement the
fork de-duping and copy-on-write bits successively so I could take good
measurements of how much they help (and brag about it in the commit
messages). But I don't know if it's worth untangling now or not.

-Peff


Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-15 Thread Christian Couder
On Mon, Aug 13, 2018 at 8:11 PM, Jeff King  wrote:
> On Mon, Aug 13, 2018 at 01:17:18PM +0100, Ramsay Jones wrote:
>
>> >>> +struct island_bitmap {
>> >>> + uint32_t refcount;
>> >>> + uint32_t bits[];
>> >>
>> >> Use FLEX_ARRAY here? We are slowly moving toward requiring
>> >> certain C99 features, but I can't remember a flex array
>> >> weather-balloon patch.
>> >
>> > This was already discussed by Junio and Peff there:
>> >
>> > https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/
>>
>> That is a fine discussion, without a firm conclusion, but I don't
>> think you can simply do nothing here:
>>
>>   $ cat -n junk.c
>>1  #include 
>>2
>>3  struct island_bitmap {
>>4  uint32_t refcount;
>>5  uint32_t bits[];
>>6  };
>>7
>>   $ gcc --std=c89 --pedantic -c junk.c
>>   junk.c:5:11: warning: ISO C90 does not support flexible array members 
>> [-Wpedantic]
>> uint32_t bits[];
>>  ^~~~
>>   $ gcc --std=c99 --pedantic -c junk.c
>
> Right, whether we use the FLEX_ALLOC macros or not, this needs to be
> declared with FLEX_ARRAY, not an empty "[]".

Ok, it will use FLEX_ARRAY in the reroll I will send soon.

Thanks Ramsay and Peff!


Re: [PATCH v4 1/7] Add delta-islands.{c,h}

2018-08-15 Thread Christian Couder
On Mon, Aug 13, 2018 at 9:00 PM, Jeff King  wrote:
> On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote:
>
>> >> + memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>> >> + rl->hash += sha_core;
>> >
>> > Hmm, so the first 64-bits of the oid of each ref that is part of
>> > this island is added together as a 'hash' for the island. And this
>> > is used to de-duplicate the islands? Any false positives? (does it
>> > matter - it would only affect performance, not correctness, right?)
>>
>> I would think that a false positive from pure chance is very unlikely.
>> We would need to approach billions of delta islands (as 2 to the power
>> 64/2 is in the order of billions) for the probability to be
>> significant. GitHub has less than 50 millions users and it is very
>> unlikely that a significant proportion of these users will fork the
>> same repo.
>>
>> Now if there is a false positive because two forks have exactly the
>> same refs, then it is not a problem if they are considered the same,
>> because they are actually the same.
>
> Right, the idea is to find such same-ref setups to avoid spending a
> pointless bit in the per-object bitmap. In the GitHub setup, it would be
> an indication that two people forked at exactly the same time, so they
> have the same refs and the same delta requirements. If one of them later
> updates, that relationship would change at the next repack.
>
> I don't know that we ever collected numbers for how often this happens.
> So let me see if I can dig some up.
>
> On our git/git repository network, it looks like we have ~14k forks, and
> ~4k are unique by this hashing scheme. So it really is saving us
> 10k-bits per bitmap. That's over 1k-byte per object in the worst case.
> There are ~24M objects (many times what is in git.git, but people push
> lots of random things to their forks), so that's saving us up to 24GB in
> RAM. Of course it almost certainly isn't that helpful in practice, since
> we copy-on-write the bitmaps to avoid the full cost per object. But I
> think it's fair to say it is helping (more numbers below).

[...]

> So all in all (and I'd emphasize this is extremely rough) I think it
> probably costs about 2GB for the feature in this particular case. But
> you need much more to repack at this size sanely anyway.

Thanks for the interesting numbers!