Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-20 Thread Elijah Newren
On Thu, Jul 19, 2018 at 10:28 PM, Jeff King  wrote:
> On Thu, Jul 19, 2018 at 04:11:01PM -0700, Elijah Newren wrote:
>
>> Looking at the output from Peff's instrumentation elsewhere in this
>> thread, I see a lot of lines like
>>mismatched get: 32889efd307c7be376da9e3d45a78305f14ba73a = (, 28)
>> Does that mean it was reading the array when it wasn't ready?
>
> Yes, it looks like we saw a "get" without a "set". Though this could
> also be due to threading. The tracing isn't atomic with respect to the
> actual get/set operation, so it's possible that the ordering of the
> trace output does not match the ordering of the actual operations.
>
>> However, it's interesting to also look at the effect on packing
>> linux.git (on the same beefy hardware):
>>
>> Version  Pack (MB)  MaxRSS(kB)  Time (s)
>> -------  ---------  ----------  --------
>>  2.17.0       1279    11382932    632.24
>>  2.18.0       1279    10817568    621.97
>>  fix-v4       1279    11484168   1193.67
>>
>> While the pack size is nice and small, the original memory savings
>> added in 2.18.0 are gone and the performance is much worse.  :-(
>
> Interesting. I can't reproduce here. The fix-v4 case is only slightly
> slower than 2.18.0. Can you double check that your compiler flags, etc,
> were the same? Many times I've accidentally compared -O0 to -O2. :)

Same build flags each time, but as Duy points out, this was on a
40-processor box.

> You might also try the patch below (on top of fix-v4), which moves the
> locking to its own dedicated mutex. That should reduce lock contention,
> and it fixes the remaining realloc where I think we're still racy. On my
> repack of linux.git, it dropped the runtime from 6m3s to 5m41s, almost
> entirely in system CPU.

I had to add an include of pthread.h to get it to compile, but
otherwise I'm trying it out.  Re-running a few times with each version
to see how big the run-to-run variance is.

> I didn't measure my max rss. However, I'd caution slightly against
> drawing too much conclusion from it, for two reasons:
>
>   1. RSS includes mmap'd packfiles, which is subject to whatever pages
>  the OS feels like keeping in RAM. So using more heap can sometimes
>  go unnoticed in that count, since you're just trading heap pages
>  for mmap pages. Although that implies some memory pressure, and it
>  sounds like your machine is sufficiently beefy to avoid that.
>
>   2. Peak heap is going to depend on the threading. You have one thread
>  per CPU working on a window of objects, each of which will be in
>  memory at once. So I'd expect a fair bit of fluctuation in the peak
>  just depending on how the threads line up with each other (some of
>  it random, and some of it maybe impacted by what the code does, but
>  in a way that just happens to fall out for this specific workload).
>
> Which isn't to say measuring it is useless. The trends may override the
> noise from those two things. I've just run into problems in the past
> trying to get consistent measurements.

Good to know.  I was actually measuring it due to commit 3b13a5f26387
("pack-objects: reorder members to shrink struct object_entry",
2018-04-14) referencing
https://public-inbox.org/git/87po42cwql@evledraar.gmail.com/ as
the measurement methodology used to back up the original change.  (I
did miss out on the "run 3 times,
report the lowest value" that Ævar did, though.)  I figured that as
long as we were further tweaking things along the lines of that patch
series, it made sense to try to do similar measurements to see if we
were at least improving things relative to 2.17.0.

> Here's the lock patch.
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index ba14a1bfbc..b76ce04cb9 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1926,12 +1926,12 @@ static unsigned long oe_delta_size(struct packing_data *pack,
>  {
> unsigned long size;
>
> -   read_lock(); /* to protect access to pack->delta_size[] */
> +   pack_delta_lock(pack);
> if (pack->delta_size)
> size = pack->delta_size[e - pack->objects];
> else
> size = e->delta_size_;
> -   read_unlock();
> +   pack_delta_unlock(pack);
> return size;
>  }
>
> @@ -1939,10 +1939,10 @@ static void oe_set_delta_size(struct packing_data *pack,
>   struct object_entry *e,
>   unsigned long size)
>  {
> -   read_lock(); /* to protect access to pack->delta_size[] */
> +   pack_delta_lock(pack);
> if (!pack->delta_size && size < pack->oe_delta_size_limit) {
> e->delta_size_ = size;
> -   read_unlock();
> +   pack_delta_unlock(pack);
> return;
> }
> /*
> @@ -1963,7 +1963,7 @@ static void oe_set_delta_size(struct packing_data *pack,
> 

Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Duy Nguyen
On Fri, Jul 20, 2018 at 7:28 AM Jeff King  wrote:
>
> On Thu, Jul 19, 2018 at 04:11:01PM -0700, Elijah Newren wrote:
>
> > Looking at the output from Peff's instrumentation elsewhere in this
> > thread, I see a lot of lines like
> >mismatched get: 32889efd307c7be376da9e3d45a78305f14ba73a = (, 28)
> > Does that mean it was reading the array when it wasn't ready?
>
> Yes, it looks like we saw a "get" without a "set". Though this could
> also be due to threading. The tracing isn't atomic with respect to the
> actual get/set operation, so it's possible that the ordering of the
> trace output does not match the ordering of the actual operations.
>
> > However, it's interesting to also look at the effect on packing
> > linux.git (on the same beefy hardware):
> >
> > Version  Pack (MB)  MaxRSS(kB)  Time (s)
> > -------  ---------  ----------  --------
> >  2.17.0       1279    11382932    632.24
> >  2.18.0       1279    10817568    621.97
> >  fix-v4       1279    11484168   1193.67
> >
> > While the pack size is nice and small, the original memory savings
> > added in 2.18.0 are gone and the performance is much worse.  :-(
>
> Interesting. I can't reproduce here. The fix-v4 case is only slightly
> slower than 2.18.0. Can you double check that your compiler flags, etc,
> were the same? Many times I've accidentally compared -O0 to -O2. :)

He ran 40 threads though. That number of threads can make lock
contention very expensive. Yeah my money is also on lock contention.

> You might also try the patch below (on top of fix-v4), which moves the

Another thing Elijah could try is to watch CPU utilization. If this is
lock contention, I think core utilization should be much lower because
we spend more time waiting than actually doing things.

> locking to its own dedicated mutex. That should reduce lock contention,

I think we could use cache_lock() which is for non-odb shared data
(and delta_size[] fits this category)

> and it fixes the remaining realloc where I think we're still racy. On my

Yeah it's not truly racy as you also noted in another mail. I'll make
a note about this in the commit message.

> repack of linux.git, it dropped the runtime from 6m3s to 5m41s, almost
> entirely in system CPU.
-- 
Duy


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Jeff King
On Fri, Jul 20, 2018 at 01:28:29AM -0400, Jeff King wrote:

> @@ -162,8 +166,10 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
>  
>   if (!pdata->in_pack_by_idx)
>   REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
> + pack_delta_lock(pdata);
>   if (pdata->delta_size)
>   REALLOC_ARRAY(pdata->delta_size, pdata->nr_alloc);
> + pack_delta_unlock(pdata);
>   }

This made me wonder if in_pack_by_idx needs a similar lock. But
actually, I don't think either is necessary. We only allocate new
entries during the counting phase, not during the threaded delta search.

So this hunk could be dropped (and the system CPU benefit I saw is just
from reduced lock contention).

-Peff


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Jeff King
On Thu, Jul 19, 2018 at 04:11:01PM -0700, Elijah Newren wrote:

> Looking at the output from Peff's instrumentation elsewhere in this
> thread, I see a lot of lines like
>mismatched get: 32889efd307c7be376da9e3d45a78305f14ba73a = (, 28)
> Does that mean it was reading the array when it wasn't ready?

Yes, it looks like we saw a "get" without a "set". Though this could
also be due to threading. The tracing isn't atomic with respect to the
actual get/set operation, so it's possible that the ordering of the
trace output does not match the ordering of the actual operations.

> However, it's interesting to also look at the effect on packing
> linux.git (on the same beefy hardware):
> 
> Version  Pack (MB)  MaxRSS(kB)  Time (s)
> -------  ---------  ----------  --------
>  2.17.0       1279    11382932    632.24
>  2.18.0       1279    10817568    621.97
>  fix-v4       1279    11484168   1193.67
> 
> While the pack size is nice and small, the original memory savings
> added in 2.18.0 are gone and the performance is much worse.  :-(

Interesting. I can't reproduce here. The fix-v4 case is only slightly
slower than 2.18.0. Can you double check that your compiler flags, etc,
were the same? Many times I've accidentally compared -O0 to -O2. :)

You might also try the patch below (on top of fix-v4), which moves the
locking to its own dedicated mutex. That should reduce lock contention,
and it fixes the remaining realloc where I think we're still racy. On my
repack of linux.git, it dropped the runtime from 6m3s to 5m41s, almost
entirely in system CPU.

I didn't measure my max rss. However, I'd caution slightly against
drawing too much conclusion from it, for two reasons:

  1. RSS includes mmap'd packfiles, which is subject to whatever pages
 the OS feels like keeping in RAM. So using more heap can sometimes
 go unnoticed in that count, since you're just trading heap pages
 for mmap pages. Although that implies some memory pressure, and it
 sounds like your machine is sufficiently beefy to avoid that.

  2. Peak heap is going to depend on the threading. You have one thread
 per CPU working on a window of objects, each of which will be in
 memory at once. So I'd expect a fair bit of fluctuation in the peak
 just depending on how the threads line up with each other (some of
 it random, and some of it maybe impacted by what the code does, but
 in a way that just happens to fall out for this specific workload).

Which isn't to say measuring it is useless. The trends may override the
noise from those two things. I've just run into problems in the past
trying to get consistent measurements.

Here's the lock patch.

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ba14a1bfbc..b76ce04cb9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1926,12 +1926,12 @@ static unsigned long oe_delta_size(struct packing_data *pack,
 {
unsigned long size;
 
-   read_lock(); /* to protect access to pack->delta_size[] */
+   pack_delta_lock(pack);
if (pack->delta_size)
size = pack->delta_size[e - pack->objects];
else
size = e->delta_size_;
-   read_unlock();
+   pack_delta_unlock(pack);
return size;
 }
 
@@ -1939,10 +1939,10 @@ static void oe_set_delta_size(struct packing_data *pack,
  struct object_entry *e,
  unsigned long size)
 {
-   read_lock(); /* to protect access to pack->delta_size[] */
+   pack_delta_lock(pack);
if (!pack->delta_size && size < pack->oe_delta_size_limit) {
e->delta_size_ = size;
-   read_unlock();
+   pack_delta_unlock(pack);
return;
}
/*
@@ -1963,7 +1963,7 @@ static void oe_set_delta_size(struct packing_data *pack,
pack->delta_size[i] = pack->objects[i].delta_size_;
}
pack->delta_size[e - pack->objects] = size;
-   read_unlock();
+   pack_delta_unlock(pack);
 }
 
 static int try_delta(struct unpacked *trg, struct unpacked *src,
diff --git a/pack-objects.c b/pack-objects.c
index e3c32bbfc2..8d9c2dfb82 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -148,6 +148,10 @@ void prepare_packing_data(struct packing_data *pdata)
 1U << OE_SIZE_BITS);
pdata->oe_delta_size_limit = git_env_ulong("GIT_TEST_OE_DELTA_SIZE",
   1U << OE_DELTA_SIZE_BITS);
+
+#ifndef NO_PTHREADS
+   pthread_mutex_init(&pdata->delta_lock, NULL);
+#endif
 }
 
 struct object_entry *packlist_alloc(struct packing_data *pdata,
@@ -162,8 +166,10 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
if (!pdata->in_pack_by_idx)
REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
+   pack_delta_lock(pdata);
if (pdata->delta_size)
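
The hunks above use pack_delta_lock()/pack_delta_unlock() and a delta_lock
member, but the patch as shown is cut off before the part that defines them.
As a rough guess at their shape (this is only a sketch, not the missing
hunks; the NO_PTHREADS fallback in particular is an assumption), they could
look something like:

#include <pthread.h>

struct packing_data {
	/* ... existing members elided ... */
	unsigned long *delta_size;	/* spilled delta sizes, may be NULL */
#ifndef NO_PTHREADS
	pthread_mutex_t delta_lock;	/* protects delta_size[] */
#endif
};

#ifndef NO_PTHREADS
static inline void pack_delta_lock(struct packing_data *pack)
{
	pthread_mutex_lock(&pack->delta_lock);
}

static inline void pack_delta_unlock(struct packing_data *pack)
{
	pthread_mutex_unlock(&pack->delta_lock);
}
#else
/* Single-threaded builds: the lock degenerates to a no-op. */
static inline void pack_delta_lock(struct packing_data *pack) { (void)pack; }
static inline void pack_delta_unlock(struct packing_data *pack) { (void)pack; }
#endif

Keeping the mutex inside struct packing_data, rather than reusing the global
read_lock(), is what lets oe_delta_size()/oe_set_delta_size() avoid contending
with the much longer-held read lock.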
 

Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Elijah Newren
On Thu, Jul 19, 2018 at 11:24 AM, Duy Nguyen  wrote:
> On Thu, Jul 19, 2018 at 07:31:35PM +0200, Duy Nguyen wrote:
>> On Thu, Jul 19, 2018 at 01:23:58PM -0400, Jeff King wrote:
>> > On Thu, Jul 19, 2018 at 09:42:00AM -0700, Elijah Newren wrote:
>> >
>> > > Thanks for the quick turnaround.  Unfortunately, I have some bad news.
>> > > With this patch, I get the following:
>> > >
>> > > $ /usr/bin/time -f 'MaxRSS:%M Time:%e' git gc --aggressive
>> > > Enumerating objects: 4460703, done.
>> > > Counting objects: 100% (4460703/4460703), done.
>> > > Delta compression using up to 40 threads.
>> > > Compressing objects: 100% (3807140/3807140), done.
>> > > Writing objects: 100% (4460703/4460703), done.
>> > > Total 4460703 (delta 2831383), reused 1587071 (delta 0)
>> > > error: failed to unpack compressed delta at offset 183854150 from
>> > > .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack
>> > > fatal: packed object 20ce811e53dabbb8ef9368c108cbbdfa65639c03 (stored
>> > > in .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack)
>> > > is corrupt
>> > > error: failed to run prune
>> > > MaxRSS:40025196 Time:2531.52
>> >
>> > Looking at that output, my _guess_ is that we somehow end up with a
>> > bogus delta_size value and write out a truncated entry. But I couldn't
>> > reproduce the issue with smaller test cases.
>>
>> Could it be a race condition?
>
> I'm convinced my code is racy (between two writes). I created a broken
> pack once with 32 threads. Elijah please try again with this new
> patch. It should fix this (I only tried repack a few times so far but
> will continue)
>
> The race is this
>
> 1. Thread one sees a large delta size and NULL delta_size[] array,
>allocates the new array and in the middle of copying old delta
>sizes over.
>
> 2. Thread two wants to write a new (large) delta size. It sees that
>delta_size[] is already allocated, it writes the correct size there
>(and truncated one in object_entry->delta_size_)
>
> 3. Back to thread one, it now copies the truncated value in
>delta_size_ from step 2 to delta_size[] array, overwriting the good
>value that thread two wrote.
>
> There is also a potential read/write race where a read from
> delta_size[] happens when the array is not ready. But I don't think it
> can happen with current try_delta() code. I protect it anyway to be
> safe.

Looking at the output from Peff's instrumentation elsewhere in this
thread, I see a lot of lines like
   mismatched get: 32889efd307c7be376da9e3d45a78305f14ba73a = (, 28)
Does that mean it was reading the array when it wasn't ready?


Anyway, with your latest patch (which I'm labelling fix-v4), git gc
--aggressive completes, git fsck likes the result, and the new table
of stats on this repo becomes:

Version  Pack (MB)  MaxRSS(kB)  Time (s)
-------  ---------  ----------  --------
 2.17.0       5498    43513628   2494.85
 2.18.0      10531    40449596   4168.94
 fix-v1       5509    42509784   2480.74
 fix-v2       5509    41644104   2468.25
 fix-v4       5500    44400948   2761.74


So, the pack size is back to what is expected.  The code takes about
10% longer and requires 2% more memory than git-2.17.0, but the pack
size was the main issue.


However, it's interesting to also look at the effect on packing
linux.git (on the same beefy hardware):

Version  Pack (MB)  MaxRSS(kB)  Time (s)
-------  ---------  ----------  --------
 2.17.0       1279    11382932    632.24
 2.18.0       1279    10817568    621.97
 fix-v4       1279    11484168   1193.67

While the pack size is nice and small, the original memory savings
added in 2.18.0 are gone and the performance is much worse.  :-(


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Junio C Hamano
Junio C Hamano  writes:

> Imagine that you create a delta and set its size (which happens to
> fit in the entry) and then create another whose size does not fit in
> the entry.  How does oe_delta_size() know not to look at
> pack->delta_size[] and instead look at e->delta_size_ when it wants
> to know the delta for the first one?  IOW, shouldn't there be a
> "backfill" procedure when oe_set_delta_size() notices that we now
> switch to pack->delta_size[] array?

Ah, ignore this.  That is what happens in the prepare_ thing.



Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Junio C Hamano
Duy Nguyen  writes:

> @@ -330,20 +330,27 @@ static inline void oe_set_size(struct packing_data *pack,
>  static inline unsigned long oe_delta_size(struct packing_data *pack,
> const struct object_entry *e)
>  {
> - if (e->delta_size_valid)
> + if (!pack->delta_size)
>   return e->delta_size_;
> - return oe_size(pack, e);
> + return pack->delta_size[e - pack->objects];
>  }
>  
> +void oe_prepare_delta_size_array(struct packing_data *pack);
>  static inline void oe_set_delta_size(struct packing_data *pack,
>struct object_entry *e,
>unsigned long size)
>  {
>   e->delta_size_ = size;
> - e->delta_size_valid = e->delta_size_ == size;
> - if (!e->delta_size_valid && size != oe_size(pack, e))
> - BUG("this can only happen in check_object() "
> - "where delta size is the same as entry size");
> + if (!pack->delta_size && e->delta_size_ == size)
> + return;
> + /*
> +  * We have had at least one delta size exceeding OE_DELTA_SIZE_BITS
> +  * limit. delta_size_ will not be used anymore. All delta sizes are now
> +  * from the delta_size[] array.
> +  */
> + if (!pack->delta_size)
> + oe_prepare_delta_size_array(pack);
> + pack->delta_size[e - pack->objects] = size;
>  }

Imagine that you create a delta and set its size (which happens to
fit in the entry) and then create another whose size does not fit in
the entry.  How does oe_delta_size() know not to look at
pack->delta_size[] and instead look at e->delta_size_ when it wants
to know the delta for the first one?  IOW, shouldn't there be a
"backfill" procedure when oe_set_delta_size() notices that we now
switch to pack->delta_size[] array?



Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Jeff King
On Thu, Jul 19, 2018 at 08:24:42PM +0200, Duy Nguyen wrote:

> > > Looking at that output, my _guess_ is that we somehow end up with a
> > > bogus delta_size value and write out a truncated entry. But I couldn't
> > > reproduce the issue with smaller test cases.
> > 
> > Could it be a race condition?
> 
> I'm convinced my code is racy (between two writes). I created a broken
> pack once with 32 threads. Elijah please try again with this new
> patch. It should fix this (I only tried repack a few times so far but
> will continue)

Good thinking, it's definitely racy. And that's why my tiny reproduction
didn't work. I even tried bumping it up to 10 blobs instead of 2, but
that's not nearly enough.

> The race is this
> 
> 1. Thread one sees a large delta size and NULL delta_size[] array,
>allocates the new array and in the middle of copying old delta
>sizes over.
> 
> 2. Thread two wants to write a new (large) delta size. It sees that
>delta_size[] is already allocated, it writes the correct size there
>(and truncated one in object_entry->delta_size_)
> 
> 3. Back to thread one, it now copies the truncated value in
>delta_size_ from step 2 to delta_size[] array, overwriting the good
>value that thread two wrote.

Right. Or we could even allocate two delta_size arrays, since the
NULL-check and the allocation are not atomic.

> There is also a potential read/write race where a read from
> delta_size[] happens when the array is not ready. But I don't think it
> can happen with current try_delta() code. I protect it anyway to be
> safe.

Hrm. That one's disappointing, because we read much more often than we
write, and this introduces potential lock contention. It may not matter
much in practice, though.

> +static unsigned long oe_delta_size(struct packing_data *pack,
> +const struct object_entry *e)
> +{
> + unsigned long size;
> +
> + read_lock(); /* to protect access to pack->delta_size[] */
> + if (pack->delta_size)
> + size = pack->delta_size[e - pack->objects];
> + else
> + size = e->delta_size_;
> + read_unlock();
> + return size;
> +}

Yuck, we even have to pay the read_lock() cost when we don't overflow
into the pack->delta_size array (but I agree we have to for
correctness).

Again, though, this amount of contention probably doesn't make a big
difference, since we're holding the lock for such a short time
(especially compared to all the work of computing the deltas).

This could be separate from the read_lock(), though, since that one does
block for much longer (e.g., while zlib inflating objects from disk).

> +static void oe_set_delta_size(struct packing_data *pack,
> +   struct object_entry *e,
> +   unsigned long size)
> +{
> + read_lock(); /* to protect access to pack->delta_size[] */
> + if (!pack->delta_size && size < pack->oe_delta_size_limit) {
> + e->delta_size_ = size;
> + read_unlock();
> + return;
> + }

And ditto for this one. I thought we could get away with the "fast case"
skipping the lock, but we have to check pack->delta_size atomically to
even use it.

If each individual delta_size had an overflow bit that indicates "use me
literally" or "look me up in the array", then I think the "quick" ones
could avoid locking. It may not be worth the complexity though.
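
A rough sketch of that overflow-bit idea (entirely hypothetical, not from any
patch in this thread; it assumes the pack_delta_lock() helpers plus the
oe_delta_size_limit, nr_alloc and delta_size[] pieces from the patches above):
each entry carries a bit saying whether its size is stored inline or spilled
to the array, so the common small-delta case never touches the lock.

struct object_entry {
	/* ... other fields elided ... */
	unsigned delta_size_:OE_DELTA_SIZE_BITS; /* inline size, if it fits */
	unsigned delta_size_overflow_:1;	 /* 1 = look it up in the array */
};

static unsigned long oe_delta_size(struct packing_data *pack,
				   const struct object_entry *e)
{
	unsigned long size;

	if (!e->delta_size_overflow_)
		return e->delta_size_;		/* common case: no lock taken */

	pack_delta_lock(pack);
	size = pack->delta_size[e - pack->objects];
	pack_delta_unlock(pack);
	return size;
}

static void oe_set_delta_size(struct packing_data *pack,
			      struct object_entry *e,
			      unsigned long size)
{
	if (size < pack->oe_delta_size_limit) {
		e->delta_size_ = size;
		e->delta_size_overflow_ = 0;
		return;
	}

	pack_delta_lock(pack);
	if (!pack->delta_size)
		ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
	pack->delta_size[e - pack->objects] = size;
	pack_delta_unlock(pack);

	/* Only now direct readers of this entry to the array. A real
	 * implementation would have to think about memory ordering here. */
	e->delta_size_overflow_ = 1;
}

One nice property of this shape is that no backfill pass over existing entries
is needed: a slot in the array is only ever read after that entry's own
overflow bit has been set. Whether that buys enough over the simple locked
version is exactly the complexity question above.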

> @@ -160,6 +162,8 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
>  
>   if (!pdata->in_pack_by_idx)
>   REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
> + if (pdata->delta_size)
> + REALLOC_ARRAY(pdata->delta_size, pdata->nr_alloc);
>   }
>  

This realloc needs to happen under the lock, too, I think. It would be
OK without locking for an in-place realloc, but if the chunk has to be
moved, somebody in oe_set_delta_size() might write to the old memory.

This is in a file that doesn't even know about read_lock(), of course.
Probably you need a delta mutex as part of the "struct packing_data",
and then it can just be handled inside pack-objects.c entirely.
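
A sketch of that arrangement (the function name is made up, and
pack_delta_lock()/pack_delta_unlock() are assumed to wrap a mutex stored in
struct packing_data): doing the NULL check, the allocation and the backfill
under a single lock means only one thread can ever populate delta_size[],
which closes both the double-allocation window and the write/write race.

static void oe_ensure_delta_size_array(struct packing_data *pack)
{
	pack_delta_lock(pack);
	if (!pack->delta_size) {
		uint32_t i;

		/* nr_alloc, not nr_objects, to match packlist_alloc()'s
		 * realloc strategy */
		ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
		for (i = 0; i < pack->nr_objects; i++)
			pack->delta_size[i] = pack->objects[i].delta_size_;
	}
	pack_delta_unlock(pack);
}

Callers that go on to read or write delta_size[] would take the same lock, as
in the patches elsewhere in this thread.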

-Peff


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Duy Nguyen
On Thu, Jul 19, 2018 at 07:31:35PM +0200, Duy Nguyen wrote:
> On Thu, Jul 19, 2018 at 01:23:58PM -0400, Jeff King wrote:
> > On Thu, Jul 19, 2018 at 09:42:00AM -0700, Elijah Newren wrote:
> > 
> > > Thanks for the quick turnaround.  Unfortunately, I have some bad news.
> > > With this patch, I get the following:
> > > 
> > > $ /usr/bin/time -f 'MaxRSS:%M Time:%e' git gc --aggressive
> > > Enumerating objects: 4460703, done.
> > > Counting objects: 100% (4460703/4460703), done.
> > > Delta compression using up to 40 threads.
> > > Compressing objects: 100% (3807140/3807140), done.
> > > Writing objects: 100% (4460703/4460703), done.
> > > Total 4460703 (delta 2831383), reused 1587071 (delta 0)
> > > error: failed to unpack compressed delta at offset 183854150 from
> > > .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack
> > > fatal: packed object 20ce811e53dabbb8ef9368c108cbbdfa65639c03 (stored
> > > in .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack)
> > > is corrupt
> > > error: failed to run prune
> > > MaxRSS:40025196 Time:2531.52
> > 
> > Looking at that output, my _guess_ is that we somehow end up with a
> > bogus delta_size value and write out a truncated entry. But I couldn't
> > reproduce the issue with smaller test cases.
> 
> Could it be a race condition?

I'm convinced my code is racy (between two writes). I created a broken
pack once with 32 threads. Elijah please try again with this new
patch. It should fix this (I only tried repack a few times so far but
will continue)

The race is this

1. Thread one sees a large delta size and NULL delta_size[] array,
   allocates the new array and in the middle of copying old delta
   sizes over.

2. Thread two wants to write a new (large) delta size. It sees that
   delta_size[] is already allocated, it writes the correct size there
   (and truncated one in object_entry->delta_size_)

3. Back to thread one, it now copies the truncated value in
   delta_size_ from step 2 to delta_size[] array, overwriting the good
   value that thread two wrote.

There is also a potential read/write race where a read from
delta_size[] happens when the array is not ready. But I don't think it
can happen with current try_delta() code. I protect it anyway to be
safe.

-- 8< --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ebc8cefb53..d67997f11c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -32,6 +32,12 @@
 #include "object-store.h"
 #include "dir.h"
 
+static unsigned long oe_delta_size(struct packing_data *pack,
+  const struct object_entry *e);
+static void oe_set_delta_size(struct packing_data *pack,
+ struct object_entry *e,
+ unsigned long size);
+
 #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
 #define SIZE(obj) oe_size(&to_pack, obj)
 #define SET_SIZE(obj,size) oe_set_size(&to_pack, obj, size)
@@ -1915,6 +1921,51 @@ unsigned long oe_get_size_slow(struct packing_data *pack,
return size;
 }
 
+static unsigned long oe_delta_size(struct packing_data *pack,
+  const struct object_entry *e)
+{
+   unsigned long size;
+
+   read_lock(); /* to protect access to pack->delta_size[] */
+   if (pack->delta_size)
+   size = pack->delta_size[e - pack->objects];
+   else
+   size = e->delta_size_;
+   read_unlock();
+   return size;
+}
+
+static void oe_set_delta_size(struct packing_data *pack,
+ struct object_entry *e,
+ unsigned long size)
+{
+   read_lock(); /* to protect access to pack->delta_size[] */
+   if (!pack->delta_size && size < pack->oe_delta_size_limit) {
+   e->delta_size_ = size;
+   read_unlock();
+   return;
+   }
+   /*
+* We have had at least one delta size exceeding OE_DELTA_SIZE_BITS
+* limit. delta_size_ will not be used anymore. All delta sizes are now
+* from the delta_size[] array.
+*/
+   if (!pack->delta_size) {
+   uint32_t i;
+
+   /*
+* nr_alloc, not nr_objects to align with realloc() strategy in
+* packlist_alloc()
+*/
+   ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
+
+   for (i = 0; i < pack->nr_objects; i++)
+   pack->delta_size[i] = pack->objects[i].delta_size_;
+   }
+   pack->delta_size[e - pack->objects] = size;
+   read_unlock();
+}
+
 static int try_delta(struct unpacked *trg, struct unpacked *src,
 unsigned max_depth, unsigned long *mem_usage)
 {
@@ -2023,10 +2074,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);
if (!delta_buf)
return 0;
-   if (delta_size >= (1U << 

Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Duy Nguyen
On Thu, Jul 19, 2018 at 01:23:58PM -0400, Jeff King wrote:
> On Thu, Jul 19, 2018 at 09:42:00AM -0700, Elijah Newren wrote:
> 
> > Thanks for the quick turnaround.  Unfortunately, I have some bad news.
> > With this patch, I get the following:
> > 
> > $ /usr/bin/time -f 'MaxRSS:%M Time:%e' git gc --aggressive
> > Enumerating objects: 4460703, done.
> > Counting objects: 100% (4460703/4460703), done.
> > Delta compression using up to 40 threads.
> > Compressing objects: 100% (3807140/3807140), done.
> > Writing objects: 100% (4460703/4460703), done.
> > Total 4460703 (delta 2831383), reused 1587071 (delta 0)
> > error: failed to unpack compressed delta at offset 183854150 from
> > .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack
> > fatal: packed object 20ce811e53dabbb8ef9368c108cbbdfa65639c03 (stored
> > in .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack)
> > is corrupt
> > error: failed to run prune
> > MaxRSS:40025196 Time:2531.52
> 
> Looking at that output, my _guess_ is that we somehow end up with a
> bogus delta_size value and write out a truncated entry. But I couldn't
> reproduce the issue with smaller test cases.

Could it be a race condition? I ran the whole test suite with this
fallback code activated (by forcing the delta size limit down to two
bytes) and nothing failed. Perhaps something like this on top? I'll
need to see if helgrind could spot anything...

-- 8< --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7f3546057d..1eccbc91d2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2054,7 +2054,16 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
}
 
SET_DELTA(trg_entry, src_entry);
+
+   /*
+* Locking is needed because SET_DELTA_SIZE() internally calls
+* oe_prepare_delta_size_array() which may touch other entries,
+* which are updated in parallel.
+*/
+   cache_lock();
SET_DELTA_SIZE(trg_entry, delta_size);
+   cache_unlock();
+
trg->depth = src->depth + 1;
 
return 1;
diff --git a/pack-objects.c b/pack-objects.c
index 89699cf15c..9e52af32c3 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -185,13 +185,16 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 void oe_prepare_delta_size_array(struct packing_data *pack)
 {
uint32_t i;
+   unsigned long *delta_size;
 
/*
 * nr_alloc, not nr_objects to align with realloc() strategy in
 * packlist_alloc()
 */
-   ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
+   ALLOC_ARRAY(delta_size, pack->nr_alloc);
 
for (i = 0; i < pack->nr_objects; i++)
-   pack->delta_size[i] = pack->objects[i].delta_size_;
+   delta_size[i] = pack->objects[i].delta_size_;
+
+   pack->delta_size = delta_size;
 }
-- 8< --

Elijah, another thing you could try (if you have plenty of time to
spare) is to run this repack with a single thread. It's going to take
forever though...

--
Duy


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Jeff King
On Thu, Jul 19, 2018 at 09:42:00AM -0700, Elijah Newren wrote:

> Thanks for the quick turnaround.  Unfortunately, I have some bad news.
> With this patch, I get the following:
> 
> $ /usr/bin/time -f 'MaxRSS:%M Time:%e' git gc --aggressive
> Enumerating objects: 4460703, done.
> Counting objects: 100% (4460703/4460703), done.
> Delta compression using up to 40 threads.
> Compressing objects: 100% (3807140/3807140), done.
> Writing objects: 100% (4460703/4460703), done.
> Total 4460703 (delta 2831383), reused 1587071 (delta 0)
> error: failed to unpack compressed delta at offset 183854150 from
> .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack
> fatal: packed object 20ce811e53dabbb8ef9368c108cbbdfa65639c03 (stored
> in .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack)
> is corrupt
> error: failed to run prune
> MaxRSS:40025196 Time:2531.52

Looking at that output, my _guess_ is that we somehow end up with a
bogus delta_size value and write out a truncated entry. But I couldn't
reproduce the issue with smaller test cases.

Maybe instrumenting Git with the patch below and running:

  GIT_TRACE_DELTA=$PWD/delta.out git gc --aggressive
  perl -lne '
/(get|put) ([0-9a-f]{40}) (\d+)/ or next;
if ($1 eq "put") {
$h{$2} and print "double put: $2 = ($h{$2}, $3)";
$h{$2} = $3;
} else {
$h{$2} == $3 or print "mismatched get: $2 = ($h{$2}, $3)"
}
  ' delta_size)
@@ -335,11 +337,23 @@ static inline unsigned long oe_delta_size(struct packing_data *pack,
return pack->delta_size[e - pack->objects];
 }
 
+static inline unsigned long oe_delta_size(struct packing_data *pack,
+ const struct object_entry *e)
+{
+   unsigned long ret = oe_delta_size_1(pack, e);
+   trace_printf_key(&trace_delta, "get %s %lu",
+oid_to_hex(&e->idx.oid), ret);
+   return ret;
+}
+
 void oe_prepare_delta_size_array(struct packing_data *pack);
 static inline void oe_set_delta_size(struct packing_data *pack,
 struct object_entry *e,
 unsigned long size)
 {
+   trace_printf_key(&trace_delta, "put %s %lu",
+oid_to_hex(&e->idx.oid), size);
+
e->delta_size_ = size;
if (!pack->delta_size && e->delta_size_ == size)
return;


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Jeff King
On Thu, Jul 19, 2018 at 05:16:42PM +0200, Duy Nguyen wrote:

> On Thu, Jul 19, 2018 at 07:57:37AM +0200, Duy Nguyen wrote:
> > I thought 2M was generous but I was obviously wrong. I'd like to see
> > the biggest delta size in this pack and whether it's still reasonable
> > to increase OE_DELTA_SIZE_BITS. But if it's very close to 4GB limit
> > then we could just store 64-bit delta size in a separate array.
> 
> I realize now that these two options don't have to be mutually
> exclusive and I should provide a good fallback in terms of performance
> anyway.

Yeah, this is what I had assumed was happening in the original code. :)

> +void oe_prepare_delta_size_array(struct packing_data *pack)
> +{
> + int i;
> +
> + /*
> +  * nr_alloc, not nr_objects to align with realloc() strategy in
> +  * packlist_alloc()
> +  */
> + ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
> +
> + for (i = 0; i < pack->nr_objects; i++)
> + pack->delta_size[i] = pack->objects[i].delta_size_;
> +}

This iterator should probably be a uint32_t, to match nr_objects.

The rest of the patch looks OK to me. From Elijah's failure report there
clearly is _some_ problem where we end up with a truncated write of a
delta. But I don't see it from code inspection, nor could I reproduce
it.

-Peff


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Elijah Newren
On Thu, Jul 19, 2018 at 8:16 AM, Duy Nguyen  wrote:
> On Thu, Jul 19, 2018 at 07:57:37AM +0200, Duy Nguyen wrote:
>> I thought 2M was generous but I was obviously wrong. I'd like to see
>> the biggest delta size in this pack and whether it's still reasonable
>> to increase OE_DELTA_SIZE_BITS. But if it's very close to 4GB limit
>> then we could just store 64-bit delta size in a separate array.
>
> I realize now that these two options don't have to be mutually
> exclusive and I should provide a good fallback in terms of performance
> anyway.
>
> So Elijah, could you try this patch and see if performance and pack
> size go back to 2.17.0? I can write a proper commit message and test
> support later if it proves a good improvement.

Thanks for the quick turnaround.  Unfortunately, I have some bad news.
With this patch, I get the following:

$ /usr/bin/time -f 'MaxRSS:%M Time:%e' git gc --aggressive
Enumerating objects: 4460703, done.
Counting objects: 100% (4460703/4460703), done.
Delta compression using up to 40 threads.
Compressing objects: 100% (3807140/3807140), done.
Writing objects: 100% (4460703/4460703), done.
Total 4460703 (delta 2831383), reused 1587071 (delta 0)
error: failed to unpack compressed delta at offset 183854150 from
.git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack
fatal: packed object 20ce811e53dabbb8ef9368c108cbbdfa65639c03 (stored
in .git/objects/pack/pack-30d4f0b0e5a03dc91a658a0586f4e74cdf4a94d6.pack)
is corrupt
error: failed to run prune
MaxRSS:40025196 Time:2531.52


It looks like this comes after it deleted the old packs, so running
git fsck afterwards prints lots of errors.  Just in case they're
helpful at all, there are 26 "error: failed to unpack compressed delta
at offset..." messages, 26 "error: cannot unpack ... from ... at
offset ..." messages, 17 "error: failed to read delta base object..."
messages, 7 "broken link from ... to ..." messages, four missing
blobs, and three missing trees.

(This was run on a copy of the repo, so no harm done; I can still use
the original.)


> -- 8< --
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index ebc8cefb53..7f3546057d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -2023,10 +2023,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
> delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);
> if (!delta_buf)
> return 0;
> -   if (delta_size >= (1U << OE_DELTA_SIZE_BITS)) {
> -   free(delta_buf);
> -   return 0;
> -   }
>
> if (DELTA(trg_entry)) {
> /* Prefer only shallower same-sized deltas. */
> diff --git a/pack-objects.c b/pack-objects.c
> index 92708522e7..191ed45faf 100644
> --- a/pack-objects.c
> +++ b/pack-objects.c
> @@ -160,6 +160,8 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
>
> if (!pdata->in_pack_by_idx)
> REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
> +   if (pdata->delta_size)
> +   REALLOC_ARRAY(pdata->delta_size, pdata->nr_alloc);
> }
>
> new_entry = pdata->objects + pdata->nr_objects++;
> @@ -177,3 +179,17 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
>
> return new_entry;
>  }
> +
> +void oe_prepare_delta_size_array(struct packing_data *pack)
> +{
> +   int i;
> +
> +   /*
> +* nr_alloc, not nr_objects to align with realloc() strategy in
> +* packlist_alloc()
> +*/
> +   ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
> +
> +   for (i = 0; i < pack->nr_objects; i++)
> +   pack->delta_size[i] = pack->objects[i].delta_size_;
> +}
> diff --git a/pack-objects.h b/pack-objects.h
> index edf74dabdd..978500e474 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -14,7 +14,7 @@
>   * above this limit. Don't lower it too much.
>   */
>  #define OE_SIZE_BITS   31
> -#define OE_DELTA_SIZE_BITS 20
> +#define OE_DELTA_SIZE_BITS 21
>
>  /*
>   * State flags for depth-first search used for analyzing delta cycles.
> @@ -93,7 +93,6 @@ struct object_entry {
>  * uses the same base as me
>  */
> unsigned delta_size_:OE_DELTA_SIZE_BITS; /* delta data size (uncompressed) */
> -   unsigned delta_size_valid:1;
> unsigned in_pack_idx:OE_IN_PACK_BITS;   /* already in pack */
> unsigned z_delta_size:OE_Z_DELTA_BITS;
> unsigned type_valid:1;
> @@ -130,6 +129,7 @@ struct packing_data {
> uint32_t index_size;
>
> unsigned int *in_pack_pos;
> +   unsigned long *delta_size;
>
> /*
>  * Only one of these can be non-NULL and they have different
> @@ -330,20 +330,27 @@ static inline void oe_set_size(struct packing_data *pack,
>  static inline unsigned long oe_delta_size(struct packing_data *pack,
> 

Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Duy Nguyen
On Thu, Jul 19, 2018 at 5:27 PM Elijah Newren  wrote:
>
> On Wed, Jul 18, 2018 at 10:41 PM, Duy Nguyen  wrote:
> > On Thu, Jul 19, 2018 at 12:51 AM Elijah Newren  wrote:
> >>
> >> I had a user report some poor behavior of 'git gc --aggressive' on a
> >> certain repo (which I sadly cannot share).  Turns out that on this
> >> repo, this operation takes about 60% longer and produces a pack
> >> roughly twice the expected size.
> >
> > The intention was to make life better for weaker machines but
> > definitely should not slow down beefier ones, so yes this is
> > definitely a regression.
>
> Not sure if it matters, but the original discovery was on a laptop.
> Probably 4 cores and 16 GB RAM.  I duplicated on my workstation (8
> cores, 24 GB RAM).  However, since the original patch was memory
> related and I noticed the repacking using all available memory, I
> repeated the testcases while measuring memory but did it on a machine
> that wouldn't be memory limited.

That should be plentiful for a 5GB pack with default cache settings, I think.

> > Is it possible to share "verify-pack -v " output of the
> > pack produced by 2.17.0 and 2.18.0? The only sensitive info there is
> > sha-1, which you can replace with just "SHA-1" if you want. I'm more
> > interested in delta sizes and distribution.
>
> For the deltas, assuming the size-in-pack field (4th column) is the relevant 
> one
>
> Number of objects in repo (lines of verify-pack output): 4460755
> Number of deltas: 2831384
> Number of deltas greater than 1MB: 72
> Min: 14
> Median: 47
> Mean: 586
> 99.999 percentile: 11366280.6 (10.8 MB)
> Max: 141664210 (135.1 MB)
>
> If the size field (3rd column) is the relevant one, then the numbers
> change slightly to:
>
> Number of deltas greater than 1MB: 101
> Min: 4
> Median: 33
> Mean: 660
> 99.999 percentile: 12245551.7 (11.7 MB)

I think size is the right one, because even deltas are compressed
before being stored in the pack file (I probably should check the
code...). Anyway, this number is bigger, and it sounds to me like 16MB
as the delta size limit should cover the common case nicely. 32MB would
be on the safe side, but given the default cache size of 256MB, a
single delta 1/8 the size of the cache sounds like too much.

> Max: 144658342 (138.0 MB)

This one would trigger the patch I just sent, which should also handle
it nicely (I hope).

> I checked out the blob which had the biggest delta, as well as the
> blob it was a delta against.  One was a 280 MB file, the other a 278
> MB file.
-- 
Duy


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Elijah Newren
On Wed, Jul 18, 2018 at 10:41 PM, Duy Nguyen  wrote:
> On Thu, Jul 19, 2018 at 12:51 AM Elijah Newren  wrote:
>>
>> I had a user report some poor behavior of 'git gc --aggressive' on a
>> certain repo (which I sadly cannot share).  Turns out that on this
>> repo, this operation takes about 60% longer and produces a pack
>> roughly twice the expected size.
>
> The intention was to make life better for weaker machines but
> definitely should not slow down beefier ones, so yes this is
> definitely a regression.

Not sure if it matters, but the original discovery was on a laptop.
Probably 4 cores and 16 GB RAM.  I duplicated on my workstation (8
cores, 24 GB RAM).  However, since the original patch was memory
related and I noticed the repacking using all available memory, I
repeated the testcases while measuring memory but did it on a machine
that wouldn't be memory limited.

> Is it possible to share "verify-pack -v " output of the
> pack produced by 2.17.0 and 2.18.0? The only sensitive info there is
> sha-1, which you can replace with just "SHA-1" if you want. I'm more
> interested in delta sizes and distribution.

For the deltas, assuming the size-in-pack field (4th column) is the relevant one

Number of objects in repo (lines of verify-pack output): 4460755
Number of deltas: 2831384
Number of deltas greater than 1MB: 72
Min: 14
Median: 47
Mean: 586
99.999 percentile: 11366280.6 (10.8 MB)
Max: 141664210 (135.1 MB)

If the size field (3rd column) is the relevant one, then the numbers
change slightly to:

Number of deltas greater than 1MB: 101
Min: 4
Median: 33
Mean: 660
99.999 percentile: 12245551.7 (11.7 MB)
Max: 144658342 (138.0 MB)

I checked out the blob which had the biggest delta, as well as the
blob it was a delta against.  One was a 280 MB file, the other a 278
MB file.


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-19 Thread Duy Nguyen
On Thu, Jul 19, 2018 at 07:57:37AM +0200, Duy Nguyen wrote:
> I thought 2M was generous but I was obviously wrong. I'd like to see
> the biggest delta size in this pack and whether it's still reasonable
> to increase OE_DELTA_SIZE_BITS. But if it's very close to 4GB limit
> then we could just store 64-bit delta size in a separate array.

I realize now that these two options don't have to be mutually
exclusive and I should provide a good fallback in terms of performance
anyway.

So Elijah, could you try this patch and see if performance and pack
size go back to 2.17.0? I can write a proper commit message and test
support later if it proves a good improvement.

-- 8< --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ebc8cefb53..7f3546057d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2023,10 +2023,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);
if (!delta_buf)
return 0;
-   if (delta_size >= (1U << OE_DELTA_SIZE_BITS)) {
-   free(delta_buf);
-   return 0;
-   }
 
if (DELTA(trg_entry)) {
/* Prefer only shallower same-sized deltas. */
diff --git a/pack-objects.c b/pack-objects.c
index 92708522e7..191ed45faf 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -160,6 +160,8 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
if (!pdata->in_pack_by_idx)
REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
+   if (pdata->delta_size)
+   REALLOC_ARRAY(pdata->delta_size, pdata->nr_alloc);
}
 
new_entry = pdata->objects + pdata->nr_objects++;
@@ -177,3 +179,17 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
return new_entry;
 }
+
+void oe_prepare_delta_size_array(struct packing_data *pack)
+{
+   int i;
+
+   /*
+* nr_alloc, not nr_objects to align with realloc() strategy in
+* packlist_alloc()
+*/
+   ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
+
+   for (i = 0; i < pack->nr_objects; i++)
+   pack->delta_size[i] = pack->objects[i].delta_size_;
+}
diff --git a/pack-objects.h b/pack-objects.h
index edf74dabdd..978500e474 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -14,7 +14,7 @@
  * above this limit. Don't lower it too much.
  */
 #define OE_SIZE_BITS   31
-#define OE_DELTA_SIZE_BITS 20
+#define OE_DELTA_SIZE_BITS 21
 
 /*
  * State flags for depth-first search used for analyzing delta cycles.
@@ -93,7 +93,6 @@ struct object_entry {
 * uses the same base as me
 */
unsigned delta_size_:OE_DELTA_SIZE_BITS; /* delta data size (uncompressed) */
-   unsigned delta_size_valid:1;
unsigned in_pack_idx:OE_IN_PACK_BITS;   /* already in pack */
unsigned z_delta_size:OE_Z_DELTA_BITS;
unsigned type_valid:1;
@@ -130,6 +129,7 @@ struct packing_data {
uint32_t index_size;
 
unsigned int *in_pack_pos;
+   unsigned long *delta_size;
 
/*
 * Only one of these can be non-NULL and they have different
@@ -330,20 +330,27 @@ static inline void oe_set_size(struct packing_data *pack,
 static inline unsigned long oe_delta_size(struct packing_data *pack,
  const struct object_entry *e)
 {
-   if (e->delta_size_valid)
+   if (!pack->delta_size)
return e->delta_size_;
-   return oe_size(pack, e);
+   return pack->delta_size[e - pack->objects];
 }
 
+void oe_prepare_delta_size_array(struct packing_data *pack);
 static inline void oe_set_delta_size(struct packing_data *pack,
 struct object_entry *e,
 unsigned long size)
 {
e->delta_size_ = size;
-   e->delta_size_valid = e->delta_size_ == size;
-   if (!e->delta_size_valid && size != oe_size(pack, e))
-   BUG("this can only happen in check_object() "
-   "where delta size is the same as entry size");
+   if (!pack->delta_size && e->delta_size_ == size)
+   return;
+   /*
+* We have had at least one delta size exceeding OE_DELTA_SIZE_BITS
+* limit. delta_size_ will not be used anymore. All delta sizes are now
+* from the delta_size[] array.
+*/
+   if (!pack->delta_size)
+   oe_prepare_delta_size_array(pack);
+   pack->delta_size[e - pack->objects] = size;
 }
 
 #endif
-- 8< --


--
Duy


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-18 Thread Duy Nguyen
On Thu, Jul 19, 2018 at 7:44 AM Jeff King  wrote:
>
> On Wed, Jul 18, 2018 at 03:51:08PM -0700, Elijah Newren wrote:
>
> > I had a user report some poor behavior of 'git gc --aggressive' on a
> > certain repo (which I sadly cannot share).  Turns out that on this
> > repo, this operation takes about 60% longer and produces a pack
> > roughly twice the expected size.
> >
> > Naturally, bisecting takes a while but it points to this commit:
>
> Thanks for finding and bisecting.
>
> > To put some numbers behind this, I got on a very beefy box (40 cores,
> > 160GB RAM) and ran some comparisons:
> >
> > Version  Pack (MB)  MaxRSS(kB)  Time (s)
> > -------  ---------  ----------  --------
> >  2.17.0       5498    43513628   2494.85
> >  2.18.0      10531    40449596   4168.94
> >  fix-v1       5509    42509784   2480.74
> >  fix-v2       5509    41644104   2468.25
> >
> > where fix-v1 and fix-v2 are different patches on top of git-2.18.0
> > that I'll follow up to this email with.  The patches are just meant as
> > discussion starters.  I'll add signoffs and proper commit messages if
> > folks actually want those fixes, but I suspect they're just starting
> > points for discussion.
>
> Hmm. So what's the mechanism causing the issue here? Looking at the
> patch from 0aca34e826 (pack-objects: shrink delta_size field in struct
> object_entry, 2018-04-14), it should _mostly_ be about how we store data
> in memory, and should not impact the deltas we find.
>
> But reading the patch and looking at your v2, it seems clear that this
> hunk is the culprit:
>
> @@ -2006,10 +2008,14 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
> delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);
> if (!delta_buf)
> return 0;
> +   if (delta_size >= (1U << OE_DELTA_SIZE_BITS)) {
> +   free(delta_buf);
> +   return 0;
> +   }
>
> So we're punting on large deltas, and presumably your test case has some
> files that are large but delta well (and we miss those opportunities,
> which wastes further time in the delta search and creates a lousy pack).
>
> I'm not sure why we need this hunk, though. Eventually we'd hit
> SET_DELTA_SIZE(), and I thought the point is that it should handle large
> sizes with some kind of fallback (the commit message even mentions an
> "overflow bit"). Looking at oe_set_delta_size(), though, I don't think
> the fallback would really work here; we'd hit the BUG().

The fallback is for existing on-disk deltas where we can actually get
the size from. For new deltas we have to store the size somewhere in
memory and we can't because this field's size is reduced.

I thought 2M was generous but I was obviously wrong. I'd like to see
the biggest delta size in this pack and whether it's still reasonable
to increase OE_DELTA_SIZE_BITS. But if it's very close to 4GB limit
then we could just store 64-bit delta size in a separate array.
-- 
Duy


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-18 Thread Jeff King
On Thu, Jul 19, 2018 at 07:41:03AM +0200, Duy Nguyen wrote:

> On Thu, Jul 19, 2018 at 12:51 AM Elijah Newren  wrote:
> >
> > I had a user report some poor behavior of 'git gc --aggressive' on a
> > certain repo (which I sadly cannot share).  Turns out that on this
> > repo, this operation takes about 60% longer and produces a pack
> > roughly twice the expected size.
> 
> The intention was to make life better for weaker machines but
> definitely should not slow down beefier ones, so yes this is
> definitely a regression.
> 
> Is it possible to share "verify-pack -v " output of the
> pack produced by 2.17.0 and 2.18.0? The only sensitive info there is
> sha-1, which you can replace with just "SHA-1" if you want. I'm more
> interested in delta sizes and distribution.

Try this:

-- >8 --
#!/bin/sh

rm -rf repo

git init repo
cd repo

dd if=/dev/urandom of=one.rand bs=1M count=2
dd if=/dev/urandom of=two.rand bs=1M count=2
dd if=/dev/urandom of=big.rand bs=1M count=20

cat one.rand big.rand >file
git add file
git commit -m one

cat two.rand big.rand >file
git add file
git commit -m two

git repack -ad
git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)'
-- 8< --

Using git 2.17 for the repack results in a single delta (which should be
about 2MB, because it will say "delete one.rand and add two.rand" or
vice versa).

Using git 2.18, I get no delta at all.

-Peff


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-18 Thread Jeff King
On Wed, Jul 18, 2018 at 03:51:08PM -0700, Elijah Newren wrote:

> I had a user report some poor behavior of 'git gc --aggressive' on a
> certain repo (which I sadly cannot share).  Turns out that on this
> repo, this operation takes about 60% longer and produces a pack
> roughly twice the expected size.
> 
> Naturally, bisecting takes a while but it points to this commit:

Thanks for finding and bisecting.

> To put some numbers behind this, I got on a very beefy box (40 cores,
> 160GB RAM) and ran some comparisons:
> 
> Version  Pack (MB)  MaxRSS(kB)  Time (s)
> -------  ---------  ----------  --------
>  2.17.0       5498    43513628   2494.85
>  2.18.0      10531    40449596   4168.94
>  fix-v1       5509    42509784   2480.74
>  fix-v2       5509    41644104   2468.25
> 
> where fix-v1 and fix-v2 are different patches on top of git-2.18.0
> that I'll follow up to this email with.  The patches are just meant as
> discussion starters.  I'll add signoffs and proper commit messages if
> folks actually want those fixes, but I suspect they're just starting
> points for discussion.

Hmm. So what's the mechanism causing the issue here? Looking at the
patch from 0aca34e826 (pack-objects: shrink delta_size field in struct
object_entry, 2018-04-14), it should _mostly_ be about how we store data
in memory, and should not impact the deltas we find.

But reading the patch and looking at your v2, it seems clear that this
hunk is the culprit:

@@ -2006,10 +2008,14 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size);
if (!delta_buf)
return 0;
+   if (delta_size >= (1U << OE_DELTA_SIZE_BITS)) {
+   free(delta_buf);
+   return 0;
+   }

So we're punting on large deltas, and presumably your test case has some
files that are large but delta well (and we miss those opportunities,
which wastes further time in the delta search and creates a lousy pack).

I'm not sure why we need this hunk, though. Eventually we'd hit
SET_DELTA_SIZE(), and I thought the point is that it should handle large
sizes with some kind of fallback (the commit message even mentions an
"overflow bit"). Looking at oe_set_delta_size(), though, I don't think
the fallback would really work here; we'd hit the BUG().

At any rate, it should be easy to construct a test case.

-Peff


Re: 2.18.0 Regression: packing performance and effectiveness

2018-07-18 Thread Duy Nguyen
On Thu, Jul 19, 2018 at 12:51 AM Elijah Newren  wrote:
>
> I had a user report some poor behavior of 'git gc --aggressive' on a
> certain repo (which I sadly cannot share).  Turns out that on this
> repo, this operation takes about 60% longer and produces a pack
> roughly twice the expected size.

The intention was to make life better for weaker machines but
definitely should not slow down beefier ones, so yes this is
definitely a regression.

Is it possible to share "verify-pack -v " output of the
pack produced by 2.17.0 and 2.18.0? The only sensitive info there is
sha-1, which you can replace with just "SHA-1" if you want. I'm more
interested in delta sizes and distribution.
-- 
Duy


2.18.0 Regression: packing performance and effectiveness

2018-07-18 Thread Elijah Newren
I had a user report some poor behavior of 'git gc --aggressive' on a
certain repo (which I sadly cannot share).  Turns out that on this
repo, this operation takes about 60% longer and produces a pack
roughly twice the expected size.

Naturally, bisecting takes a while but it points to this commit:

0aca34e8269514ebb67676e0470a314615494ae8 is the first bad commit
commit 0aca34e8269514ebb67676e0470a314615494ae8
Author: Nguyễn Thái Ngọc Duy 
Date:   Sat Apr 14 17:35:11 2018 +0200

pack-objects: shrink delta_size field in struct object_entry

Allowing a delta size of 64 bits is crazy. Shrink this field down to
20 bits with one overflow bit.

If we find an existing delta larger than 1MB, we do not cache
delta_size at all and will get the value from oe_size(), potentially
from disk if it's larger than 4GB.

Note, since DELTA_SIZE() is used in try_delta() code, it must be
thread-safe. Luckily oe_size() does guarantee this so we it is
thread-safe.

Signed-off-by: Nguyễn Thái Ngọc Duy 
Signed-off-by: Junio C Hamano 

To put some numbers behind this, I got on a very beefy box (40 cores,
160GB RAM) and ran some comparisons:

Version  Pack (MB)  MaxRSS(kB)  Time (s)
-------  ---------  ----------  --------
 2.17.0       5498    43513628   2494.85
 2.18.0      10531    40449596   4168.94
 fix-v1       5509    42509784   2480.74
 fix-v2       5509    41644104   2468.25

where fix-v1 and fix-v2 are different patches on top of git-2.18.0
that I'll follow up to this email with.  The patches are just meant as
discussion starters.  I'll add signoffs and proper commit messages if
folks actually want those fixes, but I suspect they're just starting
points for discussion.