Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-07 Thread Junio C Hamano
Jeff King  writes:

> On Sat, Oct 07, 2017 at 03:12:08PM -0400, Derrick Stolee wrote:
>
>> In my local copy, I added a test to p4211-line-log.sh that runs "git log
>> --raw -r" and tested it on three copies of the Linux repo. In order, they
>> have 1 packfile (0 loose), 24 packfiles (0 loose), and 23 packfiles
>> (~324,000 loose).
>> 
>> 4211.6: git log --raw -r  43.34(42.62+0.65)   40.47(40.16+0.27)  -6.6%
>> 4211.6: git log --raw -r  88.77(86.54+2.12)   82.44(81.87+0.52)  -7.1%
>> 4211.6: git log --raw -r 108.86(103.97+4.81) 103.92(100.63+3.19) -4.5%
>> 
>> We have moderate performance gains for this command, despite the command
>> doing many more things than just checking abbreviations.
>
> Yeah, while it's less exciting than seeing the 90% numbers for a
> micro-benchmark, I think this represents real-world gains (and 5-7% is
> nothing to sneeze at).

Yes!  I would even say 5-7% is much better than "nothing to sneeze
at".  We do prefer workload closer to the real-world usage over
micro benchmarks, and consider changes that gain by a few percent as
real improvements.

> You might also try adding "--format=%h" or --oneline to your invocation,
> which would compute abbreviations for each commit (making your workload
> more abbrev-heavy and possibly showing off the difference more).

Again, agreed, and I would not consider it would be inflating the
benchmark artificially in favor of the change.  "log --oneline" is
not something people use rarely---I'd think it would be quite a
normal thing to do.

> I also think "-r" isn't doing anything. Recursive diffs are the default
> for the "log" porcelain (even for --raw).

That's my fault writing "-r" ;-) Together with your "log --oneline"
suggestion,

git log --oneline --raw

would be a reasonable thing to measure.

Thanks, both.


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-07 Thread Jeff King
On Sat, Oct 07, 2017 at 03:12:08PM -0400, Derrick Stolee wrote:

> In my local copy, I added a test to p4211-line-log.sh that runs "git log
> --raw -r" and tested it on three copies of the Linux repo. In order, they
> have 1 packfile (0 loose), 24 packfiles (0 loose), and 23 packfiles
> (~324,000 loose).
> 
> 4211.6: git log --raw -r  43.34(42.62+0.65)   40.47(40.16+0.27)  -6.6%
> 4211.6: git log --raw -r  88.77(86.54+2.12)   82.44(81.87+0.52)  -7.1%
> 4211.6: git log --raw -r 108.86(103.97+4.81) 103.92(100.63+3.19) -4.5%
> 
> We have moderate performance gains for this command, despite the command
> doing many more things than just checking abbreviations.

Yeah, while it's less exciting than seeing the 90% numbers for a
micro-benchmark, I think this represents real-world gains (and 5-7% is
nothing to sneeze at).

You might also try adding "--format=%h" or --oneline to your invocation,
which would compute abbreviations for each commit (making your workload
more abbrev-heavy and possibly showing off the difference more).

I also think "-r" isn't doing anything. Recursive diffs are the default
for the "log" porcelain (even for --raw).

> I plan to re-roll my patch on Monday including the following feedback items:
> 
> * Remove test-list-objects and test-abbrev in favor of a new "git log"
>   performance test.
> 
> * Fix binary search overflow error.
> 
> * Check response from open_pack_index(p) in find_abbrev_len_for_pack().
>   I plan to return without failure on non-zero result, which results in
>   no failure on a bad pack and the abbreviation length will be the
>   minimum required among all valid packs. (Thanks Stefan!)

That all sounds reasonable to me.

> - Teach 'cat-file' to --batch-check='%(objectsize:short)'. (Peff already
>   included a patch, perhaps that could be reviewed separately.)

I think I'll let it lie in the list archive for now unless somebody has
a real use case for it (though I'm tempted to add it purely for
completionism, since it's so easy).

-Peff


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-07 Thread Derrick Stolee

On 10/6/2017 10:11 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 08:39:42AM -0400, Derrick Stolee wrote:

I'll run some perf numbers for these commands you recommend, and also see if
I can replicate some of the pain points that triggered this change using the
Linux repo.


Thanks!

-Peff



In my local copy, I added a test to p4211-line-log.sh that runs "git log 
--raw -r" and tested it on three copies of the Linux repo. In order, 
they have 1 packfile (0 loose), 24 packfiles (0 loose), and 23 packfiles 
(~324,000 loose).


4211.6: git log --raw -r  43.34(42.62+0.65)   40.47(40.16+0.27)  -6.6%
4211.6: git log --raw -r  88.77(86.54+2.12)   82.44(81.87+0.52)  -7.1%
4211.6: git log --raw -r 108.86(103.97+4.81) 103.92(100.63+3.19) -4.5%

We have moderate performance gains for this command, despite the command 
doing many more things than just checking abbreviations.


I plan to re-roll my patch on Monday including the following feedback items:

* Remove test-list-objects and test-abbrev in favor of a new "git log"
  performance test.

* Fix binary search overflow error.

* Check response from open_pack_index(p) in find_abbrev_len_for_pack().
  I plan to return without failure on non-zero result, which results in
  no failure on a bad pack and the abbreviation length will be the
  minimum required among all valid packs. (Thanks Stefan!)

I made note of a few things, but will not include them in my re-roll. 
I'll revisit them later if they are valuable:


- nth_packed_object_sha1() could be simplified in
  find_abbrev_len_for_pack().

- Teach 'cat-file' to --batch-check='%(objectsize:short)'. (Peff already
  included a patch, perhaps that could be reviewed separately.)

- Ramsay caught my lack of "static" in test-list-objects.c, but that
  file will be removed in the next patch. I'll make sure to use "static"
  in the future.

I'm not re-rolling immediately to allow for some extra review over the 
weekend, if anyone is so inclined.


Thanks,
-Stolee


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-06 Thread Jeff King
On Thu, Oct 05, 2017 at 08:39:42AM -0400, Derrick Stolee wrote:

> > Actually, I'd just as soon see timings for "git log --format=%h" or "git
> > log --raw", as opposed to patches 1 and 2.
> > 
> > You won't see a 90% speedup there, but you will see the actual
> > improvement that real-world users are going to experience, which is way
> > more important, IMHO.
> > 
> Thanks for thinking hard about this.
> 
> For some real-user context: Some engineers using Git for the Windows repo
> were seeing extremely slow commands, such as 'fetch' or 'commit', and when
> we took a trace we saw most of the time spinning in this abbreviation code.

That's surprising to me. I'd expect fetch to generate up to two
abbreviations per fetched ref (for the status table). And for commit to
generate only one for the summary. But maybe it was just so painfully
slow on your repo that it was noticeable.

If there's a case where we're generating a bunch of abbreviated oids,
that might be worth looking into. Do you happen to have a stack trace of
one of the slow cases showing who is calling into the function?

> Our workaround so far has been to set core.abbrev=40.

I actually wondered about this. With the new auto-scaling, we'd
typically jump straight to 12 or 13 hex-chars on a large repo, and that
would be sufficient. So we'd really only need one iteration of the loop.

> I'll run some perf numbers for these commands you recommend, and also see if
> I can replicate some of the pain points that triggered this change using the
> Linux repo.

Thanks!

-Peff


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Derrick Stolee



On 10/5/2017 6:00 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 06:48:10PM +0900, Junio C Hamano wrote:


Jeff King  writes:


This is weirdly specific. Can we accomplish the same thing with existing
tools?

E.g., could:

   git cat-file --batch-all-objects --batch-check='%(objectname)' |
   shuffle |
   head -n 100

do the same thing?

I know that "shuffle" isn't available everywhere, but I'd much rather
see us fill in portability gaps in a general way, rather than
introducing one-shot C code that needs to be maintained (and you
wouldn't _think_ that t/helper programs need much maintenance, but try
perusing "git log t/helper" output; they have to adapt to the same
tree-wide changes as the rest of the code).

I was thinking about this a bit more, and came to the conclusion
that "sort -R" and "shuf" are wrong tools to use.  We would want to
measure with something close to real world workload.  for example,
letting

git rev-list --all --objects

produce the listof objects in traversal order (i.e. this is very
similar to the order in which "git log -p" needs to access the
objects) and chomping at the number of sample objects you need in
your test would give you such a list.

Actually, I'd just as soon see timings for "git log --format=%h" or "git
log --raw", as opposed to patches 1 and 2.

You won't see a 90% speedup there, but you will see the actual
improvement that real-world users are going to experience, which is way
more important, IMHO.

-Peff

Thanks for thinking hard about this.

For some real-user context: Some engineers using Git for the Windows 
repo were seeing extremely slow commands, such as 'fetch' or 'commit', 
and when we took a trace we saw most of the time spinning in this 
abbreviation code. Our workaround so far has been to set core.abbrev=40.


I'll run some perf numbers for these commands you recommend, and also 
see if I can replicate some of the pain points that triggered this 
change using the Linux repo.


Thanks,
-Stolee


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Junio C Hamano
On Thu, Oct 5, 2017 at 7:00 PM, Jeff King  wrote:
>
>> Jeff King  writes:
>>
> Actually, I'd just as soon see timings for "git log --format=%h" or "git
> log --raw", as opposed to patches 1 and 2.
>
> You won't see a 90% speedup there, but you will see the actual
> improvement that real-world users are going to experience, which is way
> more important, IMHO.

Yup, "log --raw -r" would really exercise the abbreviation machinery
without spending time on the diff machinery, and it would be a good
test.


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Jeff King
On Thu, Oct 05, 2017 at 06:48:10PM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > This is weirdly specific. Can we accomplish the same thing with existing
> > tools?
> >
> > E.g., could:
> >
> >   git cat-file --batch-all-objects --batch-check='%(objectname)' |
> >   shuffle |
> >   head -n 100
> >
> > do the same thing?
> >
> > I know that "shuffle" isn't available everywhere, but I'd much rather
> > see us fill in portability gaps in a general way, rather than
> > introducing one-shot C code that needs to be maintained (and you
> > wouldn't _think_ that t/helper programs need much maintenance, but try
> > perusing "git log t/helper" output; they have to adapt to the same
> > tree-wide changes as the rest of the code).
> 
> I was thinking about this a bit more, and came to the conclusion
> that "sort -R" and "shuf" are wrong tools to use.  We would want to
> measure with something close to real world workload.  for example,
> letting
> 
>   git rev-list --all --objects
> 
> produce the listof objects in traversal order (i.e. this is very
> similar to the order in which "git log -p" needs to access the
> objects) and chomping at the number of sample objects you need in
> your test would give you such a list.

Actually, I'd just as soon see timings for "git log --format=%h" or "git
log --raw", as opposed to patches 1 and 2.

You won't see a 90% speedup there, but you will see the actual
improvement that real-world users are going to experience, which is way
more important, IMHO.

-Peff


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Junio C Hamano
Jeff King  writes:

> This is weirdly specific. Can we accomplish the same thing with existing
> tools?
>
> E.g., could:
>
>   git cat-file --batch-all-objects --batch-check='%(objectname)' |
>   shuffle |
>   head -n 100
>
> do the same thing?
>
> I know that "shuffle" isn't available everywhere, but I'd much rather
> see us fill in portability gaps in a general way, rather than
> introducing one-shot C code that needs to be maintained (and you
> wouldn't _think_ that t/helper programs need much maintenance, but try
> perusing "git log t/helper" output; they have to adapt to the same
> tree-wide changes as the rest of the code).

I was thinking about this a bit more, and came to the conclusion
that "sort -R" and "shuf" are wrong tools to use.  We would want to
measure with something close to real world workload.  for example,
letting

git rev-list --all --objects

produce the listof objects in traversal order (i.e. this is very
similar to the order in which "git log -p" needs to access the
objects) and chomping at the number of sample objects you need in
your test would give you such a list.


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Jeff King
On Mon, Sep 25, 2017 at 05:54:48AM -0400, Derrick Stolee wrote:

> Create test-list-objects helper program to output a random sample of
> OIDs present in the repo.
> 
> If no command line arguments are provided, all OIDs are output.

This is weirdly specific. Can we accomplish the same thing with existing
tools?

E.g., could:

  git cat-file --batch-all-objects --batch-check='%(objectname)' |
  shuffle |
  head -n 100

do the same thing?

I know that "shuffle" isn't available everywhere, but I'd much rather
see us fill in portability gaps in a general way, rather than
introducing one-shot C code that needs to be maintained (and you
wouldn't _think_ that t/helper programs need much maintenance, but try
perusing "git log t/helper" output; they have to adapt to the same
tree-wide changes as the rest of the code).

-Peff


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-09-26 Thread Junio C Hamano
Derrick Stolee  writes:

> diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
> new file mode 100644
> index 0..83b1250fe
> --- /dev/null
> +++ b/t/helper/test-list-objects.c
> @@ -0,0 +1,85 @@
> +#include "cache.h"
> +#include "packfile.h"
> +#include 

If you include "cache.h", like the ordinary "git" program, you
should not have to include any system headers like stdio.h
yourself.  The whole point of "git-compat-util.h must be the first
to be included (indirectly including it via cache.h and friends is
also OK)" rule is to make sure that system headers are included at
the right place in the include sequence (e.g. defining feature
macros before including certain headers, etc.).

> +int cmd_main(int ac, const char **av)
> +{
> + unsigned int hash_delt = 0xFDB97531;
> + unsigned int hash_base = 0x01020304;
> +...
> + const int u_per_oid = sizeof(struct object_id) / sizeof(unsigned int);

Because the above will not work as you expect, unless your uint is
32-bit, you would probably want to make hash_delt and hash_base
explicitly uint32_t, I think.

Alternatively, because you assume that uint is at least 8-hex-digit
wide in the output loop, you can keep "unsigned int" above, but
change the divisor from sizeof(unsigned int) to a hardcoded 4.

Either would fix the issue.

> + c.total = 0;
> + if (ac > 1)
> + n = atoi(av[1]);

Otherwise n will stay 0 as initialized.  Not checking the result of
atoi() is probably permissible, as this is merely a test helper and
we can assume that the caller will not do anything silly, like
passing "-3" in av[1].  But it certainly would be a good discipline
to make sure av[1] is a reasonable number.

I am afraid that I may be misreading the code, but the following
code seems to require that the caller must know roughly how many
objects there are (so that a large-enough value can be passed to
av[1] to force c.select_mod to 1) when it wants to show everything,
as the "missing" loop will run 0 times showing nothing, and
"listing" case will divide by zero.

"Not passing anything will show everything" the proposed log message
promises is a better design than what the code actually seems to be
doing.

> +
> + if (ac > 2 && !strcmp(av[2], "--missing")) {
> + while (c.total++ < n) {

When n==0, this does nothing.  I am not sure what the desired
behaviour should be for "When no count is given, we show everything"
in this codepath, though.

Also, doesn't "test-list-objects 5 --missing" look very wrong?
Shouldn't it be more like "test-list-objects --missing 5"?

> + for (i = 0; i < u_per_oid; i++) {
> + printf("%08x", hash_base);
> + hash_base += hash_delt;
> + }
> + printf("\n");
> + }
> + } else {
> + setup_git_directory();
> +
> + for_each_loose_object(count_loose, , 0);
> + for_each_packed_object(count_packed, , 0);
> +
> + c.select_mod = 1 + c.total / n;

When n==0, this would divide by zero.
Perhaps

c.select_mod = n ? (1 + c.total / n) : 1;

or something to keep the promise of showing everything?

> + for_each_loose_object(output_loose, , 0);
> + for_each_packed_object(output_packed, , 0);
> + }
> +
> + exit(0);
> +}

Thanks.


[PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-09-25 Thread Derrick Stolee
Create test-list-objects helper program to output a random sample of
OIDs present in the repo.

If no command line arguments are provided, all OIDs are output.

The first command line argument specifies how many samples to output.
Samples are collected across the entire set of available OIDs to avoid
clustering and therefore quite uniformly distributed.

If a second command line argument "--missing" is given, then a list of
OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee 
---
 Makefile |  1 +
 t/helper/.gitignore  |  1 +
 t/helper/test-list-objects.c | 85 
 3 files changed, 87 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index f2bb7f2f6..af94b655a 100644
--- a/Makefile
+++ b/Makefile
@@ -647,6 +647,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..9dd1c9f3c 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp
diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0..83b1250fe
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,85 @@
+#include "cache.h"
+#include "packfile.h"
+#include 
+
+struct count {
+   int total;
+   int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+int count_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+void output(const struct object_id *oid,
+   struct count *c)
+{
+   if (!(c->total % c->select_mod))
+   printf("%s\n", oid_to_hex(oid));
+   c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+const char *path,
+void *data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int output_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void* data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+   unsigned int hash_delt = 0xFDB97531;
+   unsigned int hash_base = 0x01020304;
+   int i, n = 0;
+   struct count c;
+   const int u_per_oid = sizeof(struct object_id) / sizeof(unsigned int);
+
+   c.total = 0;
+   if (ac > 1)
+   n = atoi(av[1]);
+
+   if (ac > 2 && !strcmp(av[2], "--missing")) {
+   while (c.total++ < n) {
+   for (i = 0; i < u_per_oid; i++) {
+   printf("%08x", hash_base);
+   hash_base += hash_delt;
+   }
+   printf("\n");
+   }
+   } else {
+   setup_git_directory();
+
+   for_each_loose_object(count_loose, , 0);
+   for_each_packed_object(count_packed, , 0);
+
+   c.select_mod = 1 + c.total / n;
+
+   for_each_loose_object(output_loose, , 0);
+   for_each_packed_object(output_packed, , 0);
+   }
+
+   exit(0);
+}
-- 
2.14.1.538.g56ec8fc98.dirty