[PATCH 20/20] abbrev: add a core.validateAbbrev setting

2018-06-08 Thread Ævar Arnfjörð Bjarmason
Operations that need to abbreviate a lot of SHA-1s such as 'git log
--oneline' experience degraded performance when traversing a lot of
packs. See [1] and [2] for some relevant performance numbers.

One way to address this is something like the MIDX to make looking up
the SHA-1s cheaper.

This change adds an alternate method of achieving some of the same
ends (but possibly not all, see [3] and replies to the original thread
at [1]).

Instead of trying really hard to find an unambiguous SHA-1 we can with
core.validateAbbrev=false, and preferably combined with the new
support for relative core.abbrev values (such as +4) unconditionally
print a short SHA-1 without doing any disambiguation check. I.e. it
allows for picking a trade-off between performance, and the odds that
future or remote (or current and local) short SHA-1 will be ambiguous.

With the included performance test my git.git repacked with with `git
repack -A -d --max-pack-size=X` gives the following results against
git.git itself with X=64M:

TestHEAD~ HEAD


0014.1: git log --oneline --raw --parents   2.53(2.48+0.05)   
2.20(2.14+0.05) -13.0%

With one big pack -7.6%, and with 16M packs -23.8%.

1. https://public-inbox.org/git/20180107181459.222909-1-dsto...@microsoft.com/
2. https://public-inbox.org/git/20180607140338.32440-1-dsto...@microsoft.com/
3. https://public-inbox.org/git/87lgbsz61p@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 Documentation/config.txt| 43 ++
 cache.h |  1 +
 config.c|  7 +
 environment.c   |  1 +
 sha1-name.c |  4 +++
 t/perf/p0014-abbrev.sh  | 13 
 t/t1512-rev-parse-disambiguation.sh | 47 +
 7 files changed, 116 insertions(+)
 create mode 100755 t/perf/p0014-abbrev.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index abf07be7b6..df31d1351f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -925,6 +925,49 @@ means to add or subtract N characters from the SHA-1 that 
Git would
 otherwise print, this allows for producing more future-proof SHA-1s
 for use within a given project, while adjusting the value for the
 current approximate number of objects.
++
+This is especially useful in combination with the
+`core.validateAbbrev` setting, or to get more future-proof hashes to
+reference in the future in a repository whose number of objects is
+expected to grow.
+
+core.validateAbbrev::
+   If set to false (true by default) don't do any validation when
+   printing abbreviated object names to see if they're really
+   unique. This makes printing objects more performant at the
+   cost of potentially printing object names that aren't unique
+   within the current repository.
++
+When printing abbreviated object names Git needs to look through the
+local object store. This is an `O(log N)` operation assuming all the
+objects are in a single pack file, but `X * O(log N)` given `X` pack
+files, which can get expensive on some larger repositories.
++
+This setting changes that to `O(1)`, but with the trade-off that
+depending on the value of `core.abbrev` we may be printing abbreviated
+hashes that collide. Too see how likely this is, try running:
++
+---
+git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | 
sort | uniq -c | sort -nr
+---
++
+This shows how many commits were found at each abbreviation length. On
+linux.git in June 2018 this shows a bit more than 750,000 commits,
+with just 4 needing 11 characters to be fully abbreviated, and the
+default heuristic picks a length of 12.
++
+Even without `core.validateAbbrev=false` the results abbreviation
+already a bit of a probability game. They're guaranteed at the moment
+of generation, but as more objects are added, ambiguities may be
+introduced. Likewise, what's unambiguous for you may not be for
+somebody else you're communicating with, if they have their own clone.
++
+Therefore the default of `core.validateAbbrev=true` may not save you
+in practice if you're sharing the SHA-1 or noting it now to use after
+a `git fetch`. You may be better off setting `core.abbrev` to
+e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
+that with `core.validateAbbrev=false` to get a reasonable trade-off
+between safety and performance.
 
 add.ignoreErrors::
 add.ignore-errors (deprecated)::
diff --git a/cache.h b/cache.h
index 0fb4211804..6dc5af9482 100644
--- a/cache.h
+++ b/cache.h
@@ -773,6 +773,7 @@ extern int quote_path_

Re: [PATCH 20/20] abbrev: add a core.validateAbbrev setting

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:

> Instead of trying really hard to find an unambiguous SHA-1 we can with
> core.validateAbbrev=false, and preferably combined with the new
> support for relative core.abbrev values (such as +4) unconditionally
> print a short SHA-1 without doing any disambiguation check. I.e. it

This first paragraph read weirdly the first time. On the second attempt
I knew how to structure it and got it right. It might be easier to read
if the part about core.appreb=+4 were in a separate second sentence.

That last "it" is "the combination of these two configs", right?

> allows for picking a trade-off between performance, and the odds that
> future or remote (or current and local) short SHA-1 will be ambiguous.

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index abf07be7b6..df31d1351f 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -925,6 +925,49 @@ means to add or subtract N characters from the SHA-1 
> that Git would
>  otherwise print, this allows for producing more future-proof SHA-1s
>  for use within a given project, while adjusting the value for the
>  current approximate number of objects.
> ++
> +This is especially useful in combination with the
> +`core.validateAbbrev` setting, or to get more future-proof hashes to
> +reference in the future in a repository whose number of objects is
> +expected to grow.

Maybe s/validateAbbrev/validateAbbrev = false/?

> +
> +core.validateAbbrev::
> +   If set to false (true by default) don't do any validation when
> +   printing abbreviated object names to see if they're really
> +   unique. This makes printing objects more performant at the
> +   cost of potentially printing object names that aren't unique
> +   within the current repository.

Good. I understand why I'd want to use it, and why not.

> ++
> +When printing abbreviated object names Git needs to look through the
> +local object store. This is an `O(log N)` operation assuming all the
> +objects are in a single pack file, but `X * O(log N)` given `X` pack
> +files, which can get expensive on some larger repositories.

This might be very close to too much information.

> ++
> +This setting changes that to `O(1)`, but with the trade-off that
> +depending on the value of `core.abbrev` we may be printing abbreviated
> +hashes that collide. Too see how likely this is, try running:
> ++
> +---
> +git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | 
> sort | uniq -c | sort -nr
> +---
> ++
> +This shows how many commits were found at each abbreviation length. On
> +linux.git in June 2018 this shows a bit more than 750,000 commits,
> +with just 4 needing 11 characters to be fully abbreviated, and the
> +default heuristic picks a length of 12.

These last few paragraphs seem like too much to me.

> ++
> +Even without `core.validateAbbrev=false` the results abbreviation
> +already a bit of a probability game. They're guaranteed at the moment
> +of generation, but as more objects are added, ambiguities may be
> +introduced. Likewise, what's unambiguous for you may not be for
> +somebody else you're communicating with, if they have their own clone.

This seems more useful.

> ++
> +Therefore the default of `core.validateAbbrev=true` may not save you
> +in practice if you're sharing the SHA-1 or noting it now to use after
> +a `git fetch`. You may be better off setting `core.abbrev` to
> +e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
> +that with `core.validateAbbrev=false` to get a reasonable trade-off
> +between safety and performance.

Makes sense. As before, I'd suggest s/SHA-1/object ID/.

I do wonder if this documentation could be a bit less verbose without
sacrificing too much correctness and clarity. No brilliant suggestions
on how to do that, sorry.

Martin


Re: [PATCH 20/20] abbrev: add a core.validateAbbrev setting

2018-06-12 Thread Junio C Hamano
Martin Ågren  writes:

>> +This is especially useful in combination with the
>> +`core.validateAbbrev` setting, or to get more future-proof hashes to
>> +reference in the future in a repository whose number of objects is
>> +expected to grow.
>
> Maybe s/validateAbbrev/validateAbbrev = false/?

Perhaps, but even with =true it would equally be useful, as the
point of this setting is to future-proofing.

>> ++
>> +When printing abbreviated object names Git needs to look through the
>> +local object store. This is an `O(log N)` operation assuming all the
>> +objects are in a single pack file, but `X * O(log N)` given `X` pack
>> +files, which can get expensive on some larger repositories.
>
> This might be very close to too much information.

Not very close, but just too much information without crucial detail
(i.e. log N times what constant???).  I'd drop it.

>> ++
>> +This setting changes that to `O(1)`, but with the trade-off that
>> +depending on the value of `core.abbrev` we may be printing abbreviated
>> +hashes that collide.

It may not be technically wrong to say "This changes it to O(1)",
but I think to most people it is more understandable to say "This
changes it to zero" ;-)

Setting this variable to false makes Git not to validate the
abbreviation it produces uniquely identifies an object among the
current set of objects in the repository.  Depending on the
value of `core.abbrev`, we may be printing abbreviated hashes
that collide.  Note that setting this variable to true (or
leaving it unset) does not guarantee that an abbreviated hash
will never collide with future objects in the repository (you
need to set core.abbrevLength to a larger value for that).

would be sufficient to clarify, and also nuke the following
overly-detailed paragraph.

>> ... Too see how likely this is, try running:
>> ++
>> +---
>> +git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' 
>> | sort | uniq -c | sort -nr
>> +---
>> ++
>> +This shows how many commits were found at each abbreviation length. On
>> +linux.git in June 2018 this shows a bit more than 750,000 commits,
>> +with just 4 needing 11 characters to be fully abbreviated, and the
>> +default heuristic picks a length of 12.
>
> These last few paragraphs seem like too much to me.

Yeah, it goes to too low level a detail, especially with the "try
running" part.  I'd remove everything but "On linux.git in June ..."
if I were writing it from the above.

>> ++
>> +Even without `core.validateAbbrev=false` the results abbreviation
>> +already a bit of a probability game. They're guaranteed at the moment
>> +of generation, but as more objects are added, ambiguities may be
>> +introduced. Likewise, what's unambiguous for you may not be for
>> +somebody else you're communicating with, if they have their own clone.
>
> This seems more useful.

Yes, but still overly verbose; I think rolling it in the single
paragraph description like I showed above would be sufficient.

>> ++
>> +Therefore the default of `core.validateAbbrev=true` may not save you
>> +in practice if you're sharing the SHA-1 or noting it now to use after
>> +a `git fetch`. You may be better off setting `core.abbrev` to
>> +e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
>> +that with `core.validateAbbrev=false` to get a reasonable trade-off
>> +between safety and performance.
>
> Makes sense. As before, I'd suggest s/SHA-1/object ID/.

Likewise.  If we were to keep it, then s/object ID/object name/.