On Mon, Jan 08 2018, Jeff King jotted:
> On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:
>
>> I.e., what if we did something like this:
>>
>> diff --git a/sha1_name.c b/sha1_name.c
>> index 611c7d24dd..04c661ba85 100644
>> --- a/sha1_name.c
>> +++ b/sha1_name.c
>> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>>  	if (len == GIT_SHA1_HEXSZ || !len)
>>  		return GIT_SHA1_HEXSZ;
>>
>> +	/*
>> +	 * A default length of 10 implies a repository big enough that it's
>> +	 * getting expensive to double check the ambiguity of each object,
>> +	 * and the chance that any particular object of interest has a
>> +	 * collision is low.
>> +	 */
>> +	if (len >= 10)
>> +		return len;
>> +
>
> Oops, this really needs to terminate the string in addition to returning
> the length (without that it was printing all 40 characters in most
> cases). The correct patch is below, but it performs the same.
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..5921298a80 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  	if (len == GIT_SHA1_HEXSZ || !len)
>  		return GIT_SHA1_HEXSZ;
>
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10) {
> +		hex[len] = 0;
> +		return len;
> +	}
> +
>  	mad.init_len = len;
>  	mad.cur_len = len;
>  	mad.hex = hex;

That looks much more sensible, leaving aside other potential benefits of
MIDX.

Given the argument Linus made in e6c587c733 ("abbrev: auto size the
default abbreviation", 2016-09-30) maybe we should add a small integer
to the length for good measure, i.e. something like:

	if (len >= 10) {
		int extra = 2; /* or just 1? or maybe 0 ... */
		hex[len + extra] = 0;
		return len + extra;
	}
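
One caveat with adding "extra": assuming hex is the usual
GIT_SHA1_HEXSZ + 1 byte buffer, len + extra must not go past 40 (think
--abbrev=39 plus 2). A minimal sketch of a clamped variant is below.
Note that truncate_abbrev() and HEXSZ are hypothetical names made up
for illustration, not existing git code, and the sketch assumes hex
already holds the full NUL-terminated 40-character object name:

```c
#define HEXSZ 40 /* stand-in for GIT_SHA1_HEXSZ (assumption) */

/*
 * Hypothetical helper, not part of git: truncate a full hex object
 * name in place to len + extra characters, without ever writing past
 * the end of a HEXSZ + 1 byte buffer. Returns the resulting length.
 */
static int truncate_abbrev(char *hex, int len, int extra)
{
	int out = len + extra;

	if (out > HEXSZ)
		out = HEXSZ;
	hex[out] = '\0';
	return out;
}
```

With that, the branch above would become something like
"return truncate_abbrev(hex, len, extra);".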

I tried running this on several large repos:

	git log --pretty=format:%h --abbrev=7 |
	perl -nE 'chomp; say length' | sort | uniq -c | sort -nr

That forces something like the disambiguation we had before Linus's
patch. On e.g. David Turner's 2015-04-03-1M-git.git test repo the
result (count, then abbreviation length) is:

	952858  7
	 44541  8
	  2861  9
	   168 10
	    17 11
	     2 12

And the default abbreviation picks 12. I haven't yet found a case where
it's wrong, but if we wanted to be extra safe we could just add a hex
digit or two to the abbreviated SHA-1.
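
To put rough numbers on that, here is a small sketch that tallies the
histogram above. The counts are copied from the test repo output; the
function name longer_than() is made up for illustration. It only
counts how many of the listed commits needed more than a given number
of hex characters at --abbrev=7, nothing more:

```c
#include <stddef.h>

/* counts[i] = commits whose abbreviation came out i + 7 hex chars long */
static const unsigned long counts[] = { 952858, 44541, 2861, 168, 17, 2 };
#define NR_COUNTS (sizeof(counts) / sizeof(counts[0]))

/* Number of commits that needed more than "len" hex characters. */
static unsigned long longer_than(int len)
{
	unsigned long n = 0;
	size_t i;

	for (i = 0; i < NR_COUNTS; i++)
		if ((int)i + 7 > len)
			n += counts[i];
	return n;
}
```

By that tally 19 of the 1000447 commits needed more than 10
characters, 2 needed more than 11, and none needed more than 12.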