Re: [PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-05 Thread Jeff King
On Mon, Oct 02, 2017 at 10:56:51AM -0400, Derrick Stolee wrote:

> +static void find_abbrev_len_for_pack(struct packed_git *p,
> +  struct min_abbrev_data *mad)
> +{
> + int match = 0;
> + uint32_t num, last, first = 0;
> + struct object_id oid;
> +
> + open_pack_index(p);
> + num = p->num_objects;
> + last = num;
> + while (first < last) {
> + uint32_t mid = (first + last) / 2;

This can overflow if we are looking past 2^31. Probably not very common
(things seem to get pretty painful at hundreds of millions of objects).
But it can be written as:

  uint32_t mid = lo + (hi - lo) / 2;

Sadly, this overflow case seems to be present in many of our binary
searches.

> + const unsigned char *current;
> + int cmp;
> +
> + current = nth_packed_object_sha1(p, mid);

nth_packed_object_sha1() has to do some minor computation to come up
with the correct pointer. The normal packfile lookup precomputes the
initial offset and stride outside the loop. I have no idea if that
micro-optimization is actually noticeable, but I thought I'd mention it
as a possible avenue for investigation since you're obviously interested
in squeezing out performance. ;)

-Peff


Re: [PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-03 Thread Derrick Stolee

On 10/3/2017 11:55 AM, Stefan Beller wrote:

@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
 return 0;
  }

+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);

coverity complained here with
 Calling "open_pack_index" without checking return value
 (as is done elsewhere 13 out of 15 times).
Good catch! This same line is repeated in unique_in_pack() in this same 
file, so if this is worth fixing then we should probably fix it there, too.

I think the easiest way out is just a

 if (open_pack_index(p))
 die(_("Cannot open existing pack idx file for '%s'"), p);

or is there another good approach?

You probably intended to have p->pack_name in the die();

Using `cat *.c | grep -A 2 "if (open_pack_index("` and `cat */*.c | grep 
-A 2 "if (open_pack_index("` I see a few places that return error codes 
or quietly fail. The cases that use die() are inside builtin/ so I don't 
think die() is the right choice here.


Since find_abbrev_len_for_pack() is intended to extend the abbreviation 
length when necessary, I think a silent return is best here:


    if (open_pack_index(p))
        return;

Thanks,
-Stolee


Re: [PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-03 Thread Stefan Beller
> @@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id 
> *oid, void *cb_data)
> return 0;
>  }
>
> +static void find_abbrev_len_for_pack(struct packed_git *p,
> +struct min_abbrev_data *mad)
> +{
> +   int match = 0;
> +   uint32_t num, last, first = 0;
> +   struct object_id oid;
> +
> +   open_pack_index(p);

coverity complained here with
Calling "open_pack_index" without checking return value
(as is done elsewhere 13 out of 15 times).

I think the easiest way out is just a

if (open_pack_index(p))
die(_("Cannot open existing pack idx file for '%s'"), p);

or is there another good approach?


[PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-02 Thread Derrick Stolee
Minimize OID comparisons during disambiguatiosn of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.05 s | 0.05 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.06 s | 0.06 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.14 s | 0.05 s | -64.3% |
| Linux |24 | 5606645 |  0 | 0.94 s | 0.88 s | - 6.4% |
| Linux |23 | 5283204 | 323441 | 0.86 s | 0.80 s | - 7.0% |
| VSTS  | 1 | 4355923 |  0 | 0.11 s | 0.05 s | -54.5% |
| VSTS  |32 | 4355923 |  0 | 0.95 s | 0.95 s |   0.0% |
| VSTS  |31 | 4276829 |  79094 | 1.82 s | 1.93 s | + 6.0% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.62 s
  New Time: 4.09 s
 Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.07 s | 0.07 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.12 s | 0.12 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.09 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.13 s | 0.04 s | -69.2% |
| Linux |24 | 5606645 |  0 | 0.89 s | 0.85 s | - 4.5% |
| Linux |23 | 5283204 | 323441 | 0.83 s | 0.78 s | - 6.0% |
| VSTS  | 1 | 4355923 |  0 | 0.11 s | 0.04 s | -63.6% |
| VSTS  |32 | 4355923 |  0 | 0.99 s | 0.98 s | - 1.0% |
| VSTS  |31 | 4276829 |  79094 | 1.02 s | 1.04 s | + 2.0% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.08 s
  New Time: 2.72 s
 Rel %: -11.8

Signed-off-by: Derrick Stolee 
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..54b3a37da 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = (first + last) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(, p, first);
+   extend_abbrev_len(,