Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-11 Thread Derrick Stolee

On 10/10/2017 9:30 AM, Jeff King wrote:

On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote:


On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King  writes:


OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

git pack-objects .git/objects/pack/pack num_objects)
return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?

I prefer to fix the problem in all code clones when they cause review
friction, so I'll send a fifth commit showing just the diff for these
packfile issues in sha1_name.c. See patch below.

Ah, that answers my earlier question. Junio mean unique_in_pack(). And
yeah, I think it suffers from the same problem.


Should open_pack_index() return a non-zero status if the packfile is empty?
Or, is there a meaningful reason to have empty packfiles?

I can't think of a compelling reason to have an empty packfile. But nor
do I think we should consider them an error when we can handle them
quietly (and returning non-zero status would cause Git to complain on
many operations in a repository that has such a file).

-Peff


Thanks for the comments. I found some typos in my commit messages, too.

I plan to send a (hopefully) final version tomorrow (Thursday). It will 
include:


* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages

If there are other issues, then please let me know.

Thanks,
-Stolee


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Jeff King
On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote:

> On 10/10/2017 8:56 AM, Junio C Hamano wrote:
> > Jeff King  writes:
> > 
> > > OK, I think that makes more sense. But note the p->num_objects thing I
> > > mentioned. If I do:
> > > 
> > >git pack-objects .git/objects/pack/pack  > > 
> > > then I have a pack with zero objects, which I think we'd similarly want
> > > to return early from. I.e., I think we need:
> > > 
> > >if (p->num_objects)
> > >   return;
> > > 
> > > Technically that also covers open_pack_index() failure, too, but that's
> > > a subtlety I don't think we should rely on.
> > True.  I notice that the early part of the two functions look almost
> > identical.  Do we need error condition handling for the other one,
> > too?
> 
> I prefer to fix the problem in all code clones when they cause review
> friction, so I'll send a fifth commit showing just the diff for these
> packfile issues in sha1_name.c. See patch below.

Ah, that answers my earlier question. Junio mean unique_in_pack(). And
yeah, I think it suffers from the same problem.

> Should open_pack_index() return a non-zero status if the packfile is empty?
> Or, is there a meaningful reason to have empty packfiles?

I can't think of a compelling reason to have an empty packfile. But nor
do I think we should consider them an error when we can handle them
quietly (and returning non-zero status would cause Git to complain on
many operations in a repository that has such a file).

-Peff


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Derrick Stolee

On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King  writes:


OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

   git pack-objects .git/objects/pack/pack num_objects)
return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?


I prefer to fix the problem in all code clones when they cause review 
friction, so I'll send a fifth commit showing just the diff for these 
packfile issues in sha1_name.c. See patch below.


Should open_pack_index() return a non-zero status if the packfile is 
empty? Or, is there a meaningful reason to have empty packfiles?


Thanks,
-Stolee


diff --git a/sha1_name.c b/sha1_name.c
index 49ba67955..9f8a33e82 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
    uint32_t num, last, i, first = 0;
    const struct object_id *current = NULL;

-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
    num = p->num_objects;
    last = num;
    while (first < last) {
@@ -513,7 +515,9 @@ static void find_abbrev_len_for_pack(struct 
packed_git *p,

    uint32_t num, last, first = 0;
    struct object_id oid;

-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
    num = p->num_objects;
    last = num;
    while (first < last) {



Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Jeff King
On Tue, Oct 10, 2017 at 09:56:38PM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > OK, I think that makes more sense. But note the p->num_objects thing I
> > mentioned. If I do:
> >
> >   git pack-objects .git/objects/pack/pack  >
> > then I have a pack with zero objects, which I think we'd similarly want
> > to return early from. I.e., I think we need:
> >
> >   if (p->num_objects)
> > return;
> >
> > Technically that also covers open_pack_index() failure, too, but that's
> > a subtlety I don't think we should rely on.
> 
> True.  I notice that the early part of the two functions look almost
> identical.  Do we need error condition handling for the other one,
> too?

I'm not sure which two you mean. Do you mean find_pack_entry_one() in
packfile.c as the other one? If so, I think it is fine in the
zero-object case, because it does not do the "this is the sha1 at the
position where it _would_ be found" trick, which is what causes us to
potentially dereference nonsense.

-Peff


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Junio C Hamano
Jeff King  writes:

> OK, I think that makes more sense. But note the p->num_objects thing I
> mentioned. If I do:
>
>   git pack-objects .git/objects/pack/pack 
> then I have a pack with zero objects, which I think we'd similarly want
> to return early from. I.e., I think we need:
>
>   if (p->num_objects)
>   return;
>
> Technically that also covers open_pack_index() failure, too, but that's
> a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Jeff King
On Tue, Oct 10, 2017 at 08:16:27AM -0400, Derrick Stolee wrote:

> > > + mad->init_len = 0;
> > > + if (!match) {
> > > + nth_packed_object_oid(, p, first);
> > > + extend_abbrev_len(, mad);
> > If we have zero objects in the pack, what would nth_packed_object_oid()
> > be returning here?
> > 
> > So I actually think we do want an early return, not just when
> > open_packed_index() fails, but also when p->num_objects is zero.
> > 
> > -Peff
> 
> Sorry about this. I caught this while I was writing my cover letter and
> amended my last commit to include the following:
> 
>     if (open_pack_index(p))
>         return;
> 
> After I amended the commit, I forgot to 'format-patch' again. I can send a
> diff between the commits after review has calmed.

OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

  git pack-objects .git/objects/pack/pack num_objects)
return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

-Peff


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Derrick Stolee

On 10/9/2017 9:49 AM, Jeff King wrote:

On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote:


@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
  }
  
+static void find_abbrev_len_for_pack(struct packed_git *p,

+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
[...]

Your cover letter lists:

   * Silently skip packfiles that fail to open with open_pack_index()

as a change from the previous version. But this looks the same as the
last round. I think this _does_ end up skipping such packfiles because
p->num_objects will be zero. Is it worth having a comment to that
effect (or even just an early return) to make it clear that the
situation is intentional?

Although...


+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(, p, first);
+   extend_abbrev_len(, mad);

If we have zero objects in the pack, what would nth_packed_object_oid()
be returning here?

So I actually think we do want an early return, not just when
open_packed_index() fails, but also when p->num_objects is zero.

-Peff


Sorry about this. I caught this while I was writing my cover letter and 
amended my last commit to include the following:


    if (open_pack_index(p))
        return;

After I amended the commit, I forgot to 'format-patch' again. I can send 
a diff between the commits after review has calmed.


Thanks,
-Stolee


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-09 Thread Jeff King
On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote:

> @@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id 
> *oid, void *cb_data)
>   return 0;
>  }
>  
> +static void find_abbrev_len_for_pack(struct packed_git *p,
> +  struct min_abbrev_data *mad)
> +{
> + int match = 0;
> + uint32_t num, last, first = 0;
> + struct object_id oid;
> +
> + open_pack_index(p);
> + num = p->num_objects;
> + last = num;
> + while (first < last) {
> [...]

Your cover letter lists:

  * Silently skip packfiles that fail to open with open_pack_index()

as a change from the previous version. But this looks the same as the
last round. I think this _does_ end up skipping such packfiles because
p->num_objects will be zero. Is it worth having a comment to that
effect (or even just an early return) to make it clear that the
situation is intentional?

Although...

> + /*
> +  * first is now the position in the packfile where we would insert
> +  * mad->hash if it does not exist (or the position of mad->hash if
> +  * it does exist). Hence, we consider a maximum of three objects
> +  * nearby for the abbreviation length.
> +  */
> + mad->init_len = 0;
> + if (!match) {
> + nth_packed_object_oid(, p, first);
> + extend_abbrev_len(, mad);

If we have zero objects in the pack, what would nth_packed_object_oid()
be returning here?

So I actually think we do want an early return, not just when
open_packed_index() fails, but also when p->num_objects is zero.

-Peff


[PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-08 Thread Derrick Stolee
Minimize OID comparisons during disambiguatiosn of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test 4211.6
from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD | Rel%  |
|---||--|--|---|
|  1|  0 |  41.27 s |  38.93 s | -4.8% |
| 24|  0 |  98.04 s |  91.35 s | -5.7% |
| 23| 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee 
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..49ba67955 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(, p, first);
+   extend_abbrev_len(, mad);
+   } else if (first < num - 1) {
+   nth_packed_object_oid(, p, first + 1);
+   extend_abbrev_len(, mad);
+   }
+   if (first > 0) {
+   nth_packed_object_oid(, p, first - 1);
+   extend_abbrev_len(, mad);
+   }
+   mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+   struct packed_git *p;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next)
+   find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
struct disambiguate_state ds;
@@ -536,19 +596,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
 
-   if (init_object_disambiguation(hex, len, ) < 0)
-   return -1;
-
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;
+   mad.hash = sha1;
+
+   find_abbrev_len_packed();
+
+   if (init_object_disambiguation(hex, mad.cur_len, ) < 0)
+   return -1;
 
ds.fn = extend_abbrev_len;
ds.always_call_fn = 1;
ds.cb_data = (void*)
 
find_short_object_filename();
-   find_short_packed_object();
(void)finish_object_disambiguation(, _ret);
 
hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty