Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On 10/10/2017 9:30 AM, Jeff King wrote: On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote: On 10/10/2017 8:56 AM, Junio C Hamano wrote: Jeff Kingwrites: OK, I think that makes more sense. But note the p->num_objects thing I mentioned. If I do: git pack-objects .git/objects/pack/pack num_objects) return; Technically that also covers open_pack_index() failure, too, but that's a subtlety I don't think we should rely on. True. I notice that the early part of the two functions look almost identical. Do we need error condition handling for the other one, too? I prefer to fix the problem in all code clones when they cause review friction, so I'll send a fifth commit showing just the diff for these packfile issues in sha1_name.c. See patch below. Ah, that answers my earlier question. Junio mean unique_in_pack(). And yeah, I think it suffers from the same problem. Should open_pack_index() return a non-zero status if the packfile is empty? Or, is there a meaningful reason to have empty packfiles? I can't think of a compelling reason to have an empty packfile. But nor do I think we should consider them an error when we can handle them quietly (and returning non-zero status would cause Git to complain on many operations in a repository that has such a file). -Peff Thanks for the comments. I found some typos in my commit messages, too. I plan to send a (hopefully) final version tomorrow (Thursday). It will include: * Make 'pos' unsigned in get_hex_char_from_oid() * Check response from open_pack_index() * Small typos in commit messages If there are other issues, then please let me know. Thanks, -Stolee
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote: > On 10/10/2017 8:56 AM, Junio C Hamano wrote: > > Jeff Kingwrites: > > > > > OK, I think that makes more sense. But note the p->num_objects thing I > > > mentioned. If I do: > > > > > >git pack-objects .git/objects/pack/pack > > > > > then I have a pack with zero objects, which I think we'd similarly want > > > to return early from. I.e., I think we need: > > > > > >if (p->num_objects) > > > return; > > > > > > Technically that also covers open_pack_index() failure, too, but that's > > > a subtlety I don't think we should rely on. > > True. I notice that the early part of the two functions look almost > > identical. Do we need error condition handling for the other one, > > too? > > I prefer to fix the problem in all code clones when they cause review > friction, so I'll send a fifth commit showing just the diff for these > packfile issues in sha1_name.c. See patch below. Ah, that answers my earlier question. Junio mean unique_in_pack(). And yeah, I think it suffers from the same problem. > Should open_pack_index() return a non-zero status if the packfile is empty? > Or, is there a meaningful reason to have empty packfiles? I can't think of a compelling reason to have an empty packfile. But nor do I think we should consider them an error when we can handle them quietly (and returning non-zero status would cause Git to complain on many operations in a repository that has such a file). -Peff
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On 10/10/2017 8:56 AM, Junio C Hamano wrote: Jeff Kingwrites: OK, I think that makes more sense. But note the p->num_objects thing I mentioned. If I do: git pack-objects .git/objects/pack/pack num_objects) return; Technically that also covers open_pack_index() failure, too, but that's a subtlety I don't think we should rely on. True. I notice that the early part of the two functions look almost identical. Do we need error condition handling for the other one, too? I prefer to fix the problem in all code clones when they cause review friction, so I'll send a fifth commit showing just the diff for these packfile issues in sha1_name.c. See patch below. Should open_pack_index() return a non-zero status if the packfile is empty? Or, is there a meaningful reason to have empty packfiles? Thanks, -Stolee diff --git a/sha1_name.c b/sha1_name.c index 49ba67955..9f8a33e82 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p, uint32_t num, last, i, first = 0; const struct object_id *current = NULL; - open_pack_index(p); + if (open_pack_index(p) || !p->num_objects) + return; + num = p->num_objects; last = num; while (first < last) { @@ -513,7 +515,9 @@ static void find_abbrev_len_for_pack(struct packed_git *p, uint32_t num, last, first = 0; struct object_id oid; - open_pack_index(p); + if (open_pack_index(p) || !p->num_objects) + return; + num = p->num_objects; last = num; while (first < last) {
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On Tue, Oct 10, 2017 at 09:56:38PM +0900, Junio C Hamano wrote: > Jeff Kingwrites: > > > OK, I think that makes more sense. But note the p->num_objects thing I > > mentioned. If I do: > > > > git pack-objects .git/objects/pack/pack > > > then I have a pack with zero objects, which I think we'd similarly want > > to return early from. I.e., I think we need: > > > > if (p->num_objects) > > return; > > > > Technically that also covers open_pack_index() failure, too, but that's > > a subtlety I don't think we should rely on. > > True. I notice that the early part of the two functions look almost > identical. Do we need error condition handling for the other one, > too? I'm not sure which two you mean. Do you mean find_pack_entry_one() in packfile.c as the other one? If so, I think it is fine in the zero-object case, because it does not do the "this is the sha1 at the position where it _would_ be found" trick, which is what causes us to potentially dereference nonsense. -Peff
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
Jeff Kingwrites: > OK, I think that makes more sense. But note the p->num_objects thing I > mentioned. If I do: > > git pack-objects .git/objects/pack/pack > then I have a pack with zero objects, which I think we'd similarly want > to return early from. I.e., I think we need: > > if (p->num_objects) > return; > > Technically that also covers open_pack_index() failure, too, but that's > a subtlety I don't think we should rely on. True. I notice that the early part of the two functions look almost identical. Do we need error condition handling for the other one, too?
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On Tue, Oct 10, 2017 at 08:16:27AM -0400, Derrick Stolee wrote: > > > + mad->init_len = 0; > > > + if (!match) { > > > + nth_packed_object_oid(, p, first); > > > + extend_abbrev_len(, mad); > > If we have zero objects in the pack, what would nth_packed_object_oid() > > be returning here? > > > > So I actually think we do want an early return, not just when > > open_packed_index() fails, but also when p->num_objects is zero. > > > > -Peff > > Sorry about this. I caught this while I was writing my cover letter and > amended my last commit to include the following: > > if (open_pack_index(p)) > return; > > After I amended the commit, I forgot to 'format-patch' again. I can send a > diff between the commits after review has calmed. OK, I think that makes more sense. But note the p->num_objects thing I mentioned. If I do: git pack-objects .git/objects/pack/pack num_objects) return; Technically that also covers open_pack_index() failure, too, but that's a subtlety I don't think we should rely on. -Peff
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On 10/9/2017 9:49 AM, Jeff King wrote: On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote: @@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data) return 0; } +static void find_abbrev_len_for_pack(struct packed_git *p, +struct min_abbrev_data *mad) +{ + int match = 0; + uint32_t num, last, first = 0; + struct object_id oid; + + open_pack_index(p); + num = p->num_objects; + last = num; + while (first < last) { [...] Your cover letter lists: * Silently skip packfiles that fail to open with open_pack_index() as a change from the previous version. But this looks the same as the last round. I think this _does_ end up skipping such packfiles because p->num_objects will be zero. Is it worth having a comment to that effect (or even just an early return) to make it clear that the situation is intentional? Although... + /* +* first is now the position in the packfile where we would insert +* mad->hash if it does not exist (or the position of mad->hash if +* it does exist). Hence, we consider a maximum of three objects +* nearby for the abbreviation length. +*/ + mad->init_len = 0; + if (!match) { + nth_packed_object_oid(, p, first); + extend_abbrev_len(, mad); If we have zero objects in the pack, what would nth_packed_object_oid() be returning here? So I actually think we do want an early return, not just when open_packed_index() fails, but also when p->num_objects is zero. -Peff Sorry about this. I caught this while I was writing my cover letter and amended my last commit to include the following: if (open_pack_index(p)) return; After I amended the commit, I forgot to 'format-patch' again. I can send a diff between the commits after review has calmed. Thanks, -Stolee
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote: > @@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id > *oid, void *cb_data) > return 0; > } > > +static void find_abbrev_len_for_pack(struct packed_git *p, > + struct min_abbrev_data *mad) > +{ > + int match = 0; > + uint32_t num, last, first = 0; > + struct object_id oid; > + > + open_pack_index(p); > + num = p->num_objects; > + last = num; > + while (first < last) { > [...] Your cover letter lists: * Silently skip packfiles that fail to open with open_pack_index() as a change from the previous version. But this looks the same as the last round. I think this _does_ end up skipping such packfiles because p->num_objects will be zero. Is it worth having a comment to that effect (or even just an early return) to make it clear that the situation is intentional? Although... > + /* > + * first is now the position in the packfile where we would insert > + * mad->hash if it does not exist (or the position of mad->hash if > + * it does exist). Hence, we consider a maximum of three objects > + * nearby for the abbreviation length. > + */ > + mad->init_len = 0; > + if (!match) { > + nth_packed_object_oid(, p, first); > + extend_abbrev_len(, mad); If we have zero objects in the pack, what would nth_packed_object_oid() be returning here? So I actually think we do want an early return, not just when open_packed_index() fails, but also when p->num_objects is zero. -Peff
[PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
Minimize OID comparisons during disambiguatiosn of packfile OIDs. Teach git to use binary search with the full OID to find the object's position (or insertion position, if not present) in the pack-index. The object before and immediately after (or the one at the insertion position) give the maximum common prefix. No subsequent linear search is required. Take care of which two to inspect, in case the object id exists in the packfile. If the input to find_unique_abbrev_r() is a partial prefix, then the OID used for the binary search is padded with zeroes so the object will not exist in the repo (with high probability) and the same logic applies. This commit completes a series of three changes to OID abbreviation code, and the overall change can be seen using standard commands for large repos. Below we report performance statistics for perf test 4211.6 from p4211-line-log.sh using three copies of the Linux repo: | Packs | Loose | HEAD~3 | HEAD | Rel% | |---||--|--|---| | 1| 0 | 41.27 s | 38.93 s | -4.8% | | 24| 0 | 98.04 s | 91.35 s | -5.7% | | 23| 323952 | 117.78 s | 112.18 s | -4.8% | Signed-off-by: Derrick Stolee--- sha1_name.c | 70 + 1 file changed, 66 insertions(+), 4 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index 5081aeb71..49ba67955 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -478,6 +478,7 @@ struct min_abbrev_data { unsigned int init_len; unsigned int cur_len; char *hex; + const unsigned char *hash; }; static inline char get_hex_char_from_oid(const struct object_id *oid, @@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data) return 0; } +static void find_abbrev_len_for_pack(struct packed_git *p, +struct min_abbrev_data *mad) +{ + int match = 0; + uint32_t num, last, first = 0; + struct object_id oid; + + open_pack_index(p); + num = p->num_objects; + last = num; + while (first < last) { + uint32_t mid = first + (last - first) / 2; + const unsigned char *current; + int cmp; + + current = nth_packed_object_sha1(p, mid); + cmp = hashcmp(mad->hash, current); + if (!cmp) { + match = 1; + first = mid; + break; + } + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + /* +* first is now the position in the packfile where we would insert +* mad->hash if it does not exist (or the position of mad->hash if +* it does exist). Hence, we consider a maximum of three objects +* nearby for the abbreviation length. +*/ + mad->init_len = 0; + if (!match) { + nth_packed_object_oid(, p, first); + extend_abbrev_len(, mad); + } else if (first < num - 1) { + nth_packed_object_oid(, p, first + 1); + extend_abbrev_len(, mad); + } + if (first > 0) { + nth_packed_object_oid(, p, first - 1); + extend_abbrev_len(, mad); + } + mad->init_len = mad->cur_len; +} + +static void find_abbrev_len_packed(struct min_abbrev_data *mad) +{ + struct packed_git *p; + + prepare_packed_git(); + for (p = packed_git; p; p = p->next) + find_abbrev_len_for_pack(p, mad); +} + int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) { struct disambiguate_state ds; @@ -536,19 +596,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) if (len == GIT_SHA1_HEXSZ || !len) return GIT_SHA1_HEXSZ; - if (init_object_disambiguation(hex, len, ) < 0) - return -1; - mad.init_len = len; mad.cur_len = len; mad.hex = hex; + mad.hash = sha1; + + find_abbrev_len_packed(); + + if (init_object_disambiguation(hex, mad.cur_len, ) < 0) + return -1; ds.fn = extend_abbrev_len; ds.always_call_fn = 1; ds.cb_data = (void*) find_short_object_filename(); - find_short_packed_object(); (void)finish_object_disambiguation(, _ret); hex[mad.cur_len] = 0; -- 2.14.1.538.g56ec8fc98.dirty