On Tue, May 01 2018, Derrick Stolee wrote:

> On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, May 01 2018, Derrick Stolee wrote:
>>
>>> On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> Since we show the commit data in the output, it's nicely aligned
>>>> once we sort by object type. The decision to show tags before
>>>> commits is pretty arbitrary, but it's much less likely that we'll
>>>> display a tag, so if there is one it makes sense to show it first.
>>>
>>> Here's a non-arbitrary reason: the object types are ordered
>>> topologically (ignoring self-references):
>>>
>>> tag -> commit, tree, blob
>>> commit -> tree
>>> tree -> blob
>>
>> Thanks. I'll add a patch with that comment to v2.
>>
>>>> @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid,
>>>>  		ds.fn = NULL;
>>>>  		advise(_("The candidates are:"));
>>>> -		for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
>>>> +		for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
>>>> +		QSORT(collect.oid, collect.nr, sort_ambiguous);
>>>
>>> I was wondering how the old code sorted by SHA even when the
>>> ambiguous objects were loaded from different sources (multiple
>>> pack-files, loose objects). Turns out that for_each_abbrev() does
>>> its own sort after collecting the SHAs and then calls the given
>>> function pointer only once per distinct object. This avoids multiple
>>> instances of the same object, which may appear multiple times across
>>> pack-files.
>>>
>>> I only ask because now we are doing two sorts. I wonder if it would
>>> be more elegant to provide your sorting algorithm to
>>> for_each_abbrev() and let it call show_ambiguous_object as before.
>>>
>>> Another question is if we should use this sort generally for all
>>> calls to for_each_abbrev(). The only other case I see is in
>>> builtin/rev-parse.c.
>>
>> When preparing v2 I realized how confusing this was, so I'd added
>> this to the commit message of my WIP re-roll, which should explain
>> it:
>>
>>     A note on the implementation: I started out with something much
>>     simpler which just replaced oid_array_sort() in sha1-array.c with
>>     a custom sort function before calling
>>     oid_array_for_each_unique(). But then I dumbly noticed that it
>>     doesn't work, because the output function was tangled up with the
>>     code added in fad6b9e590 ("for_each_abbrev: drop duplicate
>>     objects", 2016-09-26) to ensure we don't display duplicate
>>     objects.
>>
>>     That's why we're doing two passes here: first we need to sort the
>>     list and de-duplicate the objects, then sort them in our custom
>>     order, and finally output them without re-sorting them. I suppose
>>     we could also make oid_array_for_each_unique() maintain a hashmap
>>     of emitted objects, but that would increase its memory profile
>>     and wouldn't be worth the complexity for this one-off use-case;
>>     oid_array_for_each_unique() is used in many other places.
>
> How would sorting in our custom order before de-duplicating fail the
> de-duplication? We will still pair identical OIDs as consecutive
> elements, and oid_array_for_each_unique() only cares about consecutive
> elements having distinct OIDs, not lex-ordered OIDs.
Because there's no de-duplication without the array first being sorted
in oidcmp() order, which oid_array_for_each_unique() checks for and
re-sorts if !array->sorted. I.e. its de-duplication is just a state
machine where it won't call the callback if the currently processed
element has the same SHA1 as the last one.

> Perhaps the noise is because we rely on oid_array_sort() to mark the
> array as sorted inside oid_array_for_each_unique(), but that could be
> remedied by calling our QSORT() inside for_each_abbrev() and marking
> the array as sorted before calling oid_array_for_each_unique().

As noted above this won't work, because the function inherently relies
on the array being sorted to be able to de-duplicate. Doing this will
yield duplicate entries.
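
To make that concrete, here's a simplified sketch of what
oid_array_for_each_unique() does, paraphrased from memory rather than
quoted from sha1-array.c; the exact names and the error-return handling
are approximations:

	int oid_array_for_each_unique(struct oid_array *array,
				      for_each_oid_fn fn, void *data)
	{
		int i;

		/* any caller-supplied pre-sort is thrown away here
		 * unless ->sorted is already set */
		if (!array->sorted)
			oid_array_sort(array);	/* oidcmp() order */

		for (i = 0; i < array->nr; i++) {
			int ret;

			/* the "state machine": skip an element that's
			 * identical to its immediate predecessor */
			if (i > 0 && !oidcmp(array->oid + i, array->oid + i - 1))
				continue;
			ret = fn(array->oid + i, data);
			if (ret)
				return ret;
		}
		return 0;
	}

The duplicate check only ever compares neighbouring elements, so it
depends on equal OIDs having been made adjacent by that oidcmp() sort.
That's why v2 first lets for_each_abbrev() sort and de-duplicate, and
only afterwards applies the custom comparator to the collected result
before printing it.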