On Tue, May 01 2018, Derrick Stolee wrote:

> On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, May 01 2018, Derrick Stolee wrote:
>>
>>> On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> Since we show the commit data in the output, it's nicely aligned
>>>> once we sort by object type. The decision to show tags before
>>>> commits is pretty arbitrary, but it's much less likely that we'll
>>>> display a tag, so if there is one it makes sense to show it first.
>>>
>>> Here's a non-arbitrary reason: the object types are ordered
>>> topologically (ignoring self-references):
>>>
>>> tag -> commit, tree, blob
>>> commit -> tree
>>> tree -> blob
>>
>> Thanks. I'll add a patch with that comment to v2.
>>
>>>> @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid,
>>>>  		ds.fn = NULL;
>>>>  		advise(_("The candidates are:"));
>>>> -		for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
>>>> +		for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
>>>> +		QSORT(collect.oid, collect.nr, sort_ambiguous);
>>>
>>> I was wondering how the old code sorted by SHA even when the
>>> ambiguous objects were loaded from different sources (multiple
>>> pack-files, loose objects). Turns out that for_each_abbrev() does
>>> its own sort after collecting the SHAs and then calls the given
>>> function pointer only once per distinct object. This avoids multiple
>>> instances of the same object, which may appear multiple times across
>>> pack-files.
>>>
>>> I only ask because now we are doing two sorts. I wonder if it would
>>> be more elegant to provide your sorting algorithm to
>>> for_each_abbrev() and let it call show_ambiguous_object as before.
>>>
>>> Another question is if we should use this sort generally for all
>>> calls to for_each_abbrev(). The only other case I see is in
>>> builtin/rev-parse.c.
>>
>> When preparing v2 I realized how confusing this was, so I'd added
>> this to the commit message of my WIP re-roll, which should explain
>> it:
>>
>>     A note on the implementation: I started out with something much
>>     simpler which just replaced oid_array_sort() in sha1-array.c with
>>     a custom sort function before calling
>>     oid_array_for_each_unique(). But then I dumbly noticed that it
>>     doesn't work, because the output function was tangled up with the
>>     code added in fad6b9e590 ("for_each_abbrev: drop duplicate
>>     objects", 2016-09-26) to ensure we don't display duplicate
>>     objects.
>>
>>     That's why we're doing two passes here: first we need to sort the
>>     list and de-duplicate the objects, then sort them in our custom
>>     order, and finally output them without re-sorting them. I suppose
>>     we could also make oid_array_for_each_unique() maintain a hashmap
>>     of emitted objects, but that would increase its memory profile
>>     and wouldn't be worth the complexity for this one-off use-case;
>>     oid_array_for_each_unique() is used in many other places.
>
> How would sorting in our custom order before de-duplicating fail the
> de-duplication? We will still pair identical OIDs as consecutive
> elements, and oid_array_for_each_unique() only cares about consecutive
> elements having distinct OIDs, not lex-ordered OIDs.
Because there's no de-duplication without the array first being sorted
in oidcmp() order, which oid_array_for_each_unique() checks for and
re-sorts if !array->sorted. I.e. its de-duplication is just a state
machine where it won't call the callback if the currently processed
element has the same SHA1 as the last one.

> Perhaps the noise is because we rely on oid_array_sort() to mark the
> array as sorted inside oid_array_for_each_unique(), but that could be
> remedied by calling our QSORT() inside for_each_abbrev() and marking
> the array as sorted before calling oid_array_for_each_unique().

As noted above this won't work, because the function inherently relies
on the array being sorted to be able to de-duplicate. Doing this will
yield duplicate entries.
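
To make that concrete, here's a simplified sketch of what
oid_array_for_each_unique() does, paraphrased from memory rather than
quoted from sha1-array.c; the exact names and the error-return handling
are approximations:

	int oid_array_for_each_unique(struct oid_array *array,
				      for_each_oid_fn fn, void *data)
	{
		int i;

		/* any caller-supplied pre-sort is thrown away here
		 * unless ->sorted is already set */
		if (!array->sorted)
			oid_array_sort(array);	/* oidcmp() order */

		for (i = 0; i < array->nr; i++) {
			int ret;

			/* the "state machine": skip an element that's
			 * identical to its immediate predecessor */
			if (i > 0 && !oidcmp(array->oid + i, array->oid + i - 1))
				continue;
			ret = fn(array->oid + i, data);
			if (ret)
				return ret;
		}
		return 0;
	}

The duplicate check only ever compares neighbouring elements, so it
depends on equal OIDs having been made adjacent by that oidcmp() sort.
That's why v2 first lets for_each_abbrev() sort and de-duplicate, and
only afterwards applies the custom comparator to the collected result
before printing it.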