Tim Peters <[email protected]> added the comment:
To be concrete, here's an implementation of a full-blown, stable lexicographic
sort based on "bucket refinement". `xs` is a list of sequences to be sorted in
lexicographic order. The types of the sequences don't matter (lists, tuples,
strings, ...). Indeed, since list elements are never compared against each
other directly, they don't even have to be the same sequence type.
This is already faster in pure Python than list.sort() for cases like:
    xs = [random.choices(range(100000), k=random.randrange(5, 30))
          for i in range(1000000)]
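A minimal timing sketch of that comparison (illustrative only; lexisort() is
the function given at the end of this message):

import random
import time

xs = [random.choices(range(100000), k=random.randrange(5, 30))
      for i in range(1000000)]
ys = xs[:]

t0 = time.perf_counter()
lexisort(xs)         # the bucket-refinement sort below
t1 = time.perf_counter()
ys.sort()            # CPython's built-in sort, for comparison
t2 = time.perf_counter()

assert xs == ys      # both yield the same lexicographic order
print(f"lexisort   {t1 - t0:.2f}s")
print(f"list.sort  {t2 - t1:.2f}s")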
However, for cases like strings of the form
    'x' * 10_000 + str(i)
it's very much slower than list.sort(), despite that it "should be" very much
faster. That appears mostly due to the:
    t.sort(key=lambda x: x[k])
    xs[lo : hi] = t
lines. list.sort() doesn't accept `lo` and `hi` arguments, so sorting _just_ a
slice requires copying that slice into a temp list, sorting the temp, then
copying the sorted temp back in. So dealing with a single `x` position in the
string prefixes effectively requires physically copying the entire list twice
over - mad overhead to copy the list 20 thousand times.
While the need doesn't come up all that often, I'd support adding optional `lo`
and `hi` arguments to `list.sort()`. This isn't the first time the lack has
turned a slick approach into a timing disaster for me ;-)
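Spelled out, the dance the current API forces is just this (sort_slice is a
name used here purely for illustration):

def sort_slice(xs, lo, hi, key=None):
    # Sorting just xs[lo:hi] today: copy the slice out, sort the temp,
    # copy it back in - two full copies of the slice per call.
    t = xs[lo:hi]
    t.sort(key=key)
    xs[lo:hi] = t

# With hypothetical lo/hi arguments - say xs.sort(key=key, lo=lo, hi=hi) -
# both copies would go away; list.sort() has no such arguments today.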
About list.sort()'s merge strategy, I'm glad to be rid of the original. It's
not just correctness, it's the effort of _reasoning_ about its consequences. It
was about a full decade before the first proof was published that it really
is a worst-case O(N log N) sort. listsort.txt didn't contain a proof because
the only proof attempts I had were so complicated not even _I_ found them
compelling ;-)
Vincent Jugé in particular started at the other end, looking for a merge
strategy that made proof straightforward instead of Byzantine. It's
straightforward under the "powersort" strategy too, although it relies on "well
known" results about approximations to optimal binary search trees.
def lexisort(xs):
    # Stack of (lo, hi, k) "buckets": the slice xs[lo:hi] still needs to be
    # sorted on position k; everything in it already agrees on positions < k.
    buckets = [(0, len(xs), 0)]
    while buckets:
        lo, hi, k = buckets.pop()
        # Sequences exhausted at position k sort first; move them to the
        # front (stably), collecting the rest in t.
        t = []
        for i in range(lo, hi):
            x = xs[i]
            if k >= len(x):
                xs[lo] = x
                lo += 1
            else:
                t.append(x)
        # Stable sort of the remainder on element k alone.
        t.sort(key=lambda x: x[k])
        xs[lo : hi] = t
        # Each run of equal k-th elements longer than 1 becomes a new
        # bucket, to be refined on position k+1.
        while lo < hi:
            pivotk = xs[lo][k]
            i = lo + 1
            while i < hi and xs[i][k] == pivotk:
                i += 1
            if i - lo > 1:
                buckets.append((lo, i, k + 1))
            lo = i
        assert lo == hi
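For example, a quick sanity check with mixed sequence types (illustration
only):

import random

data = []
for i in range(10_000):
    seq = random.choices(range(50), k=random.randrange(0, 8))
    data.append(random.choice([list, tuple])(seq))   # mix lists and tuples

expected = sorted(data, key=tuple)   # normalize so lists and tuples compare
lexisort(data)
assert [tuple(x) for x in data] == [tuple(x) for x in expected]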
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue45530>
_______________________________________