Tim Peters <[email protected]> added the comment:
Stefan, thanks - I think I understand what you're driving at now.
You're (re)discovering that sorting by lexicographic ordering rules is
_inherently_ suboptimal if list elements can only be compared "starting at
index 0" each time. Not just tuples, but also, e.g., lists, byte arrays, and
even strings. Any sequence type whose ordering derives from lexicographic
generalization of its elements' ordering.
This is quite independent of whether or not CPython tries to exploit type
homogeneity.
The usual(*) solution is indeed to exploit a kind of bucket sort instead, first
sorting at elements' index 0 alone, saying all those elements with the same
value at index 0 constitute "a bucket", and then applying the same idea to each
bucket but sorting on index 1 instead. Where those in turn are partitioned into
buckets equal at index 1, and those buckets are similarly each sorted on index
2. And so on, and so on.
No doubt that can be a big win, and even in the absence of the type homogeneity
tricks. It's far beyond the scope of _this_ PR, though, and is really an
entirely different approach.
I've thought about it at times, but it's "a real project" to do it right. In
effect, it would implement an optimized generalization of the SO OP's
"automagically convert multikey to multisort" wish.
Fine by me if someone pursues that, but I don't have the energy for it, nor
much interest either in doing a half-hearted job of it.
I always expected that if it came up at all, it would be in the context of
sorting lists of strings. For example, consider:
xs = ['x' * 10_000 + str(i) for i in range(1_000_000)]
random.shuffle(xs)
Even with type homogeneity tricks, Python's sorting of the result is very far
from optimal. Every single comparison starts by burning time skipping over the
(at least) ten thousands equal characters at the start.
The "bucket sort" (or it could be viewed as a form of MSD radix sort) quickly
discovers that all the characters at index 0 are equal, and also all those at
index 1, ..., and all those at index 9999. The "real work" of sorting is then
reduced to working on the (at most) 6 characters remaining.
(*) But I'm lying ;-) The actual usual solution is to use "multikey quicksort",
because it's easy to code and works "in place". But it's not a stable sort.
String sorting doesn't care about that, but sorting other kinds of sequences
can care a lot.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue45530>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com