Note that CPython's sort is in the business of merging sorted
(sub)lists.  Any mergesort is.  But CPython's adapts to take
advantage, when possible, of "lumpy" distributions.  For example, if
you sort

      list(range(1000000, 2000000)) + list(range(0, 1000000))

it goes very fast (O(N)).  Because it first rediscovers that the list
is a catenation of two sorted sublists, then notices that because the
smallest element of the left sublist is greater than the largest
element of the right sublist, there's no need to compare anything else
_between_ the sublists.  The sorted result has to be the right sublist
followed the left one.

The same would work for a union operation if the sublists were given
as distinct arguments.  Similarly for intersection (which would deduce
almost instantly that the intersection must be empty).  And so on.

That's not by accident - the inspiration for CPython's sort's basic
"galloping" approach was taken from this paper, which wasn't about
sorting at all:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

That's concerned with manipulating large sets represented as sorted
sequences.  They noticed too that data was "lumpy" in real life, and
that paper was the start of a number of papers trying ways to take
advantage of that.

So if someone _really_ wants to go gonzo with this, that's a promising
approach.  Do note, though, that it doesn't fit naturally with
iterators.  "Galloping" relies on O(1) access to data at arbitrary
index positions; it's a fancier variant of binary search (see
listsort.txt in your CPython source distribution for details).

If cache misses can be so expensive, how can that win?  Because the
number of index accesses per galloping step is logarithmic in the
number of sequence elements consumed, and "compare" steps can (very
roughly speaking) be skipped entirely for all the sequence elements
_not_ touched by the galloping indices.  Pointers to those are merely
copied, without even looking at the data they point to.

Which often turns out to be a huge win in CPython, because comparison
is CPython is expensive.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GJQQBHVEEL7PRMKBRCEA2BKP7CFURELW/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to