Note that CPython's sort is in the business of merging sorted (sub)lists. Any mergesort is. But CPython's adapts to take advantage, when possible, of "lumpy" distributions. For example, if you sort
list(range(1000000, 2000000)) + list(range(0, 1000000)) it goes very fast (O(N)). Because it first rediscovers that the list is a catenation of two sorted sublists, then notices that because the smallest element of the left sublist is greater than the largest element of the right sublist, there's no need to compare anything else _between_ the sublists. The sorted result has to be the right sublist followed the left one. The same would work for a union operation if the sublists were given as distinct arguments. Similarly for intersection (which would deduce almost instantly that the intersection must be empty). And so on. That's not by accident - the inspiration for CPython's sort's basic "galloping" approach was taken from this paper, which wasn't about sorting at all: "Adaptive Set Intersections, Unions, and Differences" (2000) Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro That's concerned with manipulating large sets represented as sorted sequences. They noticed too that data was "lumpy" in real life, and that paper was the start of a number of papers trying ways to take advantage of that. So if someone _really_ wants to go gonzo with this, that's a promising approach. Do note, though, that it doesn't fit naturally with iterators. "Galloping" relies on O(1) access to data at arbitrary index positions; it's a fancier variant of binary search (see listsort.txt in your CPython source distribution for details). If cache misses can be so expensive, how can that win? Because the number of index accesses per galloping step is logarithmic in the number of sequence elements consumed, and "compare" steps can (very roughly speaking) be skipped entirely for all the sequence elements _not_ touched by the galloping indices. Pointers to those are merely copied, without even looking at the data they point to. Which often turns out to be a huge win in CPython, because comparison is CPython is expensive. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/GJQQBHVEEL7PRMKBRCEA2BKP7CFURELW/ Code of Conduct: http://python.org/psf/codeofconduct/