Just submitted a PR implementing this: https://github.com/python/cpython/pull/582 -- just need someone to review it now :)
Thanks for all your feedback, everyone! On Sun, Mar 5, 2017 at 12:19 AM Elliot Gorokhovsky < [email protected]> wrote: > (Summary of results: my patch at https://bugs.python.org/issue28685 makes > list.sort() 30-50% faster in common cases, and at most 1.5% slower in the > uncommon worst case.) > > Hello all, > > You may remember seeing some messages on here about optimizing list.sort() > by exploiting type-homogeneity: since comparing apples and oranges is > uncommon (though possible, i.e. float to int), it pays off to check if the > list is type-homogeneous (as well as homogeneous with respect to some other > properties), and, if it is, to replace calls to PyObject_RichCompareBool > with calls to ob_type->tp_richcompare (or, in common special cases, to > optimized compare functions). The entire patch is less than 250 lines of > code, most of which is pretty boilerplate (i.e. a lot of assertions in > #ifdef Py_DEBUG blocks, etc). > > I originally wrote that patch back in November. I've learned a lot since > then, both about CPython and about mailing list etiquette :). Previous > discussion about this can be found at > https://mail.python.org/pipermail/python-dev/2016-October/146648.html and > https://mail.python.org/pipermail/python-ideas/2016-October/042807.html. > > Anyway, I recently redid the benchmarks much more rigorously (in > preparation for presenting this project at my school's science fair), > achieving a standard deviation of less than 0.5% of the mean for all > measurements. The exact benchmark script used can be found at > https://github.com/embg/python-fastsort-benchmark (it's just sorting > random lists of/lists of tuples of [type]. While listsort.txt talks about > benchmarking different kinds of structured lists, instead of just random > lists, the results here would hold in those cases just as well, because > this makes individual comparisons cheaper, instead of reducing the number > of comparisons based on structure). > > I also made a poster describing the optimization and including a pretty > graph displaying the benchmark data: > https://github.com/embg/python-fastsort-benchmark/blob/master/poster.pdf. > For those who would rather read the results here (though it is a *really* > pretty graph): > > *** > Percent improvement for sorting random lists of [type] > (1-patched/unpatched): > float: 48% > bounded int (magnitude smaller than 2^32): 48.4% > latin string (all characters in [0,255]): 32.7% > general int (reasonably uncommon?): 17.2% > general string (reasonably uncommon?): 9.2% > tuples of float: 63.2% > tuples of bounded int: 64.8% > tuples of latin string: 55.8% > tuples of general int: 50.3% > tuples of general string: 44.1% > tuples of heterogeneous: 41.5% > heterogeneous (lots of float with an int at the end; worst-case): -1.5% > *** > > Essentially, it's a gamble where the payoff is 20-30 times greater than > the cost, and the odds of losing are very small. Sorting is perhaps not a > bottleneck in most applications, but considering how much work has gone > into Python's sort (Timsort, etc; half of listobject.c is sort code), I > think it's interesting that list.sort() can be made essentially twice > faster by a relatively simple optimization. I would also add that Python > dictionaries already implement this optimization: they start out optimizing > based on the assumption that they'll only be seeing string keys, checking > to make sure that assumption holds as they go. If they see a non-string > key, they permanently switch over to the general implementation. So it's > really the same idea, except here it doesn't matter as much what type we're > dealing with, which is important, because lists are commonly used with lots > of different types, as opposed to dictionaries, which overwhelmingly > commonly use string keys, especially internally. (Correct me if I'm wrong > in any of the above). > > I got a lot of great feedback/input on this patch as I was writing it, but > after submitting it, I didn't hear much from anybody. (The reason I took so > long to post was because I wanted to wait until I had the chance to do the > benchmarks *right*). What do you all think? > > Thanks, > Elliot >
_______________________________________________ Python-ideas mailing list [email protected] https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
