New submission from Elliot Gorokhovsky:

When Python compares two objects, there are many safety checks that have to be
performed before the actual comparison can take place. These include checking
the types of the two objects, as well as checking various type-specific
properties, like character width for strings or the number of machine words
that represent a long.
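
To see why these checks have to run on every single comparison, note that a
mixed-type sort only fails at the moment the incompatible pair is actually
compared. A small illustration (plain Python, nothing patch-specific):

    # Each comparison re-validates operand types at call time; sorting a
    # mixed list fails only when the incompatible pair gets compared.
    try:
        sorted([1.0, 2.0, "three"])
    except TypeError as e:
        # e.g. "'<' not supported between instances of 'str' and 'float'"
        print(e)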

Obviously, the vast majority of the work done during a list.sort() is spent in
comparisons, so even small per-comparison optimizations can be important. What
I noticed is that since we have n objects, but we're doing O(n log n)
comparisons, it pays off to do all the safety checks in a single O(n) pass and
then select an optimized comparison function that leverages the information
gained to avoid as many sort-time checks as possible. I made the following
assumptions:

1. In practice, lists to be sorted are almost always type-homogeneous.
2. Most lists to be sorted consist of homogeneous elements of the following
types:
    a. Latin strings (i.e. strings whose characters are one byte wide)
    b. Longs that fit in a single machine word
    c. Floats
    d. Tuples of the above
3. The above assumptions also hold for the list of first elements extracted
from a list of tuples to be sorted.
4. The vast majority of tuple comparisons never get past the first element
(i.e. the first elements of tuples are rarely equal).
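
Assumption (4) matters because tuple comparison short-circuits: later elements
are only compared when all earlier ones are equal. A quick plain-Python
demonstration (the Loud subclass is just an illustrative gadget, not part of
the patch):

    class Loud(float):
        def __lt__(self, other):
            print("second elements compared!")
            return float.__lt__(self, other)

    # First elements differ: Loud.__lt__ never runs.
    print((1.0, Loud(5.0)) < (2.0, Loud(6.0)))  # True, nothing printed
    # First elements equal: only now are the second elements compared.
    print((1.0, Loud(5.0)) < (1.0, Loud(6.0)))  # prints, then True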

My patch adds the following routine to list.sort() (a pure-Python sketch of it
appears below):
1. Go through the list and check which of the assumptions, if any, hold
(except (4), which can't be checked in advance).
2. Replace PyObject_RichCompareBool() with an optimized compare function that
skips some safety checks by applying the assumptions we've already verified to
be true.
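
Here is a pure-Python sketch of that routine, just to show the structure. All
names here are illustrative, not the patch's actual identifiers; the real work
happens in C inside Objects/listobject.c, where skipping the checks actually
saves time. In pure Python we can only demonstrate the dispatch, not the
speedup:

    import functools

    def choose_compare(lst):
        # Step (1): one O(n) pass to verify type homogeneity.
        t = type(lst[0])
        if any(type(x) is not t for x in lst[1:]):
            return None              # assumption (1) failed: generic fallback
        # Step (2): pick a compare that may skip per-call safety checks.
        if t is float:
            return float.__lt__      # stand-in for an unsafe float compare
        if t is int:
            return int.__lt__        # stand-in for a machine-word int compare
        if t is str:
            return str.__lt__        # stand-in for a latin-string compare
        return None                  # homogeneous, but no specialized compare

    def patched_sort(lst):
        # In the C patch, the chosen function simply replaces
        # PyObject_RichCompareBool() inside the existing sort.
        lt = choose_compare(lst) if lst else None
        if lt is None:
            lst.sort()
        else:
            lst.sort(key=functools.cmp_to_key(
                lambda a, b: -1 if lt(a, b) else (1 if lt(b, a) else 0)))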

There are then two questions: when none of the assumptions hold, how expensive 
is (1)? And when we do get a "hit", how much time do we save by applying (2)?

Those questions, of course, can only be answered by benchmarks. So here they
are, computed on an isolated CPU core, on two Python interpreters, both
compiled with ./configure --with-optimizations:

# This is a runnable script. Just build the reference interpreter in
# python-ref and the patched interpreter in python-dev.

# Pathological cases (where we pay for (1) but don't get any savings from
# (2)): what kind of losses do we suffer?

# How expensive is (1) for empty lists?
python-dev/python -m perf timeit -s "l=[]" "sorted(l)" --rigorous 
python-ref/python -m perf timeit -s "l=[]" "sorted(l)" --rigorous  
# Median +- std dev: 212 ns +- 9 ns
# Median +- std dev: 216 ns +- 10 ns
# (difference is within std dev)

# How expensive is (1) for singleton lists?
python-dev/python -m perf timeit -s "l=[None]" "sorted(l)" --rigorous 
python-ref/python -m perf timeit -s "l=[None]" "sorted(l)" --rigorous
# Median +- std dev: 235 ns +- 16 ns
# Median +- std dev: 238 ns +- 15 ns
# (difference is within std dev)

# How expensive is (1) for non-type-homogeneous lists?
# (we use small lists because as n gets large, the fraction of time spent
# in (1) gets smaller)
python-dev/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,10)]+[1]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,10)]+[1]" "sorted(l)" --rigorous
# Median +- std dev: 784 ns +- 35 ns
# Median +- std dev: 707 ns +- 51 ns
# About 10% slower. While this is unfortunate, this case almost never occurs
# in practice, and the losses are certainly offset by the gains we'll see
# below. Also, note that for large lists the difference would be *much*
# smaller: the pre-pass (1) costs O(n) while the comparisons cost O(n log n),
# so the fraction of time spent in (1) shrinks like 1/log(n).
# So basically, we only pay a penalty for small lists that are not
# type-homogeneous.

# OK, now for the cases that actually occur in practice:

# What kind of gains do we get for floats?
python-dev/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 3.63 us +- 0.20 us
# Median +- std dev: 8.81 us +- 0.37 us
# Wow! 59% faster! And if we used a larger list, we would've gotten even
# better numbers.

# What kind of gains do we get for latin strings?
python-dev/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 9.51 us +- 0.28 us
# Median +- std dev: 15.9 us +- 0.7 us
# 40% faster! 

# What kind of gains do we get for non-latin strings (which I imagine aren't
# that common to sort in practice anyway)?
python-dev/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1))+'\uffff' for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1))+'\uffff' for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 12.2 us +- 0.4 us
# Median +- std dev: 14.2 us +- 0.6 us
# 14%. Not as impressive, but again, this is a bit of a pathological case.

# What kind of gains do we get for ints that fit in a machine word? (I'll
# keep them in (-2^15, 2^15) to be safe)
python-dev/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 4.92 us +- 0.35 us
# Median +- std dev: 9.59 us +- 0.75 us
# 49% faster. Pretty cool stuff!

# What kind of gains do we get for pathologically large ints?
python-dev/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**100) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**100) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 6.93 us +- 0.26 us
# Median +- std dev: 8.93 us +- 0.23 us
# 22% faster. The same comparison function is used as in the non-latin string
# case, but the numbers differ because comparing strings is more expensive
# than comparing ints, so for strings the skipped type checks are a smaller
# fraction of the total time. Regardless, 22% is still pretty cool. And I
# can't imagine it's that common to sort huge ints anyway, unless you're
# doing some sort of math conjecture testing or something.

# What kind of gains do we get for tuples whose first elements are floats?
python-dev/python -m perf timeit -s "import random; l=[(random.uniform(-1,1), 'whatever') for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[(random.uniform(-1,1), 'whatever') for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 5.83 us +- 0.50 us
# Median +- std dev: 21.5 us +- 0.5 us
# WOW! 73% faster!
# A note: I know of at least one application where this is relevant. The DEAP
# evolutionary algorithms library, which is very popular, has to sort tuples
# of floats when it's selecting individuals for the next generation. It
# spends a non-trivial amount of time sorting those tuples of floats, and
# this patch cuts that non-trivial time by 73%! Pretty cool! Now we can
# evolve more individuals more better!

# What kind of gains do we get for tuples whose first elements are latin
# strings?
python-dev/python -m perf timeit -s "import random; l=[(str(random.uniform(-1,1)), 42) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[(str(random.uniform(-1,1)), 42) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 14.9 us +- 0.8 us
# Median +- std dev: 31.2 us +- 1.3 us
# 52%. Not too shabby!

# What kind of gains do we get for tuples of other stuff?
# Obviously there are lots of combinations possible; I won't waste your time
# documenting all of them. Suffice it to say that the gains for lists of
# tuples of x are always greater than the gains for lists of x, since we
# bypass two layers of safety checks: checks on the tuples, and then checks
# on the x.

# What kind of gains do we get for arbitrary lists of objects of the same
# type? See the benchmarks for large ints and non-latin strings. The number
# will be different for each type, since it depends on the ratio between the
# cost of the type checks and the cost of the comparison itself. Regardless,
# the gain is always non-trivial, unless the cost of comparison is huge.

# End of script #

TL;DR: This patch makes common list.sort() cases 40-75% faster, and makes
very uncommon pathological cases at worst 15% slower. It accomplishes this by
performing the endless, but necessary, safety checks involved in comparison
up front, and then using optimized compares that skip the already-performed
checks during the actual sort.

----------
components: Interpreter Core
files: fastsort.patch
keywords: patch
messages: 280718
nosy: elliot.gorokhovsky
priority: normal
severity: normal
status: open
title: Optimizing list.sort() by performing safety checks in advance
type: performance
versions: Python 3.7
Added file: http://bugs.python.org/file45477/fastsort.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28685>
_______________________________________