New submission from Raymond Hettinger:
This patch applies three techniques to tighten up the generated code for
lookkey() for sets.
I'm not sure I want to do this because it expands the size of the code quite a
bit (much like the previously existing lookkey specializations did). As such,
it offends my sense of beauty and neatness.
That said, it greatly tightens the inner search loop and in the handful of
timings I've done so far, it provides a good speed-up for smaller sets (ones
that fit in L1 and L2 cache) and minor improvements for bigger sets (where
running time is dominated by cost of fetching a cache-line with a key/hash
pair).
The three transformations are:
1.) Make two variants of lookkey, one with an early-out for a dummy entry and
one that treats a dummy entry as an active entry. This eliminates the
freeslot tracking in both cases.
In the first case (early-out), it frees one register (the one for the
freeslot), it simplifies the inner-loop test (no need to see if freeslot is
NULL), it provides an early-out (return on dummy rather than continuing to
search), and it simplifies the return code (just return entry, no need to check
the freeslot).
In the second case (dummies treated as active), it frees two registers (one for
freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop
test entirely.
The eliminated inner-loop code used to look like this with gcc-4.9:
cmpq %r15, %rbp entry->key != dummy
jne L375 |
testq %r13, %r13 freeslot == NULL |
cmove %rbx, %r13 |
L375: <---|
And the eliminated inner-loop code was even worse with clang:
leaq __dummy_struct(%rip), %rax
cmpq %rax, %r14 entry->key==dummy
sete %al
testq %rbx, %rbx freeslot==NULL
sete %cl
testb %cl, %al
cmovneq %r13, %rbx freeslot ?= entry
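For concreteness, the shape of the split is roughly like the C sketch below.
This is not the patch code: the entry layout and the dummy sentinel mimic
Objects/setobject.c, but the function names, the pointer-identity key test,
and the simple probing are illustrative stand-ins that only show the control
flow which lets the freeslot and &dummy registers disappear.
--- simplified sketch of transformation 1 ---
#include <stddef.h>

typedef struct {
    void *key;      /* NULL = never used, dummy = deleted, else a live key */
    long  hash;
} setentry;

static long dummy_struct;
#define dummy ((void *)&dummy_struct)

/* Variant 1: a dummy slot is an acceptable answer, so return on the first
   dummy instead of remembering it in a freeslot variable.  Assumes the
   table always has at least one empty slot, as CPython guarantees. */
static setentry *
lookkey_dummy_allowed_sketch(setentry *table, size_t mask, void *key, long hash)
{
    size_t perturb = (size_t)hash;
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];

    while (entry->key != NULL) {
        if (entry->key == key || entry->key == dummy)
            return entry;                   /* early-out on a dummy */
        perturb >>= 5;
        i = (i * 5 + 1 + perturb) & mask;
        entry = &table[i];
    }
    return entry;                           /* first never-used slot */
}

/* Variant 2: a dummy slot is just a key that doesn't match, so the loop
   never mentions dummy or freeslot at all. */
static setentry *
lookkey_dummy_ignored_sketch(setentry *table, size_t mask, void *key, long hash)
{
    size_t perturb = (size_t)hash;
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];

    while (entry->key != key) {
        if (entry->key == NULL)
            return entry;                   /* empty slot: key is absent */
        perturb >>= 5;
        i = (i * 5 + 1 + perturb) & mask;
        entry = &table[i];
    }
    return entry;                           /* match */
}
Neither loop keeps a freeslot: the first variant drops the freeslot==NULL test
and gains the early return, and the second never loads &dummy, which is what
removes the cmov sequences shown above.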
2.) Avoid the &mask step for computing indices when not needed. Using the same
technique that is used in set_insert_clean, we test to see if there is a
possible wrap-around during the linear probes. If not, we can skip the lengthy
address calculation for table[(i+j)&mask] and instead use entry++.
This saves the sequentially dependent steps of adding j, the bitwise-and with
the mask, the shift left by four to scale the index, and the add of the table
start address.  It replaces all of that with a single entry++, which compiles
to an add $16.
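A simplified sketch of the resulting fast path is below.  LINEAR_PROBES and
PERTURB_SHIFT are named after the constants in Objects/setobject.c; the rest
of the names, the pointer-identity key test, and the fallback probing are
illustrative only.
--- simplified sketch of transformation 2 ---
#include <stddef.h>

#define LINEAR_PROBES 9     /* as in Objects/setobject.c */
#define PERTURB_SHIFT 5

typedef struct {
    void *key;
    long  hash;
} setentry;

/* Probe for key, stepping with entry++ whenever the run of linear probes
   cannot wrap past the end of the table (the same kind of bounds test that
   set_insert_clean() uses).  Assumes the table is never full. */
static setentry *
probe_entry_plus_plus_sketch(setentry *table, size_t mask, void *key, long hash)
{
    size_t perturb = (size_t)hash;
    size_t i = (size_t)hash & mask;
    size_t j;
    setentry *entry;

    while (1) {
        entry = &table[i];
        if (entry->key == NULL || entry->key == key)
            return entry;
        if (i + LINEAR_PROBES <= mask) {
            /* No wrap-around possible: a pointer increment (one add $16 on
               a 16-byte setentry) replaces the add of j, the &mask, the
               shift by four, and the add of the table base address. */
            for (j = 0; j < LINEAR_PROBES; j++) {
                entry++;
                if (entry->key == NULL || entry->key == key)
                    return entry;
            }
        }
        perturb >>= PERTURB_SHIFT;
        i = (i * 5 + 1 + perturb) & mask;   /* slow path: full index math */
    }
}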
3.) In the linear-probe inner loop for the ignore-dummy case, replace
(entry->key==NULL) with (entry->hash==0 && entry->key==NULL).  While that
looks like a step backward, adding an extra test, the new hash==0 test is
essentially free (a compare on a value already loaded for the hash == tgt
test, with a branch that is nearly 100% predictable), and the key==NULL check
only runs on the rare hash==0 path.  The benefit is that the inner loop only
needs to fetch the entry->hash value and doesn't need to fetch entry->key.
This doesn't sound like much, but it cuts the number of memory accesses per
loop in half.
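In loop form, the change amounts to something like the sketch below (again
with illustrative names and a pointer-identity key test; the real code still
has to do a full key comparison after a hash match to guard against
collisions).
--- simplified sketch of transformation 3 (ignore-dummy case) ---
#include <stddef.h>

typedef struct {
    void *key;      /* NULL only in never-used slots */
    long  hash;     /* 0 in never-used slots, but 0 is also a legal live hash */
} setentry;

/* The hot loop touches only entry->hash; entry->key is read only on the rare
   hash==0 path (to confirm the slot is really empty) or on a hash match.
   Deleted (dummy) entries simply fail both tests and probing continues. */
static setentry *
lookkey_hash_first_sketch(setentry *table, size_t mask, void *key, long hash)
{
    size_t perturb = (size_t)hash;
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];

    while (1) {
        if (entry->hash == 0 && entry->key == NULL)
            return entry;                   /* truly empty: key is absent */
        if (entry->hash == hash && entry->key == key)
            return entry;                   /* hit */
        perturb >>= 5;
        i = (i * 5 + 1 + perturb) & mask;
        entry = &table[i];
    }
}
Since an active entry can legitimately have hash 0, the key==NULL check is
what distinguishes a real empty slot; it just no longer sits on the hot path.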
Individually, the three transformations don't seem like much, but they take
already tight code and cut the work by more than half (for the L1 and L2
case).  Collectively, they free a couple of registers for better register
allocation, reduce the number of compares inside the loop, substantially
reduce the time for the entry address calculation, and, in the ignore-dummy
case, cut the number of memory accesses per loop in half.
The resulting code is astonishingly tight compared to the original and looks
almost as tight as the code in set_insert_clean().
--- inner-loop of set_lookkey_dummy_allowed() ---
L393:
cmpq %r8, (%rbx) entry->key == dummy
je L453
cmpq %r12, %rbx entry == limit
movq %rbx, %r14
je L506
L412:
movq 16(%r14), %rbp
leaq 16(%r14), %rbx
testq %rbp, %rbp entry->key NULL
je L449
cmpq 8(%rbx), %r15 entry->hash == hash
jne L393
--- inner-loop of set_lookkey_dummy_ignored() ---
L846:
cmpq %r13, %rbx entry == limit
je L942
addq $16, %rbx entry++
movq 8(%rbx), %rax entry->hash == 0
testq %rax, %rax
jne L845
cmpq $0, (%rbx) entry->key == NULL (not in the inner loop)
je L905
L845:
cmpq %r12, %rax entry->hash == tgt
jne L846
Before anything can go forward, more timings are needed and I need to come to
terms with the code expansion back to the size it was before the old lookkey
specializations were removed.  For now, I just want to document the work that
has been done.
Anyway, here it is if someone wants to experiment with it.  Expect benefits in
the cases where the inner loop matters (small to medium-sized sets that have
collisions -- giant sets benefit as well, but their lookup speed is dominated
by the cost of fetching a single random cache line).
----------
files: hoisted_and_specialized.diff
keywords: patch
messages: 235131
nosy: rhettinger
priority: low
severity: normal
stage: patch review
status: open
title: Speed-up set_lookkey()
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file37945/hoisted_and_specialized.diff
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue23359>
_______________________________________