New submission from Raymond Hettinger:

This patch applies three techniques to tighten up the generated code for the 
set lookkey() function.

I'm not sure I want to do this because it expands the size of the code quite a 
bit (much like the previously existing lookkey specializations did).  As such, 
it offends my sense of beauty and neatness.

That said, it greatly tightens the inner search loop and, in the handful of 
timings I've done so far, it provides a good speed-up for smaller sets (ones 
that fit in L1 and L2 cache) and minor improvements for bigger sets (where 
running time is dominated by the cost of fetching a cache line holding the 
key/hash pair).

The three transformations are:

1.)  Make two variants of lookkey, one with an early-out for a dummy entry and 
one that treats a dummy entry as an active entry.   This eliminates the 
freeslot tracking in both cases.  

In the first case (early-out), it frees one register (the one for the 
freeslot), it simplifies the inner-loop test (no need to see if freeslot is 
NULL), it provides an early-out (return on dummy rather than continuing to 
search), and it simplifies the return code (just return entry, no need to check 
the freeslot).

In the second case (dummies treated as active), it frees two registers (one for 
freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop 
test entirely.

The eliminated inner-loop code used to look like this with gcc 4.9:

    cmpq    %r15, %rbp      entry->key != dummy
    jne L375                                   |
    testq   %r13, %r13      freeslot == NULL   |
    cmove   %rbx, %r13                         |
L375:                                      <---|

And the eliminated inner-loop code was even worse with clang:

    leaq    __dummy_struct(%rip), %rax
    cmpq    %rax, %r14                entry->key==dummy
    sete    %al
    testq   %rbx, %rbx                freeslot==NULL
    sete    %cl
    testb   %cl, %al
    cmovneq %r13, %rbx                freeslot ?= entry
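
To make the two loop shapes concrete, here is a rough standalone C sketch of 
the variants described in (1).  The entry_t/hash_t types and the probe_* names 
are simplified stand-ins invented for illustration (key matching is reduced to 
pointer identity); the actual patch works on CPython's setentry and 
PySetObject internals.

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins for CPython's setentry/Py_hash_t (sketch only). */
    typedef intptr_t hash_t;
    typedef struct { void *key; hash_t hash; } entry_t;

    static entry_t _dummy;            /* sentinel marking a deleted slot */
    #define DUMMY (&_dummy)

    /* Variant 1 (early-out on dummy): no freeslot register, no freeslot==NULL
       test inside the loop, and a dummy slot ends the search immediately. */
    static entry_t *
    probe_dummy_allowed(entry_t *table, size_t mask, size_t i,
                        hash_t hash, void *key)
    {
        for (;;) {
            entry_t *entry = &table[i & mask];
            if (entry->key == NULL || entry->key == DUMMY)
                return entry;         /* empty or reusable slot: stop here */
            if (entry->hash == hash && entry->key == key)
                return entry;         /* found the key */
            i++;                      /* linear probe to the next slot */
        }
    }

    /* Variant 2 (dummy ignored): a dummy slot simply fails the match and the
       probe continues, so neither &dummy nor a freeslot ties up a register. */
    static entry_t *
    probe_dummy_ignored(entry_t *table, size_t mask, size_t i,
                        hash_t hash, void *key)
    {
        for (;;) {
            entry_t *entry = &table[i & mask];
            if (entry->hash == hash && entry->key == key)
                return entry;         /* found the key */
            if (entry->key == NULL)
                return entry;         /* truly empty slot ends the search */
            i++;
        }
    }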


2.) Avoid the &mask step for computing indices when not needed. Using the same 
technique that is used in set_insert_clean, we test to see if there is a 
possible wrap-around during the linear probes.  If not, we can skip the lengthy 
address calculation for table[(i+j)&mask] and instead use entry++.

This saves the sequentially dependent steps of adding j, the bitwise-and with 
the mask, the shift left by four to scale the index, and the add of the table 
start address.  It replaces those steps with a single entry++, which compiles 
to an add of $16.
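
As a rough sketch of (2), again with simplified stand-in types rather than the 
real setentry, the wrap-around test and the entry++ fast path look roughly 
like this:

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins (sketch only, not the real CPython types). */
    typedef intptr_t hash_t;
    typedef struct { void *key; hash_t hash; } entry_t;

    /* Small fixed probe run, in the spirit of CPython's LINEAR_PROBES. */
    #define LINEAR_PROBES 9

    /* Probe the next LINEAR_PROBES slots after index i looking for `key`.
       When the whole run fits below the end of the table, each successive
       slot address is just entry++ (a single add of sizeof(entry_t)); only
       a run that could wrap past the end pays for the full
       table[(i + j) & mask] index calculation. */
    static entry_t *
    probe_run(entry_t *table, size_t mask, size_t i, hash_t hash, void *key)
    {
        size_t j;
        entry_t *entry = &table[i];

        if (i + LINEAR_PROBES <= mask) {
            /* No wrap-around possible: cheap pointer increment per slot. */
            for (j = 0; j < LINEAR_PROBES; j++) {
                entry++;
                if (entry->hash == hash && entry->key == key)
                    return entry;
            }
        }
        else {
            /* The run could cross the table end: recompute the index. */
            for (j = 1; j <= LINEAR_PROBES; j++) {
                entry = &table[(i + j) & mask];
                if (entry->hash == hash && entry->key == key)
                    return entry;
            }
        }
        return NULL;   /* not found in this run; caller perturbs and retries */
    }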

3.)  In the linear-probe inner loop for the ignore-dummy case, replace 
(entry->key==NULL) with (entry->hash==0 && entry->key==NULL).   While that 
looks like a step backward (it adds an extra test), the second test is 
essentially free (a register compare and branch that is nearly 100% 
predictable).   The benefit is that the inner loop only needs to fetch 
entry->hash and doesn't need to fetch entry->key.   That doesn't sound like 
much, but it cuts the number of memory accesses per loop in half.
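
A minimal sketch of that inner-loop shape for (3), once more with simplified 
stand-in types (the real code works on setentry, and a hash match only makes 
the entry a candidate that still needs a key comparison):

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins (sketch only). */
    typedef intptr_t hash_t;
    typedef struct { void *key; hash_t hash; } entry_t;

    /* Inner loop for the ignore-dummy case.  The empty-slot test reads only
       entry->hash; entry->key is touched just once, on the rare iteration
       where the hash really is zero.  Every other iteration makes a single
       memory access (the hash) instead of two (the hash and the key). */
    static entry_t *
    probe_hash_only(entry_t *entry, entry_t *limit, hash_t hash)
    {
        while (entry <= limit) {
            if (entry->hash == hash)
                return entry;               /* candidate: caller compares keys */
            if (entry->hash == 0 && entry->key == NULL)
                return NULL;                /* genuinely empty slot: stop */
            entry++;
        }
        return NULL;                        /* reached the probe limit */
    }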

Individually, the three transformations don't seem like much, but they take 
already tight code and cut the work by more than half (for the L1 and L2 
cases).   Collectively, they free a couple of registers for better register 
allocation, reduce the number of compares inside the loop, substantially 
reduce the time spent on the entry address calculation, and, in the 
ignore-dummy case, cut the number of memory accesses per loop in half.

The resulting code is astonishingly tight compared to the original and looks 
almost as tight as the code in set_insert_clean().


--- inner-loop of set_lookkey_dummy_allowed() ---
L393:
    cmpq    %r8, (%rbx)        entry->key == dummy
    je  L453
    cmpq    %r12, %rbx         entry == limit
    movq    %rbx, %r14
    je  L506
L412:
    movq    16(%r14), %rbp
    leaq    16(%r14), %rbx
    testq   %rbp, %rbp         entry->key NULL
    je  L449
    cmpq    8(%rbx), %r15      entry->hash == hash
    jne L393


--- inner-loop of set_lookkey_dummy_ignored() ---
L846:
    cmpq    %r13, %rbx         entry == limit
    je  L942
    addq    $16, %rbx          entry++
    movq    8(%rbx), %rax      entry->hash == 0
    testq   %rax, %rax
    jne L845
    cmpq    $0, (%rbx)         entry->key == NULL (not in the inner loop)
    je  L905
L845:
    cmpq    %r12, %rax         entry->hash == tgt
    jne L846


Before anything can go forward, more timings are needed and I need to come to 
terms with the code expanding back to the size it was before the old lookkey 
specializations were removed.  For now, I just want to document the work that 
has been done.

Anyway, here it is if someone wants to experiment with it.  Expect benefits in 
the cases where the inner loop matters (small-to-medium-sized sets that have 
collisions -- giant sets benefit as well, but their lookup speed is dominated 
by the cost of fetching a single random cache line).

----------
files: hoisted_and_specialized.diff
keywords: patch
messages: 235131
nosy: rhettinger
priority: low
severity: normal
stage: patch review
status: open
title: Speed-up set_lookkey()
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file37945/hoisted_and_specialized.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23359>
_______________________________________