Raymond Hettinger added the comment: Reopening this with a better patch that nicely tightens the inner loop without an algorithmic change or any downside.
Currently, the entry computation adds a constant (i+j), masks it (& mask) to wrap on overflow, scales it to a table index (<< 4), and then does the key lookup (one memory access). These steps are sequential (not parallelizable). The patch moves the wrap-on-overflow decision out of the inner loop, adding a single, fast, predictable branch (i + LINEAR_PROBES > mask). The inner loop then simplifies to an add, fetch, and test (no masking and shifting).

The generated code on Clang is very tight:

    LBB29_26:
        cmpq    $0, (%rsi)      # entry->key == NULL
        je      LBB29_30
        incq    %rdx            # j++
        addq    $16, %rsi       # entry++ (done in parallel with the incq)
        cmpq    $9, %rdx        # j <= LINEAR_PROBES
        jbe     LBB29_26

On gcc-4.9, the loop is unrolled and we get even better code:

        leaq    (%rbx,%rsi), %rdx    # entry = table[i]
        cmpq    $0, (%rdx)           # entry->key == NULL
        je      L223
        leaq    16(%rbx,%rsi), %rdx  # entry = table[i+1]
        cmpq    $0, (%rdx)           # entry->key == NULL
        je      L223
        leaq    32(%rbx,%rsi), %rdx  # entry = table[i+2]
        cmpq    $0, (%rdx)
        je      L223

----------
resolution: rejected -> 
status: closed -> open
Added file: http://bugs.python.org/file37857/tight2.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23269>
_______________________________________