Mark Dickinson <dicki...@gmail.com> added the comment:

> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may 
> well depend on platform.

I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using 
timeit to time calls to a function that looked like this:

def f(comb):
    for k in range(68):
        for _ in range(256):
            comb(k, 67)
            comb(k, 67)
            ... # 64 repetitions of comb(k, 67) in all

Based on 200 timings of this script with each of the popcount approach and the 
uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount 
approach won, but just barely, at around 1.3% faster. The result was 
statistically significant (SciPy gave me a result of 
Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)).

Interestingly, the default build on macOS/Intel is _not_ using the dedicated 
POPCNT instruction that arrived with the Nehalem architecture, presumably 
because it wants to produce builds that will still be useable on pre-Nehalem 
machines. It uses Clang's __builtin_popcount, but that gets translated to the 
same SIMD-within-a-register approach that we have already in pycore_bitutils.h.

If I recompile with -msse4.2, then the POPCNT instruction *is* used, and I get 
an even more marginal improvement: a 1.7% speedup over the lookup-table-based 
version.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37295>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to