Mark Dickinson <[email protected]> added the comment:
> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may
> well depend on platform.
I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using
timeit to time calls to a function that looked like this:
def f(comb):
for k in range(68):
for _ in range(256):
comb(k, 67)
comb(k, 67)
... # 64 repetitions of comb(k, 67) in all
Based on 200 timings of this script with each of the popcount approach and the
uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount
approach won, but just barely, at around 1.3% faster. The result was
statistically significant (SciPy gave me a result of
Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)).
Interestingly, the default build on macOS/Intel is _not_ using the dedicated
POPCNT instruction that arrived with the Nehalem architecture, presumably
because it wants to produce builds that will still be useable on pre-Nehalem
machines. It uses Clang's __builtin_popcount, but that gets translated to the
same SIMD-within-a-register approach that we have already in pycore_bitutils.h.
If I recompile with -msse4.2, then the POPCNT instruction *is* used, and I get
an even more marginal improvement: a 1.7% speedup over the lookup-table-based
version.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37295>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com