Mark Dickinson <dicki...@gmail.com> added the comment:
> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may > well depend on platform. I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using timeit to time calls to a function that looked like this: def f(comb): for k in range(68): for _ in range(256): comb(k, 67) comb(k, 67) ... # 64 repetitions of comb(k, 67) in all Based on 200 timings of this script with each of the popcount approach and the uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount approach won, but just barely, at around 1.3% faster. The result was statistically significant (SciPy gave me a result of Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)). Interestingly, the default build on macOS/Intel is _not_ using the dedicated POPCNT instruction that arrived with the Nehalem architecture, presumably because it wants to produce builds that will still be useable on pre-Nehalem machines. It uses Clang's __builtin_popcount, but that gets translated to the same SIMD-within-a-register approach that we have already in pycore_bitutils.h. If I recompile with -msse4.2, then the POPCNT instruction *is* used, and I get an even more marginal improvement: a 1.7% speedup over the lookup-table-based version. ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue37295> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com