Mark Dickinson <[email protected]> added the comment:
Thanks Tim for spotting the stupid mistake. The reworked timings are a bit more
... plausible.
tl;dr: On my machine, Raymond's suggestion gives a 2.2% speedup in the case
where POPCNT is not available, and a 0.45% slowdown in the case that it _is_
available. Given that, and the fact that a single-instruction population count
is not as readily available as I thought it was, I'd be happy to change the
implementation to use the trailing zero counts as suggested.
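
For readers following along, here is a minimal sketch of why the two approaches are interchangeable. It assumes the quantity in question is the power of two dividing comb(n, k). By Legendre's formula, the number of trailing zero bits of n! is n - popcount(n), so the same shift can be obtained either from three population counts or from three trailing-zero counts of factorials (which a C implementation could read from a small table). The helper names below are hypothetical and are not taken from the patch.

```python
import math

def popcount(n):
    # Number of set bits in a non-negative integer.
    return bin(n).count("1")

def trailing_zeros(n):
    # Number of trailing zero bits of a positive integer.
    return (n & -n).bit_length() - 1

def comb_shift_popcount(n, k):
    # Power of two dividing comb(n, k), via population counts.
    return popcount(k) + popcount(n - k) - popcount(n)

def comb_shift_trailing_zeros(n, k):
    # Same quantity, via trailing-zero counts of the three factorials.
    # Computed directly here; a table lookup would replace these calls.
    return (trailing_zeros(math.factorial(n))
            - trailing_zeros(math.factorial(k))
            - trailing_zeros(math.factorial(n - k)))

if __name__ == "__main__":
    for n in range(68):
        for k in range(n + 1):
            expected = trailing_zeros(math.comb(n, k))
            assert comb_shift_popcount(n, k) == expected
            assert comb_shift_trailing_zeros(n, k) == expected
    print("both methods agree for all 0 <= k <= n < 68")
```

The upper bound of 68 in the check is an arbitrary choice for illustration.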
I'll attach the scripts I used for timing and analysis. There are two of them:
"timecomb.py" produces a single timing. "driver.py" repeatedly switches
branches, re-runs make, runs "timecomb.py", then assembles the results.
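
As a rough illustration only, a script that produces a single timing might look like the sketch below. This is not the attached "timecomb.py"; the inputs (n=60, k=30), the loop counts, and the function name are all assumptions chosen for the example.

```python
import math
import timeit

def time_comb_ns(n=60, k=30, number=100_000, repeat=5):
    # Time math.comb(n, k) and return the best per-call time in nanoseconds.
    timer = timeit.Timer(
        "comb(n, k)",
        globals={"comb": math.comb, "n": n, "k": k},
    )
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9

if __name__ == "__main__":
    print(f"{time_comb_ns():.3f}ns")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; a driver would then collect many such timings per branch.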
I ran the driver.py script twice: once with a regular `./configure` step, and
once with `./configure CFLAGS="-march=haswell"`. Below, "base" refers to the
code currently in master; "alt" is the branch with Raymond's suggested change
on it.
Output from the script for the normal ./configure:
Mean time for base: 40.130ns
Mean for alt: 39.268ns
Speedup: 2.19%
Ttest_indResult(statistic=7.9929245698581415, pvalue=1.4418376402220854e-14)
Output for CFLAGS="-march=haswell":
Mean time for base: 39.612ns
Mean for alt: 39.791ns
Speedup: -0.45%
Ttest_indResult(statistic=-6.75385578636895, pvalue=5.119724894191512e-11)
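
The summary figures above can be reproduced in outline as follows. This is a sketch under assumptions, not the attached analysis script: the speedup is taken to be mean(base) / mean(alt) - 1, which matches the reported percentages, and the t statistic is the pooled-variance (Student) form that scipy.stats.ttest_ind computes by default. The "Ttest_indResult" lines suggest SciPy was used; the version below sticks to the standard library and does not compute the p-value.

```python
import math
import statistics

def speedup_percent(base, alt):
    # Positive when "alt" is faster than "base" (assumed definition).
    return 100.0 * (statistics.fmean(base) / statistics.fmean(alt) - 1.0)

def student_t(base, alt):
    # Two-sample t statistic with pooled variance (equal variances assumed).
    n1, n2 = len(base), len(alt)
    v1, v2 = statistics.variance(base), statistics.variance(alt)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    diff = statistics.fmean(base) - statistics.fmean(alt)
    return diff / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

if __name__ == "__main__":
    # Made-up timings in nanoseconds, purely for illustration.
    base = [40.2, 40.1, 40.0, 40.3]
    alt = [39.3, 39.2, 39.4, 39.1]
    print(f"Speedup: {speedup_percent(base, alt):.2f}%")
    print(f"t statistic: {student_t(base, alt):.3f}")
```

A positive statistic means "base" was slower on average, which is the sign convention seen in the two result lines above.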
----------
Added file: https://bugs.python.org/file50530/timecomb.py
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37295>
_______________________________________