[Gregory P. Smith <g...@krypto.org>]
> ...
> That only appears true in default boring -O2 builds.  Use
> `./configure --enable-optimizations` and the C version is *much* faster
> than your asm one...
>
> 250ns for C vs 370ns for your asm divl one using old gcc 9.3 on my
> zen3 when compiled using --enable-optimizations.

Something is missing here, but can't guess what without seeing the
generated machine code. But I trust Mark will do that.

> tested using ` -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 
> 10**999, r' 'x//10' `
>
> Use of __asm__ appears to have prevented the compiler from being able to fully
> optimize that code in PGO/FDO mode.
>
> I trust the compiler toolchain to know what's close to best here.

Mark is extremely experienced in numeric programming, and is the most
capable numerical analyst contributing to CPython. That's why he
trusts nothing ;-)


> ...  there's probably interesting ways to optimize bignum division using
> opencl and vector hardware as well - for the case when you know you've
> got over a dozen digits; but that's what numpy et. al. are for.

numpy has no support for bigints, beyond allowing the use of PyObject*
in numpy arrays (dtype='object'), in which case _all_ the bigint math
is performed by the hosting Python implementation.
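
For example (a minimal illustration - the array just holds pointers to
ordinary Python ints):

    import numpy as np

    # dtype='object' stores PyObject* pointers, so each element is a plain
    # Python int and every arithmetic op is delegated back to CPython's
    # own bigint code, one element at a time.
    a = np.array([10**100, 2**300], dtype=object)
    print(type(a[0]))   # <class 'int'>
    print(a // 10)      # element-wise floor division via int.__floordiv__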

People dead serious about bigint speed use the GMP library, which is a
massive amount of code and 100% devoted to peak speed. In Python,
gmpy2 supplies Python bindings for the GMP/MPIR, MPFR, and MPC
libraries.
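
A minimal sketch, assuming gmpy2 is installed (`pip install gmpy2`):

    from gmpy2 import mpz

    x = mpz(10) ** 1000      # GMP-backed integer
    q, r = divmod(x, 10)     # the division runs in GMP's hand-tuned code
    assert q == mpz(10) ** 999 and r == 0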

> Bignums in python are a convenience. Most values normal code deals
> with are less than 2**100.

There's very little that "most normal code" uses. Not bigints, not
HTML, not regexps, not sockets, on & on & on. Whichever bubbles you
live in are nevertheless bubbles ;-)

The speed of + - * // imposes fundamental limits on the speed of almost
all bigint algorithms. A reasonably quick win for bigint // would be
most welcome.

Anyone dead serious about working with bigints in Python "should"
install gmpy2. But it's still fun to prototype work in plain old
Python.

Note: I added an implementation of Karatsuba multiplication to CPython
two decades ago (which gives a major O() improvement over "schoolbook"
multiplication). I've often said that was a mistake, which I wouldn't
make again. Because it added relative mountains of new code to
longobject.c, and people who need fast bigint * should _still_ look to
GMP anyway (which implements several additional, even more gonzo, *
algorithms).

But neither do I want to remove it. It works fine, and the speed boost
is welcome (in my bubbles, multi-thousand bit ints are common).
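
For the flavor of it, here's a toy sketch of the Karatsuba idea in plain
Python (splitting by bits rather than by CPython "digits"; the real thing
in longobject.c is far more careful):

    def karatsuba(x, y, cutoff=64):
        # Nonnegative ints only. Three recursive multiplies instead of four
        # gives O(n**1.585) instead of the schoolbook O(n**2).
        if x >> cutoff == 0 or y >> cutoff == 0:
            return x * y
        n = max(x.bit_length(), y.bit_length()) // 2
        xh, xl = x >> n, x & ((1 << n) - 1)
        yh, yl = y >> n, y & ((1 << n) - 1)
        a = karatsuba(xh, yh)
        b = karatsuba(xl, yl)
        c = karatsuba(xh + xl, yh + yl) - a - b   # == xh*yl + xl*yh
        return (a << (2 * n)) + (c << n) + b

    assert karatsuba(10**500 + 3, 10**499 + 7) == (10**500 + 3) * (10**499 + 7)

Of course, since CPython's * already does this under the covers, the toy
version is for illustration only - it's much slower than just writing x*y.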

What Mark is looking at here has a _very_ much more limited scope: the
two (I think) places in longobject.c that are dividing two native-size
machine ints in time-critical loops. "But nobody divides big ints
anyway - why would they?" For example, if you're investigating crypto
schemes, modular exponentiation with multi-thousand bit ints can be
common, and under the covers, CPython uses the same code for both //
and %. One modular exponentiation has to do a * and % for each bit
position in the exponent, plus more proportional to the number of
bits set in the exponent.
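
In plain Python, the classic square-and-multiply loop looks roughly like
this (CPython's three-argument pow() does the same kind of thing in C,
with extra tricks for very large exponents):

    def modpow(base, exp, mod):
        # One squaring and one % per bit of the exponent, plus an extra
        # multiply-and-% for each bit that is set - so the speed of * and %
        # is what everything else sits on top of.
        result, base = 1, base % mod
        while exp:
            if exp & 1:
                result = result * base % mod
            base = base * base % mod
            exp >>= 1
        return result

    assert modpow(3, 12345, 2**3000 + 7) == pow(3, 12345, 2**3000 + 7)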

About trusting "the toolchain", here under Windows on a capable-enough
desktop box (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 MHz, 4
Core(s), 8 Logical Processor(s)), using the released 3.10.1:

$ python --version
Python 3.10.1

$ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"
1500000 loops, best of 5: 376 nsec per loop

Which actually makes little sense to me. 10**1000 requires 111 CPython
"digits", which is the number of times the loop in `inplace_divrem1()`
has to go around. Under 4 nsec per iteration seems close to impossibly
fast on a 3.8GHz box, given the presence of any division instruction.
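
Back-of-the-envelope, using the 376 nsec figure and assuming the box is
running around 3.8 GHz under turbo:

    x = 10**1000
    ndigits = -(-x.bit_length() // 30)   # ceil(3322 / 30) == 111 30-bit digits
    print(ndigits)                        # 111
    print(376 / ndigits)                  # ~3.4 nsec per loop iteration
    print(3.8 * 376 / ndigits)            # ~13 CPU cycles per iteration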

However, dividing by 10 is not a worst case on this box. Dividing by
100 is over 3x slower:

$ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"
1500000 loops, best of 5: 1.25 usec per loop

Dividing by the largest single Python "digit" is about the same:

$ python -m timeit -n 1500000 -s "x = 10**1000; d=2**30-1" "x//d"
1500000 loops, best of 5: 1.29 usec per loop

In fact, much the same story for dividing by 1, 2, 3, 4, 5, 6, 7, 8,
9, and 11. Got bored then ;-) They're _all_ much slower than dividing
by 10 in this case.

Trust nothing :-)