[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

Gregory P. Smith Fri, 14 Jan 2022 16:57:32 -0800

On Fri, Jan 14, 2022 at 9:50 AM Mark Dickinson <dicki...@gmail.com> wrote:


> On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dicki...@gmail.com> wrote:
>
>> Division may still be problematic.
>>
>
> On that note: Python divisions are somewhat crippled even on x64. Assuming
> 30-bit digits, the basic building block that's needed for multi-precision
> division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit
> quotient (and ideally also a 32-bit remainder). And there's an x86/x64
> instruction that does exactly that, namely DIVL. But without using inline
> assembly, current versions of GCC and Clang apparently can't be persuaded
> to emit that instruction from the longobject.c source - they'll use DIVQ (a
> 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set
> to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.
>
> I was curious to find out what the potential impact of the failure to use
> DIVL was, so I ran some timings. A worst-case target is division of a large
> (multi-digit) integer by a single-digit integer (where "digit" means digit
> in the sense of PyLong digit, not decimal digit), since that involves
> multiple CPU division instructions in a fairly tight loop.
>
> Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6,
> non-optimised non-debug Python build), a single division of 10**1000 by 10
> takes ~1018ns on the current main branch and ~722ns when forced to use the
> DIVL instruction (by inserting inline assembly into the inplace_divrem1
> function). IOW, forcing use of DIVL instead of DIVQ, in combination
> with getting the remainder directly from the DIV instruction instead of
> computing it separately, gives a 41% speedup in this particular worst case.
> I'd expect the effect to be even more marked on x86, but haven't yet done
> those timings.
>
> For anyone who wants to play along, here's the implementation of the
> inplace_divrem1 (in longobject.c) that I was using:
>
> static digit
> inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
> {
> digit remainder = 0;
>
> assert(n > 0 && n <= PyLong_MASK);
> while (--size >= 0) {
> twodigits dividend = ((twodigits)remainder << PyLong_SHIFT) | pin[size];
> digit quotient, high, low;
> high = (digit)(dividend >> 32);
> low = (digit)dividend;
> __asm__("divl %2\n"
> : "=a" (quotient), "=d" (remainder)
> : "r" (n), "a" (low), "d" (high)
> );
> pout[size] = quotient;
> }
> return remainder;
> }
>
>
> I don't know whether we *really* want to open the door to using inline
> assembly for performance reasons in longobject.c, but it's interesting to
> see the effect.
>

That only appears true in default boring -O2 builds.  Use `./configure
--enable-optimizations` and the C version is *much* faster than your asm
one...

250ns for C vs 370ns for your asm divl one using old gcc 9.3 on my zen3
when compiled using --enable-optimizations.

tested using ` -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r ==
10**999, r' 'x//10' `

Use of __asm__ appears to have prevented the compiler from being able to
fully optimize that code in PGO/FDO mode.

I trust the compiler toolchain to know what's close to best here.
Optimizing use of a div instruction isn't entirely straight forward as on
many microarchitectures the time required varies based on the inputs as
they'll internally implement looping when values exceed the bits with which
their hw operates.  there's probably interesting ways to optimize bignum
division using opencl and vector hardware as well - for the case when you
know you've got over a dozen digits; but that's what numpy et. al. are for.
Bignums in python are a convenience. Most values normal code deals with are
less than 2**100.

-Greg


>
> --
> Mark
>
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/ZWGPO3TMCI7WNLC3EMS26DIKI5D3ZWMK/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WAZGSHPS7LBN4QNYVVSUG2RC26322L5D/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

Reply via email to