On Sat, Jan 15, 2022 at 8:12 PM Tim Peters <tim.pet...@gmail.com> wrote:

> Something is missing here, but can't guess what without seeing the
> generated machine code.But I trust Mark will do that.
>

Welp, there goes my weekend. :-)

 $ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"

1500000 loops, best of 5: 376 nsec per loop
>
> Which actually makes little sense to me. [...] Under 4 nsec per iteration
> seems

close to impossibly fast on a 3.8GHz box, given the presence of any
> division instruction.



<snip>



However, dividing by 10 is not a worst case on this box. Dividing by
> 100 is over 3x slower:
>
> $ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"
> 1500000 loops, best of 5: 1.25 usec per loop


Now *that* I certainly wasn't expecting. I don't see the same effect on
macOS / Clang, whether compiling with --enable-optimizations or not; this
appears to be a GCC innovation. And indeed, as Tim suggested, it turns out
that there's no division instruction present in the loop for the
division-by-10 case - we're doing division via multiplication by the
reciprocal. In Python terms, we're computing `x // 10` as `(x *
0xcccccccccccccccd) >> 67`. Here's the tell-tale snippet of the assembly
output from the second compilation (the one that makes use of the generated
profile information) of longobject.c at commit
09087b8519316608b85131ee7455b664c00c38d2
<https://github.com/python/cpython/blob/09087b8519316608b85131ee7455b664c00c38d2/Objects/longobject.c>
on
a Linux box, with GCC 11.2.0. I added a couple of comments, but it's
otherwise unaltered

.loc 1 1632 36 view .LVU12309
movl %r13d, %r11d
salq $2, %rbp
cmpl $10, %r13d # compare divisor 'n' with 10, and
jne .L2797  # go to the slow version if n != 10
leaq 1(%r10), %r9 # from here on, the divisor is 10
addq %rbp, %r8
.LVL3442:
.loc 1 1632 36 view .LVU12310
addq %rbp, %rdi
.LVL3443:
.loc 1 1632 36 view .LVU12311
.LBE8049:
.loc 1 1624 15 view .LVU12312
xorl %r13d, %r13d
.LVL3444:
.loc 1 1624 15 view .LVU12313
movabsq $-3689348814741910323, %r11 # magic constant 0xcccccccccccccccd for
division by 10

and then a few lines later:

.loc 1 1630 9 is_stmt 1 view .LVU12316
.loc 1 1631 9 view .LVU12317
.loc 1 1631 39 is_stmt 0 view .LVU12318
movl (%r8,%r10,4), %r14d # move top digit of divisor into the low word of
r14
.LVL3446:
.loc 1 1632 9 is_stmt 1 view .LVU12319
movq %r14, %rax # set up for division: top digit is now in rax
.loc 1 1633 13 is_stmt 0 view .LVU12320
movq %r14, %r13
mulq %r11 # here's the division by 10: multiply by the magic constant
shrq $3, %rdx # and divide by 8 (via a shift)

and then it all gets a bit repetitive and boring - there's a lot of loop
unrolling going on.

So gcc is anticipating divisions by 10 and introducing special-case
divide-by-reciprocal-multiply code for that case, and presumably the
profile generated for the PGO backs up this being a common enough case, so
we end up with the above code in the final compilation.

TIL ...

-- 
Mark
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/VDII5EBMXLNO4U3BSSNWAW2ETLNG6YUN/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to