https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100331

            Bug ID: 100331
           Summary: 128 bit arithmetic --- suboptimal after shifting when
                    referencing other variables
           Product: gcc
           Version: 9.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zero at smallinteger dot com
  Target Milestone: ---

Created attachment 50706
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50706&action=edit
Reproduction test case
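
(The attachment is not reproduced inline.  Judging from the assembly listings
below, the test case is presumably along the lines of the following sketch;
the type alias and parameter names are illustrative, not taken from the
attachment.)

typedef unsigned __int128 u128;

u128 compute(unsigned long amount, unsigned long value)
{
    u128 result = (u128)value >> (amount & 63);
#if 1                       /* flip to #if 0 to select the other branch */
    result += amount & 63;  /* this addition is what degrades the shift code */
#endif
    return result;
}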

Compile the given C program with -O2.  Enabling the #if 1 branch results in:

compute(unsigned long, unsigned long):
        mov     ecx, edi
        xor     edx, edx
        mov     rax, rsi
        xor     esi, esi
        and     ecx, 63
        shrd    rax, rdx, cl
        shr     rdx, cl
        test    cl, 64
        mov     r8d, ecx
        cmovne  rax, rdx
        cmovne  rdx, rsi
        and     r8d, 63
        mov     rsi, rax
        mov     rax, r8
        mov     rdi, rdx
        xor     edx, edx
        add     rax, rsi
        adc     rdx, rdi
        ret

Note that the test cl, 64 and the subsequent cmovs are unnecessary: after
and ecx, 63 the shift count is at most 63, so bit 6 of cl is always clear and
the result of the test is already known.  Note also mov r8d, ecx followed by
and r8d, 63, which redoes the masking that and ecx, 63 already performed.

Enabling the #if 0 branch results in this code instead.

compute(unsigned long, unsigned long):
        mov     rcx, rdi
        xor     edx, edx
        mov     rax, rsi
        shrd    rax, rdx, cl
        shr     rdx, cl
        ret

That is, gcc now realizes the range of possible values for cl and does not
emit the test, the cmovs, or the redundant and into r8d.  In either case, the
double precision shift is also unnecessary, because the high half of the
128-bit value is zero before the shift, so only the lower 64 bits of the
shift result can be non-zero.
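
Concretely, under those two observations the #if 1 variant reduces to 64-bit
operations plus a single carry, roughly like this hand-written equivalent (a
sketch, not compiler output; it mirrors the clang code further below):

unsigned __int128 compute_equiv(unsigned long amount, unsigned long value)
{
    unsigned long k     = amount & 63;
    unsigned long lo    = value >> k;   /* a plain 64-bit shift suffices */
    unsigned long sum   = lo + k;
    unsigned long carry = sum < lo;     /* only the add can reach the high half */
    return ((unsigned __int128)carry << 64) | sum;
}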

Verified on Ubuntu 20.04 LTS, as well as on Godbolt with gcc 9.3.0, gcc 11.1,
and gcc trunk.  This issue is similar to other reported 128 bit arithmetic
bugs, but unlike those, this one seems to be triggered exclusively by the
addition in the #if 1 branch.

For the sake of comparison, clang trunk emits the following code for the
#if 1 and #if 0 branches respectively, as per Godbolt.

compute(unsigned long, unsigned long):                  # @compute(unsigned long, unsigned long)
        mov     rcx, rdi
        mov     eax, ecx
        shr     rsi, cl
        and     eax, 63
        xor     edx, edx
        add     rax, rsi
        setb    dl
        ret


compute(unsigned long, unsigned long):                  # @compute(unsigned long, unsigned long)
        mov     rax, rsi
        mov     rcx, rdi
        shr     rax, cl
        xor     edx, edx
        ret
