https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100331
            Bug ID: 100331
           Summary: 128 bit arithmetic --- suboptimal after shifting when
                    referencing other variables
           Product: gcc
           Version: 9.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zero at smallinteger dot com
  Target Milestone: ---

Created attachment 50706
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50706&action=edit
Reproduction test case

Compile the given C program with -O2. Enabling the #if 1 branch results in:

compute(unsigned long, unsigned long):
        mov     ecx, edi
        xor     edx, edx
        mov     rax, rsi
        xor     esi, esi
        and     ecx, 63
        shrd    rax, rdx, cl
        shr     rdx, cl
        test    cl, 64
        mov     r8d, ecx
        cmovne  rax, rdx
        cmovne  rdx, rsi
        and     r8d, 63
        mov     rsi, rax
        mov     rax, r8
        mov     rdi, rdx
        xor     edx, edx
        add     rax, rsi
        adc     rdx, rdi
        ret

Note that test cl, 64 and the subsequent cmovs are unnecessary: the result of the test is already known after and ecx, 63. Note also mov r8d, ecx followed by and r8d, 63, which redoes the masking already performed on ecx.

Enabling the #if 0 branch results in this code instead:

compute(unsigned long, unsigned long):
        mov     rcx, rdi
        xor     edx, edx
        mov     rax, rsi
        shrd    rax, rdx, cl
        shr     rdx, cl
        ret

That is, gcc now realizes the range of possible values for cl and does not emit the test, the cmovs, or the redundant and on r8d. Either way, the double precision shift is also unnecessary, because only the lower 64 bits of the result can be non-zero.

Verified on Ubuntu 20.04 LTS, as well as on Godbolt with gcc 9.3.0, gcc 11.1, and gcc trunk. This issue is similar to other reported 128 bit arithmetic bugs, but unlike those, this one seems to be controlled exclusively by the addition in the #if 1 branch.

For the sake of comparison, clang trunk emits the following code for the #if 1 and #if 0 branches, as per Godbolt.
compute(unsigned long, unsigned long):          # @compute(unsigned long, unsigned long)
        mov     rcx, rdi
        mov     eax, ecx
        shr     rsi, cl
        and     eax, 63
        xor     edx, edx
        add     rax, rsi
        setb    dl
        ret

compute(unsigned long, unsigned long):          # @compute(unsigned long, unsigned long)
        mov     rax, rsi
        mov     rcx, rdi
        shr     rax, cl
        xor     edx, edx
        ret