https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96305

            Bug ID: 96305
           Summary: Unnecessary signed x unsigned multiplication with
                    squares or signed variables
           Product: gcc
           Version: 7.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: petr at nejedli dot cz
  Target Milestone: ---

In the presence of a signed variable multiplied by itself, the compiler seems
to recognize that the result is necessarily non-negative and then treats it as
unsigned going forward, which leads to unnecessarily complicated code later on.
I initially reproduced the issue with 7.2.1 for ARM, and I have verified that
the same issue occurs with the latest GCC version available on godbolt
(Compiler Explorer).

---
[nenik@Pix2 ~]$ arm-none-eabi-gcc --version
arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[nenik@Pix2 ~]$ cat mull-issue.c
inline int hmull(int a, int b) {
    return ((long long)a * b) >> 32;
}

int compute(int a, int b) {
    int t = hmull(a,a);
    return hmull(t, b);
}

[nenik@Pix2 ~]$ arm-none-eabi-gcc -Os -S -mcpu=cortex-m3 mull-issue.c 

[nenik@Pix2 ~]$ cat mull-issue.s 
        .cpu cortex-m3
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "mull-issue.c"
        .text
        .align  1
        .global compute
        .syntax unified
        .thumb
        .thumb_func
        .fpu softvfp
        .type   compute, %function
compute:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        smull   r2, r3, r0, r0
        push    {r4, r6, r7, lr}
        asrs    r7, r1, #31
        mul     r0, r3, r7
        asrs    r4, r3, #31
        mla     r0, r1, r4, r0
        umull   r2, r3, r3, r1
        add     r0, r0, r3
        pop     {r4, r6, r7, pc}
        .size   compute, .-compute
        .ident  "GCC: (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]"
---
https://godbolt.org/z/v186Yz


The expected code would be roughly:
        smull   r2, r3, r0, r0
        smull   r2, r0, r3, r1
        bx      lr
The reasoning is simple: after the first smull, r3 (the high word of a*a) is at
most 0x40000000 for any argument, so it is never negative and never has its
highest bit set. r4 after the second asrs is therefore always zero, and so is
the r1 * r4 product inside the following mla. That removes the need for the
umull followed by a fix-up of the result; a single signed smull is enough.
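
The bound above can be checked with a small C sketch. It is illustrative only
and not part of the original test case: square_high() is a hypothetical helper
name, and the fixed-width types are used just to make the 32/64-bit sizes
explicit. The largest square of a 32-bit signed value is (-2^31)^2 = 2^62,
whose high word is 0x40000000, so bit 31 of the high word can never be set.

```c
#include <assert.h>
#include <stdint.h>

/* High 32 bits of the signed 64-bit product, mirroring hmull() from the
 * report. Right-shifting a negative value is implementation-defined in C;
 * GCC uses an arithmetic shift. */
static int32_t hmull(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Hypothetical helper: high word of a*a, with the claimed range asserted.
 * The product a*a is never negative, so the shift here is well defined. */
static int32_t square_high(int32_t a) {
    int32_t t = hmull(a, a);
    assert(t >= 0 && t <= 0x40000000);
    return t;
}
```

For example, square_high(INT32_MIN) yields 0x40000000 and
square_high(INT32_MAX) yields 0x3FFFFFFF, so sign-extending the intermediate
value always produces a zero high word.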

I have got clang to generate optimal code in a more complicated piece of
software. I can also get gcc to generate two smulls (and smaller code overall)
if I add an unknown extra argument (or even a small constant) to the "t"
variable before the second hmull call. However, if I use a constant of zero
and the compiler manages to figure that out, it falls back to the suboptimal
code.
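
A minimal sketch of the variant described above, for anyone who wants to
compare the generated code. The name compute_with_offset and the parameter k
are hypothetical and were not part of the original test case; the
fixed-width types are also an assumption made for clarity. Which instructions
a given GCC version emits for it should be checked with -S rather than
assumed.

```c
#include <stdint.h>

/* High 32 bits of the signed 64-bit product, as in mull-issue.c. */
static inline int32_t hmull(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Original shape from the report: square, then multiply by b. */
int32_t compute(int32_t a, int32_t b) {
    int32_t t = hmull(a, a);
    return hmull(t, b);
}

/* Variant: an unknown offset k (hypothetical parameter) is added to t
 * before the second multiply. Note that t + k can overflow for large k;
 * the sketch assumes callers keep k small. */
int32_t compute_with_offset(int32_t a, int32_t b, int32_t k) {
    int32_t t = hmull(a, a) + k;
    return hmull(t, b);
}
```

With k == 0 both functions return the same value, so the two listings can be
compared directly for code size and instruction selection.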
