[Bug tree-optimization/114545] [11/12/13/14 Regression] Missed optimization for CSE

pinskia at gcc dot gnu.org via Gcc-bugs Mon, 01 Apr 2024 14:18:48 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114545


--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not sure this is worse.

In the GCC 7 case we have:
```
        sub     eax, DWORD PTR a[rip]
        mov     edx, eax
        ...
        neg     edx
```

While in GCC 8+ we get:
```
        movl    %edx, %ecx
        subl    %eax, %ecx
        subl    %edx, %eax
```

In the GCC 8+ case, we have 2 independent subs and still a move. In GCC 7 we
get one sub followed by a move and a dependent neg. The latency of the GCC 8+
sequence will be lower than that of the GCC 7 one, because both subs can
execute at the same time and the mov (which only happens on x86_64) is removed
during register renaming.



aarch64 produces for GCC 8+:
```
        adrp    x1, a
        adrp    x2, c
        adrp    x3, b
        ldr     w0, [x1, #:lo12:a]
        ldr     w2, [x2, #:lo12:c]
        sub     w4, w2, w0
        sub     w0, w0, w2
        str     w4, [x1, #:lo12:a]
        str     w0, [x3, #:lo12:b]
        ret
```

While before:
```
        adrp    x1, a
        adrp    x0, c
        adrp    x2, b
        ldr     w3, [x1, #:lo12:a]
        ldr     w0, [x0, #:lo12:c]
        sub     w0, w0, w3
        str     w0, [x1, #:lo12:a]
        neg     w0, w0
        str     w0, [x2, #:lo12:b]
        ret
```

So the neg will issue alongside the first str, but if you have 2 store units
and 2 ALUs, the GCC 8+ code is better.
So for superscalar cores, what GCC 8+ is doing is better, and even on in-order
cores GCC 8+ will still be better due to the 2 independent instructions.
I think it only really makes a difference at -Os/-Oz on x86_64.
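
For reference, the two code shapes can be sketched in C. The original test
case is not quoted in this comment, so this is only a hypothetical
reconstruction inferred from the aarch64 listings above; the function names
gcc7_shape and gcc8_shape are made up for illustration:
```c
/* Globals named after the symbols referenced in the assembly listings. */
int a, b, c;

/* Assumed shape of the GCC 7 output: one sub, then a neg that has to
   wait for the sub result (a two-operation dependency chain). */
void gcc7_shape(void)
{
    int t = c - a;   /* sub */
    a = t;
    b = -t;          /* neg, depends on the sub above */
}

/* Assumed shape of the GCC 8+ output: two subs that both read only the
   loaded values, so neither depends on the other. */
void gcc8_shape(void)
{
    int old_a = a;
    int old_c = c;
    a = old_c - old_a;   /* first sub */
    b = old_a - old_c;   /* second sub, independent of the first */
}
```
Both functions store the same values to a and b (ignoring inputs where the
subtraction overflows); the only difference is whether the second result is
derived from the first or computed directly from the loads.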

[Bug tree-optimization/114545] [11/12/13/14 Regression] Missed optimization for CSE
