https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114545
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not sure this is worse. In the GCC 7 case we have:
```
sub eax, DWORD PTR a[rip]
mov edx, eax
...
neg edx
```

While in GCC 8+ we get:
```
movl %edx, %ecx
subl %eax, %ecx
subl %edx, %eax
```

In the case of GCC 8, we have 2 independent subs and still a move.
In GCC 7 we get one sub followed by a move and a dependent neg.
The latency for GCC 8+ will be less than for GCC 7 because both subs can execute at the same time and the mov (which only happens on x86_64) is removed during rename.

aarch64 produces for GCC 8+:
```
adrp x1, a
adrp x2, c
adrp x3, b
ldr w0, [x1, #:lo12:a]
ldr w2, [x2, #:lo12:c]
sub w4, w2, w0
sub w0, w0, w2
str w4, [x1, #:lo12:a]
str w0, [x3, #:lo12:b]
ret
```

While before:
```
adrp x1, a
adrp x0, c
adrp x2, b
ldr w3, [x1, #:lo12:a]
ldr w0, [x0, #:lo12:c]
sub w0, w0, w3
str w0, [x1, #:lo12:a]
neg w0, w0
str w0, [x2, #:lo12:b]
ret
```

So the neg will issue with the first str, but if you have 2 store units and 2 ALUs, the GCC 8+ code is better.

So for superscalar cores, what GCC 8+ is doing is better, and even on in-order cores GCC 8+ will still be better due to the 2 independent instructions.
I think only at -Os/-Oz it might make a difference, and really only for x86_64.
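
For readers without the testcase at hand, here is a rough C sketch of the two shapes being compared. It is reconstructed from the aarch64 listings above, not taken from the bug report: the function names are hypothetical, and `uint32_t` is used instead of whatever type the real testcase has, so that the wraparound cases are well defined. A compiler is of course free to turn either function into either form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical globals standing in for a, b and c in the listings. */
uint32_t a, b, c;

/* GCC 7 shape: one sub, then a neg that depends on its result. */
void chain_form(void) {
    uint32_t t = c - a;   /* sub */
    a = t;
    b = 0u - t;           /* neg, must wait for the sub */
}

/* GCC 8+ shape: two subs that only depend on the loaded values. */
void parallel_form(void) {
    uint32_t a0 = a, c0 = c;
    a = c0 - a0;          /* sub */
    b = a0 - c0;          /* independent sub, same value as -(c0 - a0) */
}

/* Run both forms from the same starting state and check they agree. */
void check_forms_agree(uint32_t a0, uint32_t c0) {
    a = a0; c = c0;
    chain_form();
    uint32_t a1 = a, b1 = b;

    a = a0; c = c0;
    parallel_form();
    assert(a == a1 && b == b1);
}
```

Both functions store the same values; the difference is only the dependency chain, which is two operations deep in `chain_form` and one deep in `parallel_form`.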