https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

            Bug ID: 82344
           Summary: [8 Regression] SPEC CPU2006 435.gromacs ~10%
                    performance regression with trunk@250855
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42246
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42246&action=edit
r250854 vs r250855 generated code comparison

Compilation options that affect the regression: "-Ofast -march=core-avx2
-mfpmath=sse"

The regression appeared after r250855, though it looks like this commit is not
at fault by itself but exposes a problem in other stages.

Changes in the 123t.reassoc1 stage lead to slightly different code generation
in the stages that follow it.

The place of interest is in the "inl1130" subroutine (file "innerf.f"). It is
part of a big loop with 9 similar expressions on 4-byte float variables:
---------------
y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = 1.0/sqrt(x9)
---------------

When compiled with "-ffast-math", 1/sqrt is calculated with a "vrsqrtss"
instruction followed by a Newton-Raphson step that uses four "vmulss", one
"vaddss" and two constants, as here (part of r250854 code):
---------------
vrsqrtss xmm12, xmm12, xmm7
vmulss   xmm7,  xmm12, xmm7
vmulss   xmm0,  xmm12, DWORD PTR .LC2[rip]
vmulss   xmm8,  xmm7,  xmm12
vaddss   xmm5,  xmm8,  DWORD PTR .LC1[rip]
vmulss   xmm1,  xmm5,  xmm0
---------------
Input values (x1-x9) are mostly in xmm registers (x2 and x7 are in memory);
output values (y1-y9) are in xmm registers.
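
The arithmetic of the six-instruction listing above can be sketched as follows.
This is only an illustration, not compiler output: the values of the two
constants are an assumption (the report names them only as .LC1 and .LC2; -3.0
and -0.5 are the values that make the sequence a standard Newton-Raphson step),
the hardware estimate is replaced by a hypothetical stand-in with a relative
error of 2**-12, and the code uses double precision rather than 4-byte floats.

```python
import math

# Assumed values; the report only refers to these constants as .LC1 and .LC2.
LC1 = -3.0
LC2 = -0.5

def refine_rsqrt(x, est):
    """One Newton-Raphson step for 1/sqrt(x), in the order of the listing."""
    t = est * x      # vmulss xmm7, xmm12, xmm7
    h = est * LC2    # vmulss xmm0, xmm12, .LC2
    u = t * est      # vmulss xmm8, xmm7,  xmm12
    v = u + LC1      # vaddss xmm5, xmm8,  .LC1
    return v * h     # vmulss xmm1, xmm5,  xmm0

x = 4.0
exact = 1.0 / math.sqrt(x)

# Hypothetical stand-in for vrsqrtss: the exact value with a small relative error.
est = exact * (1.0 + 2.0 ** -12)
refined = refine_rsqrt(x, est)

err_before = abs(est - exact) / exact
err_after = abs(refined - exact) / exact
print(err_before, err_after)
```

Each of the 9 expressions needs the estimate plus several temporaries (t, h, u,
v here) live at the same time, which is why the number of free xmm registers
matters when the scheduler tries to interleave several of these sequences.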

After r250855 the .LC2 constant goes into xmm7 and x7 also goes into an xmm
register. This leaves too few temporary registers and results in worse
instruction interleaving.
See the attached picture with parts of the assembly listings, where
corresponding y=1/sqrt parts are painted the same color.

As a result, these 9 lines of code run about twice as slowly, which leads to a
~10% performance regression for the whole test.

I've made two independent attempts to change code in order to verify the above.

1. To be sure that we lose performance exactly in this part of the loop, I just
pasted ~60 assembly instructions from the previous revision into the new one
(after proper renaming, of course). This restored performance.

2. To be sure that the problem is due to a lack of temporary registers, I moved
the calculation of 1/sqrt for the last line into a function call, like this:
---------------
... in another module, to prevent inlining:
function myrsqrt(x)
  implicit none
  real*4 x
  real*4 myrsqrt
  myrsqrt = 1.0/sqrt(x)
  return
end function myrsqrt

...

y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = myrsqrt(x9)
---------------
Even with the call/ret overhead this also restored performance, since it freed
some registers.
