https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344
Bug ID: 82344 Summary: [8 Regression] SPEC CPU2006 435.gromacs ~10% performance regression with trunk@250855 Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alexander.nesterovskiy at intel dot com Target Milestone: --- Created attachment 42246 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42246&action=edit r250854 vs r250855 generated code comparison Compilation options that affects regression: "-Ofast -march=core-avx2 -mfpmath=sse" Regression happened after r250855 though it looks like this commit is not of guilty by itself but reveals something in other stages. Changes in 123t.reassoc1 stage leads to a bit different code generation during stages that follow it. Place of interest is in "inl1130" subroutine (file "innerf.f") - it's a part of a big loop with 9 similar expressions with 4-byte float variables: --------------- y1 = 1.0/sqrt(x1) y2 = 1.0/sqrt(x2) y3 = 1.0/sqrt(x3) y4 = 1.0/sqrt(x4) y5 = 1.0/sqrt(x5) y6 = 1.0/sqrt(x6) y7 = 1.0/sqrt(x7) y8 = 1.0/sqrt(x8) y9 = 1.0/sqrt(x9) --------------- When compiled with "-ffast-math" 1/sqrt is calculated with "vrsqrtss" instruction followed by Newton-Raphson step with four "vmulss", one "vadss" and two constants used. Like here (part of r250854 code): --------------- vrsqrtss xmm12, xmm12, xmm7 vmulss xmm7, xmm12, xmm7 vmulss xmm0, xmm12, DWORD PTR .LC2[rip] vmulss xmm8, xmm7, xmm12 vaddss xmm5, xmm8, DWORD PTR .LC1[rip] vmulss xmm1, xmm5, xmm0 --------------- Input values (x1-x9) are in xmm registers mostly (x2 and x7 in memory), output values (y1-y9) are in xmm registers. After r250855 .LC2 constant goes into xmm7 and x7 is also goes to xmm register. This leads to lack of temporary registers and worse instructions interleaving as a result. See attached picture with part of assembly listings where corresponding y=1/sqrt parts are painted the same color. Finally these 9 lines of code are executed about twice slower which leads to ~10% performance regression for whole test. I've made two independent attempts to change code in order to verify the above. 1. To be sure that we loose performance exactly in this part of a loop I just pasted ~60 assembly instructions from previous revision to a new one (after proper renaming of course). This helped to restore performance. 2. To be sure that the problem is due to a lack of temporary registers I moved calculation of 1/sqrt for one last line into function call. Like here: --------------- ... in other module to disable inlining: function myrsqrt(x) implicit none real*4 x real*4 myrsqrt myrsqrt = 1.0/sqrt(x); return end function myrsqrt ... y1 = 1.0/sqrt(x1) y2 = 1.0/sqrt(x2) y3 = 1.0/sqrt(x3) y4 = 1.0/sqrt(x4) y5 = 1.0/sqrt(x5) y6 = 1.0/sqrt(x6) y7 = 1.0/sqrt(x7) y8 = 1.0/sqrt(x8) y9 = myrsqrt(x9) --------------- Even with call/ret overhead this also helped to restore performance since it freed some registers.