https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076
Bug ID: 100076
Summary: eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on SKX/CLX
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com
Target Milestone: ---

Refer to https://godbolt.org/z/e3nfz3xvW

cat testcase.c

int
t_run_test(double a)
{
    static double P1, Q1;
    static varsize polyX1[9];

    polyX1[1] = a;
    P1 = (varsize)constantP[0];
    polyX1[1] = a;

    // Loop 1
    for( int i1 = 2 ; i1 <= 8 ; i1++ )
    {
        polyX1[i1] = polyX1[i1 - 1] * polyX1[1] ;
    }

    P1 = (varsize)constantP[0] ;
    // Loop 2
    for( int i1 = 1 ; i1 <= 8 ; i1++ )
    {
        P1 += (varsize)constantP[i1] * polyX1[i1] ;
    }

    Q1 = (varsize)constantQ[0] ;
    // Loop 3
    for( int i1 = 1 ; i1 <= 8 ; i1++ )
    {
        Q1 += (varsize)constantQ[i1] * polyX1[i1] ;
    }

    return a = a * P1 / Q1 ;
}

Loop 1 writes the array polyX1, which Loop 2 and Loop 3 then read. With -O2 -ftree-vectorize, Loop 2 and Loop 3 are vectorized, but Loop 1 is not, because it has a loop-carried dependence (each iteration reads the element written by the previous one). As a result, polyX1 is written with 64-bit stores in Loop 1 and read back with 128-bit loads in Loop 2 and Loop 3. A 128-bit load that spans two separate 64-bit stores cannot be satisfied by store forwarding, so this causes store-forwarding stalls, which hurt performance.
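The testcase above does not compile on its own, because varsize, constantP and constantQ are defined elsewhere in the benchmark. The following is a minimal self-contained sketch of the same three-loop pattern, for anyone who wants to compare the generated code locally. It rests on assumptions that are not in the report: varsize is taken to be double (consistent with the 64-bit stores described), the coefficient tables are made-up values, and the function returns double rather than int so the result can be checked.

```c
/* Assumption: varsize is double, matching the 64-bit stores in the report. */
typedef double varsize;

/* Hypothetical coefficients; the real tables live in the benchmark source. */
static const varsize constantP[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
static const varsize constantQ[9] = {9, 8, 7, 6, 5, 4, 3, 2, 1};

/* Powers of the input: polyX1[i] holds a^i for i = 1..8 (index 0 is unused). */
static varsize polyX1[9];

/* Returns double (not int as in the report) so the value is observable. */
double t_run_test(double a)
{
    varsize P1, Q1;

    polyX1[1] = a;

    /* Loop 1: each iteration reads the element written by the previous one,
       so it stays scalar and issues one 64-bit store per iteration. */
    for (int i1 = 2; i1 <= 8; i1++)
        polyX1[i1] = polyX1[i1 - 1] * polyX1[1];

    /* Loop 2: reads polyX1 right after Loop 1 has written it. */
    P1 = constantP[0];
    for (int i1 = 1; i1 <= 8; i1++)
        P1 += constantP[i1] * polyX1[i1];

    /* Loop 3: same access pattern as Loop 2, with the Q coefficients. */
    Q1 = constantQ[0];
    for (int i1 = 1; i1 <= 8; i1++)
        Q1 += constantQ[i1] * polyX1[i1];

    return a * P1 / Q1;
}
```

Compiling this file once with gcc -O2 -S and once with gcc -O2 -ftree-vectorize -S should make it possible to compare how polyX1 is stored in Loop 1 and loaded in Loops 2 and 3. Whether the sketch reproduces the exact code shown at the godbolt link depends on the compiler version and target options.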