https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Bug ID: 114324
Summary: AVX2 vectorisation performance regression with gfortran 13/14
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: mjr19 at cam dot ac.uk
Target Milestone: ---

Created attachment 57685
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57685&action=edit
Test case of loop showing performance regression

The attached loop, when compiled with "-Ofast -mavx2", runs over 20% slower on gfortran 13 or (pre-release) 14 than it does on 12.x.

Precise versions tested: 12.3.0, 13.1.0 and GCC 14 downloaded on 11th March. Precise slowdown depends on CPU. Tested on Haswell and Kaby Lake desktops.

Adding "-fopenmp" changes the code produced, but 12.3 still beats later compilers. The analysis below is without -fopenmp.

It appears (to me) that 12.x is using the full width of the ymm registers, and has a loop of 17 vector instructions, and some scalar loop control, which performs two iterations of the original Fortran loop.

13.x manages more aggressive unrolling, performing four iterations per pass, but uses about 54 vector instructions, rather than the 34 one might naively expect. More instructions does not necessarily mean slower, but here it does.

I attach the test case to which I refer. I would be happy to add the trivial timing program to show how I have been timing it. The full code is an FFT, but the test case has been reduced to functional nonsense.

(I note that in other areas there are pleasing performance gains in gfortran 13.x. It is a pity that this partially cancels them.)
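
The instruction counts quoted above can be put on a per-iteration basis with a little arithmetic. The sketch below is only a sanity check of those figures, not a measurement: the counts 17 and 54 are the approximate values read from the generated assembly as described in this report, and the function name `vector_instructions_per_iteration` is made up for illustration. It says nothing about cycle counts, which depend on the CPU.

```python
def vector_instructions_per_iteration(vector_instructions, iterations_per_pass):
    """Vector instructions executed per iteration of the original Fortran loop."""
    return vector_instructions / iterations_per_pass


# Figures taken from the report (the 13.x count is "about 54", so approximate).
gcc12_per_iter = vector_instructions_per_iteration(17, 2)
gcc13_per_iter = vector_instructions_per_iteration(54, 4)

# Naive expectation for 13.x: doubling the unrolling doubles the 12.x loop body.
naive_gcc13_count = 2 * 17

# Relative growth in vector instructions per original loop iteration.
relative_increase = gcc13_per_iter / gcc12_per_iter - 1.0

print(f"12.x: {gcc12_per_iter} vector instructions per iteration")
print(f"13.x: {gcc13_per_iter} vector instructions per iteration")
print(f"naive 13.x loop body: {naive_gcc13_count} vector instructions")
print(f"increase per iteration: {relative_increase:.0%}")
```

On these figures 13.x issues 13.5 vector instructions per original iteration against 8.5 for 12.x, an increase of roughly 59%. That is the same direction as the observed slowdown of over 20%, though instruction count alone does not determine run time.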