https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119046
Bug ID: 119046
Summary: [15 Regression] Performance drop from not forming
lane-wise FMLAs with Eigen library
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Created attachment 60603
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60603&action=edit
Reproducer for aarch64
Unfortunately I couldn't reduce this to a smaller example, but I'm attaching a
small benchmark that builds against the Eigen library to reproduce the issue.
You'll need the template-only Eigen library from
https://gitlab.com/libeigen/eigen checked out.
On aarch64 you can build the benchmark with:
g++ -I../eigen -O3 -mcpu=neoverse-v2 benchmark.cpp
Running the resulting binary should print a GFLOPS number (higher is better).
Building the benchmark with GCC 15 gives a number about 20% lower than with
GCC 14.
The codegen difference is down to GCC 14 producing this in the critical GEMM
loop:
ldp q29, q9, [x1]
ldp q11, q12, [x0]
ldr q13, [x0, 32]
fmla v3.4s, v13.4s, v29.s[0]
fmla v26.4s, v11.4s, v29.s[0]
fmla v27.4s, v11.4s, v29.s[1]
fmla v28.4s, v11.4s, v29.s[2]
fmla v14.4s, v11.4s, v29.s[3]
fmla v15.4s, v12.4s, v29.s[0]
fmla v0.4s, v12.4s, v29.s[1]
fmla v1.4s, v12.4s, v29.s[2]
fmla v2.4s, v12.4s, v29.s[3]
fmla v4.4s, v13.4s, v29.s[1]
fmla v5.4s, v13.4s, v29.s[2]
fmla v7.4s, v13.4s, v29.s[3]
mov v29.16b, v10.16b
fmla v16.4s, v11.4s, v9.s[0]
fmla v17.4s, v12.4s, v9.s[0]
fmla v19.4s, v11.4s, v9.s[1]
fmla v20.4s, v12.4s, v9.s[1]
fmla v22.4s, v11.4s, v9.s[2]
fmla v23.4s, v12.4s, v9.s[2]
fmla v25.4s, v11.4s, v9.s[3]
fmla v18.4s, v13.4s, v9.s[0]
fmla v21.4s, v13.4s, v9.s[1]
fmla v24.4s, v13.4s, v9.s[2]
fmla v29.4s, v12.4s, v9.s[3]
fmla v31.4s, v13.4s, v9.s[3]
whereas GCC 15 emits extra lane-dup instructions:
ldp q5, q6, [x1]
ldp q2, q3, [x0]
ldr q4, [x0, 32]
dup v1.4s, v5.s[1]
fmla v29.4s, v4.4s, v5.s[0]
fmla v30.4s, v2.4s, v5.s[0]
fmla v28.4s, v3.4s, v5.s[0]
fmla v7.4s, v2.4s, v1.4s
fmla v8.4s, v3.4s, v1.4s
fmla v9.4s, v4.4s, v1.4s
dup v1.4s, v5.s[2]
dup v5.4s, v5.s[3]
fmla v17.4s, v2.4s, v6.s[0]
fmla v14.4s, v2.4s, v5.4s
fmla v15.4s, v3.4s, v5.4s
fmla v16.4s, v4.4s, v5.4s
dup v5.4s, v6.s[1]
fmla v18.4s, v3.4s, v6.s[0]
fmla v19.4s, v4.4s, v6.s[0]
fmla v20.4s, v2.4s, v5.4s
fmla v21.4s, v3.4s, v5.4s
fmla v22.4s, v4.4s, v5.4s
dup v5.4s, v6.s[2]
dup v6.4s, v6.s[3]
fmla v10.4s, v2.4s, v1.4s
fmla v12.4s, v3.4s, v1.4s
fmla v13.4s, v4.4s, v1.4s
fmla v23.4s, v2.4s, v5.4s
fmla v24.4s, v3.4s, v5.4s
fmla v25.4s, v4.4s, v5.4s
fmla v26.4s, v2.4s, v6.4s
fmla v27.4s, v3.4s, v6.4s
fmla v31.4s, v4.4s, v6.4s
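For readers less familiar with the two instruction shapes above, here is a minimal sketch of the same operation written both ways with ACLE intrinsics. It is not taken from Eigen or from the attached benchmark; the function names fma_by_lane1 and fma_dup_lane1 are hypothetical, and the scalar fallback exists only so the snippet builds on non-aarch64 hosts. vfmaq_laneq_f32 is the intrinsic for the by-element FMLA form (v29.s[1] style operand), while vdupq_laneq_f32 followed by vfmaq_f32 spells out the DUP plus full-vector FMLA form. What a given compiler actually emits for either function depends on its version and options.

```cpp
#include <cassert>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helpers for illustration only (not Eigen's kernel).
// Both compute acc[i] += a[i] * b[1] for i in 0..3.

// Form 1: lane-wise multiply-accumulate, the lane is an operand of the FMA.
inline void fma_by_lane1(float acc[4], const float a[4], const float b[4]) {
#if defined(__aarch64__)
  float32x4_t vacc = vld1q_f32(acc);
  const float32x4_t va = vld1q_f32(a);
  const float32x4_t vb = vld1q_f32(b);
  vacc = vfmaq_laneq_f32(vacc, va, vb, 1);  // by-element form: vb.s[1]
  vst1q_f32(acc, vacc);
#else
  for (int i = 0; i < 4; ++i) acc[i] += a[i] * b[1];  // portable fallback
#endif
}

// Form 2: broadcast the lane into a full vector first, then a vector FMA.
inline void fma_dup_lane1(float acc[4], const float a[4], const float b[4]) {
#if defined(__aarch64__)
  float32x4_t vacc = vld1q_f32(acc);
  const float32x4_t va = vld1q_f32(a);
  const float32x4_t vb = vld1q_f32(b);
  const float32x4_t vdup = vdupq_laneq_f32(vb, 1);  // explicit lane broadcast
  vacc = vfmaq_f32(vacc, va, vdup);                 // full-vector FMA
  vst1q_f32(acc, vacc);
#else
  const float s = b[1];
  for (int i = 0; i < 4; ++i) acc[i] += a[i] * s;  // portable fallback
#endif
}
```

The two functions return identical results; the difference that matters for this report is only whether the broadcast survives as a separate instruction in the hot loop.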
I've bisected this to commit g:9dbff9c05520a74e6cd337578f27b56c941f64f3,
Revert "Revert "combine: Don't combine if I2 does not change"".
The Eigen code that generates the FMLAs is in
Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h; it references PR89101 as a
previous incarnation of this bug, which they had to work around with inline
assembly.
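As a rough idea of what an inline-assembly workaround of that kind can look like, the sketch below pins the by-element FMLA with GNU extended asm. It is an assumption-laden illustration, not a copy of the Eigen source: the helper name fma_lane0_pinned is hypothetical, lane 0 is chosen arbitrarily, and the scalar fallback is there only so the snippet compiles on other targets.

```cpp
#include <cassert>
#if defined(__aarch64__) && defined(__GNUC__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not Eigen's code): acc[i] += a[i] * b[0] for i in 0..3.
inline void fma_lane0_pinned(float acc[4], const float a[4], const float b[4]) {
#if defined(__aarch64__) && defined(__GNUC__)
  float32x4_t vacc = vld1q_f32(acc);
  const float32x4_t va = vld1q_f32(a);
  const float32x4_t vb = vld1q_f32(b);
  // "w" asks for an FP/SIMD register; "+w" marks vacc as both read and written.
  // Writing the instruction by hand fixes the lane-wise form regardless of
  // what the optimizer would have chosen for the intrinsic.
  asm("fmla %0.4s, %1.4s, %2.s[0]" : "+w"(vacc) : "w"(va), "w"(vb));
  vst1q_f32(acc, vacc);
#else
  for (int i = 0; i < 4; ++i) acc[i] += a[i] * b[0];  // portable fallback
#endif
}
```

The trade-off of this approach is that the compiler can no longer schedule or combine across the asm statement as freely, which is why a fix in the compiler itself is preferable.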