https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125461
Bug ID: 125461
Summary: znver5: 14.2 vs trunk differ on SRA store shape/LIM
and loop vector width (ZMM vs XMM)
Product: gcc
Version: 17.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: raghesh.aloor at amd dot com
CC: jamborm at gcc dot gnu.org, rguenth at gcc dot gnu.org,
venkataramanan.kumar at amd dot com
Target Milestone: ---
Created attachment 64554
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64554&action=edit
The preprocessed input file to be used for the command mentioned in the
description
GCC 14.2 vs trunk: SRA/LIM and loop-vectorization differences (znver5)
======================================================================
We are posting this report to get an initial opinion and advice on how to
proceed. We can share the full reduced microbenchmark source and the tree
dumps if required.
Context
-------
We compared GCC 14.2 and GCC trunk on a reduced version of a hot loop from an
application. The loop uses a user-defined vectorized array type,
SimdBlock8<double>. We are experimenting on an AMD znver5 (AVX-512) machine.
We observe that the trunk build is slower than the 14.2 build: the code
generated by trunk uses 128-bit XMM instructions for much of the loop, while
14.2 uses 512-bit ZMM instructions.
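The definition of SimdBlock8 is not included in this report. A minimal sketch
of such a type, assuming a plain array of eight lanes named data (the member
name and the 64-byte size are inferred from the SRA dumps quoted in Issue A;
everything else here is hypothetical), could look like this:

```cpp
#include <cstddef>

// Hypothetical stand-in for the application's SimdBlock8<T>.
// Only the member name `data` and the 8-lane layout are inferred from the
// report; the operators are an assumed, simplest-possible implementation.
template <typename T>
struct SimdBlock8
{
  T data[8];

  SimdBlock8 &operator+= (const SimdBlock8 &o)
  {
    for (std::size_t k = 0; k < 8; ++k)
      data[k] += o.data[k];
    return *this;
  }
};

// Lane-wise addition.
template <typename T>
inline SimdBlock8<T> operator+ (const SimdBlock8<T> &a, const SimdBlock8<T> &b)
{
  SimdBlock8<T> r;
  for (std::size_t k = 0; k < 8; ++k)
    r.data[k] = a.data[k] + b.data[k];
  return r;
}

// Lane-wise subtraction.
template <typename T>
inline SimdBlock8<T> operator- (const SimdBlock8<T> &a, const SimdBlock8<T> &b)
{
  SimdBlock8<T> r;
  for (std::size_t k = 0; k < 8; ++k)
    r.data[k] = a.data[k] - b.data[k];
  return r;
}

// Scalar times block, as used by `coeffs[i] * v_p0` in the kernel.
template <typename T>
inline SimdBlock8<T> operator* (T s, const SimdBlock8<T> &a)
{
  SimdBlock8<T> r;
  for (std::size_t k = 0; k < 8; ++k)
    r.data[k] = s * a.data[k];
  return r;
}

// Eight doubles occupy 64 bytes, i.e. the width of one ZMM register.
static_assert (sizeof (SimdBlock8<double>) == 64, "expected 8 x 8 bytes");
```

With a type of this shape, each operator is a loop over k = 0..7, which is
where the data[k] accesses in the dumps would come from.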
Reduced Testcase
----------------
Function: simd_coeff_apply_kernel_impl
inline void
simd_coeff_apply_kernel_impl (
    const double * restrict coeffs,
    const SimdBlock8<double> *src,
    SimdBlock8<double> *dst)
{
  using Vec = SimdBlock8<double>;
  constexpr int inner_n = 2;
  for (int block_i = 0; block_i < 6; ++block_i)
    {
      Vec v_p0 = src[0] + src[30];
      Vec v_m0 = src[0] - src[30];
      Vec v_p1 = src[6] + src[24];
      Vec v_m1 = src[6] - src[24];
      Vec v_p2 = src[12] + src[18];
      Vec v_m2 = src[12] - src[18];
      for (int k = 0; k < inner_n; ++k)
        {
          Vec acc0 = coeffs[k * 3] * v_p0;
          Vec acc1 = coeffs[(4 - k) * 3] * v_m0;
          acc0 += coeffs[k * 3 + 1] * v_p1;
          acc1 += coeffs[(4 - k) * 3 + 1] * v_m1;
          acc0 += coeffs[k * 3 + 2] * v_p2;
          acc1 += coeffs[(4 - k) * 3 + 2] * v_m2;
          dst[6 * k] = acc0 + acc1;
          dst[6 * (4 - k)] = acc0 - acc1;
        }
      Vec acc_tail = coeffs[6] * v_p0;
      acc_tail += coeffs[7] * v_p1;
      acc_tail += coeffs[8] * v_p2;
      dst[12] = acc_tail;
      src += 1;
      dst += 1;
    }
  src += 6 * 5;
  dst += 6 * 4;
}
Compiler command used:
g++ -S -m64 -O3 -march=znver5 -std=c++17 -fdump-tree-sra -fdump-tree-lim2
-fdump-tree-vect simd_coeff_apply_microbench.i -fopt-info-vec
-fopt-info-vec-missed -o simd_coeff_apply_microbench-gcc.s
We see two differences between GCC 14.2 and trunk on this test program.
They may be unrelated, but both matter for the final code. We describe them
as Issue A and Issue B below.
Issue A — SRA and LIM (before the loop vectorizer runs)
-------------------------------------------------------
This looks like a different problem from Issue B below, but both show up in
the same test program.
After pass_sra, the store dst[12] = acc_tail appears in the compiler IR as
follows on 14.2 vs trunk:
GCC 14.2:
MEM[(struct SimdBlock8 *)dst + 768B].data[k]
GCC trunk (after PR118924):
MEM<double> [(struct SimdBlock8 *)dst + (768 + 8*k)B]
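As written, both forms address the same bytes; only the shape of the
reference differs. The offsets can be checked by hand with the sketch below,
which assumes SimdBlock8<double> is laid out as a plain array of 8 doubles
(Block8, lane_offset and coeff_offset are hypothetical names used only for
this illustration):

```cpp
#include <cstddef>

// Assumed layout of SimdBlock8<double>: 8 contiguous doubles, 64 bytes.
struct Block8
{
  double data[8];
};

// Byte offset of lane k of dst[idx], relative to dst.
constexpr std::size_t lane_offset (std::size_t idx, std::size_t k)
{
  return idx * sizeof (Block8) + k * sizeof (double);
}

// Byte offset of coeffs[i], relative to coeffs.
constexpr std::size_t coeff_offset (std::size_t i)
{
  return i * sizeof (double);
}

// dst[12] starts at 12 * 64 = 768 bytes; lane k sits at 768 + 8*k.
static_assert (lane_offset (12, 0) == 768, "dst[12] base offset");
static_assert (lane_offset (12, 3) == 768 + 8 * 3, "dst[12].data[3]");

// coeffs[6], coeffs[7], coeffs[8] are at +48, +56, +64.
static_assert (coeff_offset (6) == 48, "coeffs[6]");
static_assert (coeff_offset (7) == 56, "coeffs[7]");
static_assert (coeff_offset (8) == 64, "coeffs[8]");
```

The 14.2 form keeps the component reference (.data[k]) on top of the
SimdBlock8 base, while the trunk form folds the lane index into a bare double
access at a variable byte offset.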
That difference affects pass_lim (lim2). The loads from coeffs[6], coeffs[7],
and coeffs[8] (byte offsets +48, +56, +64 from coeffs) do not change inside
the outer loop, but:
- On 14.2, lim2 determines that they do not alias the first dst store and
moves them before the loop. The dumps show:
Moving statement
_47 = MEM[(const double *)coeffs_71(D) + 48B];
(cost 20) out of loop 1.
Moving statement
_49 = MEM[(const double *)coeffs_71(D) + 56B];
(cost 20) out of loop 1.
Moving statement
_51 = MEM[(const double *)coeffs_71(D) + 64B];
(cost 20) out of loop 1.
- On trunk, lim2 says they depend on the first flattened dst store at
+768B and leaves them inside the loop.
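In source terms, the 14.2 lim2 result for the tail computation corresponds
roughly to the hand-hoisted form below. This is an illustration only, not a
proposed fix: Block8 and apply_tail_hoisted are hypothetical names, only the
dst[12] part of the kernel is shown, and the hoist is valid only under the
assumption that the dst stores never overlap coeffs.

```cpp
// Assumed layout of SimdBlock8<double>: 8 contiguous doubles.
struct Block8
{
  double data[8];
};

// __restrict is the GCC spelling accepted in C++; the kernel in this report
// writes `restrict`.
inline void
apply_tail_hoisted (const double *__restrict coeffs,
                    const Block8 *src, Block8 *dst)
{
  // Loaded once before the outer loop, mirroring the three statements
  // that lim2 reports as moved out of loop 1 on 14.2.
  const double c6 = coeffs[6];
  const double c7 = coeffs[7];
  const double c8 = coeffs[8];

  for (int block_i = 0; block_i < 6; ++block_i)
    {
      for (int k = 0; k < 8; ++k)
        {
          const double p0 = src[0].data[k] + src[30].data[k];
          const double p1 = src[6].data[k] + src[24].data[k];
          const double p2 = src[12].data[k] + src[18].data[k];
          // Same left-to-right accumulation order as acc_tail.
          dst[12].data[k] = c6 * p0 + c7 * p1 + c8 * p2;
        }
      src += 1;
      dst += 1;
    }
}
```

On trunk, the equivalent loads stay inside the block_i loop and are repeated
on each of the six iterations.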
Further analysis shows that trunk SRA sets grp_same_access_path = 0 on
acc_tail.data, while on 14.2 it stays 1.
Reverting these three commits on trunk brings back 14.2-like store shapes and
lim2 behaviour in our small tests:
40445711b8a sra: Clear grp_same_access_path ... (PR118924)
07d24367002 sra: Avoid creating TBAA hazards (PR118924)
0c286ea4006 sra: Dont use build_reconstructed_reference ... (PR122976)
Issue B — Loop vectorization (ZMM vs XMM)
-----------------------------------------
We are not sure whether Issue A and Issue B are related. What we do see is
that reverting only the three SRA commits above makes the IR before the loop
vectorizer match 14.2 again, but does not bring back 14.2-style
vectorization (ZMM) on the inner k loop.
At -O3:
- GCC 14.2: the loop vectorizer uses wide AVX-512 (ZMM). The -fopt-info
output mentions trying again with SLP turned off.
simd_coeff_apply_microbench.i:63:24: optimized: basic block part vectorized
using 64 byte vectors
simd_coeff_apply_microbench.i:63:24: optimized: basic block part vectorized
using 64 byte vectors
simd_coeff_apply_microbench.i:69:15: optimized: basic block part vectorized
using 64 byte vectors
simd_coeff_apply_microbench.i:64:30: optimized: basic block part vectorized
using 64 byte vectors
simd_coeff_apply_microbench.i:64:30: optimized: basic block part vectorized
using 64 byte vectors
- GCC trunk: the same loop is often vectorized with narrower XMM code and
different SLP / loop-vectorizer choices.
simd_coeff_apply_microbench.i:41:1: optimized: basic block part vectorized
using 16 byte vectors
simd_coeff_apply_microbench.i:41:1: optimized: basic block part vectorized
using 16 byte vectors
<...More dumps here...>
We found several 2025 vectorizer commits that might explain this, including
PR115895 (1b5d2ccd060) and commits that removed some non-SLP loop-vector
paths (cfeee375ecc, da012141c28, 1ae9e3c88ea). Reverting that group might
help, but we did not finish the experiment: the reverts are large and
conflict in tree-vect-loop.cc.
Both issues come from the same test program and the same application loop.
We think they may have different causes, but we are not sure. We are not
asking for a fix right now; we would like guidance on whether to focus on
SRA/TBAA, LIM, or the loop vectorizer.