https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
   Last reconfirmed|2025-04-27 00:00:00           |2025-04-28
             Status|UNCONFIRMED                   |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
It seems that -O2 is now faster but -O3 regressed; specifically, -O3 is now
slower than -O2.
With GCC 14 we vectorize the stores in (inlined)
static void pushEdgeFifo(EdgeFifo fifo, unsigned int a, unsigned int b, size_t& offset)
{
  fifo[offset][0] = a;
  fifo[offset][1] = b;
  offset = (offset + 1) & 15;
}
while with GCC 15 we only vectorize (as with GCC 14) the lower part of the
grouped store to (inlined) 'destination' in
static void writeTriangle(void* destination, size_t offset, size_t index_size,
                          unsigned int a, unsigned int b, unsigned int c)
{
  if (index_size == 2)
    ...
  else
  {
    static_cast<unsigned int*>(destination)[offset + 0] = a;
    static_cast<unsigned int*>(destination)[offset + 1] = b;
    static_cast<unsigned int*>(destination)[offset + 2] = c;
  }
}
and the reason is that we reject this with the default cost model (as we
don't emit vector CTORs from PHI args; the incoming 'a' and 'b' are quite
elaborately computed):
t2.c:1641:18: note: Costing subgraph:
t2.c:1641:18: note: node 0x1382e240 (max_nunits=2, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: op template: (*_202)[0] = a_618;
t2.c:1641:18: note: stmt 0 (*_202)[0] = a_618;
t2.c:1641:18: note: stmt 1 (*_202)[1] = c_76;
t2.c:1641:18: note: children 0x1382e900
t2.c:1641:18: note: node (external) 0x1382e900 (max_nunits=1, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: { a_618, c_76 }
t2.c:1641:18: note: Cost model analysis:
a_618 1 times scalar_store costs 12 in body
c_76 1 times scalar_store costs 12 in body
a_618 1 times vector_store costs 12 in body
node 0x1382e900 1 times vec_construct costs 16 in prologue
t2.c:1641:18: note: Cost model analysis for part in loop 1:
Vector cost: 28
Scalar cost: 24
t2.c:1641:18: missed: not vectorized: vectorization is not profitable.
The vector cost is higher because the vector construction requires a
GPR<->XMM move. If you use any non-generic tuning like -mtune=intel or
-mtune=znver4 you get the stores vectorized again.
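The arithmetic behind the "not profitable" verdict can be sketched as below.
This is only a simplified model, not GCC's actual costing code: the struct and
function names are hypothetical, and the rule "vectorize only if the summed
vector cost is strictly below the summed scalar cost" is an assumption read off
the dump. The second data point uses the vec_construct cost of 10 from the
GCC 14 dump quoted later in this comment.

```cpp
// Simplified, hypothetical model of the cost comparison shown in the dump.
// Assumption: a store group is vectorized only when the total vector cost
// is strictly lower than the total scalar cost.
struct SlpCost
{
    int scalar_body;     // sum of the scalar_store costs
    int vector_body;     // vector_store cost
    int vector_prologue; // vec_construct cost
};

constexpr int vectorCost(SlpCost c) { return c.vector_body + c.vector_prologue; }
constexpr bool isProfitable(SlpCost c) { return vectorCost(c) < c.scalar_body; }

// GCC 15, generic tuning: 12 + 16 = 28 against 12 + 12 = 24.
constexpr SlpCost gcc15Generic = {12 + 12, 12, 16};
// GCC 14 case: vec_construct costs 10, so 12 + 10 = 22 against 24.
constexpr SlpCost gcc14Case = {12 + 12, 12, 10};

static_assert(vectorCost(gcc15Generic) == 28, "matches 'Vector cost: 28'");
static_assert(!isProfitable(gcc15Generic), "28 >= 24: rejected");
static_assert(isProfitable(gcc14Case), "22 < 24: accepted");
```

In this model a vec_construct cost of 16 tips the balance against
vectorization, while a cost of 10 does not.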
Note that the regression is in some of the cases where GCC 14 has
t2.c:1641:18: note: Costing subgraph:
t2.c:1641:18: note: node 0x350534d8 (max_nunits=2, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: op template: (*_147)[0] = c_64;
t2.c:1641:18: note: stmt 0 (*_147)[0] = c_64;
t2.c:1641:18: note: stmt 1 (*_147)[1] = b_114;
t2.c:1641:18: note: children 0x350535e8
t2.c:1641:18: note: node (external) 0x350535e8 (max_nunits=1, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: { c_64, b_114 }
t2.c:1641:18: note: Cost model analysis:
c_64 1 times scalar_store costs 12 in body
b_114 1 times scalar_store costs 12 in body
c_64 1 times vector_store costs 12 in body
node 0x350535e8 1 times vec_construct costs 10 in prologue
for some of them. Those do not happen with GCC 15 because of the change, as
the load that would result in a reduction in cost is in a different basic
block, which is where the fix is required for correctness. One example is:
[t2.c:2032:17] b_845 = [t2.c:2032:63] [t2.c:2032:60] edgefifo[_847][1];
_356 = codetri_842 & 15;
[t2.c:2035:8] fec_357 = (int) _356;
[t2.c:2039:4] if (fecmax_62 > fec_357)
goto <bb 211>; [50.00%]
else
goto <bb 198>; [50.00%]
<bb 195> [local count: 316429835]:
# next_410 = PHI <next_863(199), [t2.c:2046:10] next_949(213)>
# last_412 = PHI <[t2.c:2055:10 discrim 5] c_892(199), last_862(213)>
# c_360 = PHI <c_892(199), c_946(213)>
# vertexfifooffset_393 = PHI <[t2.c:1662:9] _898(199), [t2.c:1662:9] _955(213)>
# data_396 = PHI <data_893(199), data_856(213)>
[t2.c:1641:13] _402 = edgefifooffset_858 * 8;
[t2.c:1641:13] _406 = [t2.c:2062:17] &edgefifo + _402;
[t2.c:1641:18] [t2.c:1641:16] (*_406)[0] = c_360;
[t2.c:1642:18] [t2.c:1642:16] (*_406)[1] = b_845;
where the vector initializer is put into the BB of the store and the
costing assumes we'd manage to put the load into an XMM reg directly.
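For experimenting with the store pattern discussed above, a reduced sketch
might look like the following. It is not the reporter's test case: the
EdgeFifo typedef is an assumption (a 16-entry ring of pairs, as the `& 15`
mask suggests), and fillEdges is a hypothetical driver whose only purpose is
to make 'a' and 'b' arrive at the stores from different paths. Whether the two
stores get vectorized depends on the GCC version and tuning;
-fopt-info-vec-missed should report the decision.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: a 16-entry ring buffer of edge pairs.
typedef unsigned int EdgeFifo[16][2];

// Same shape as the function quoted above: two adjacent stores that form a
// two-lane store group, followed by a wrap-around of the ring offset.
static void pushEdgeFifo(EdgeFifo fifo, unsigned int a, unsigned int b, std::size_t& offset)
{
    fifo[offset][0] = a;
    fifo[offset][1] = b;
    offset = (offset + 1) & 15;
}

// Hypothetical driver: 'a' and 'b' may be swapped on one path, so after
// inlining they can reach the stores as merged values, not fresh loads.
std::size_t fillEdges(EdgeFifo fifo, const unsigned int* in, std::size_t count)
{
    assert(in != nullptr);
    std::size_t offset = 0;
    for (std::size_t i = 0; i + 1 < count; i += 2)
    {
        unsigned int a = in[i], b = in[i + 1];
        if (a > b)
        {
            unsigned int t = a;
            a = b;
            b = t;
        }
        pushEdgeFifo(fifo, a, b, offset);
    }
    return offset; // next free slot in the ring
}
```

Compiling this with -O3 under generic tuning and again with -mtune=znver4
would be one way to compare the cost model's decisions on a small input.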
So confirmed. I'll think about whether we can do something here.