https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463
--- Comment #32 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #31)
> (In reply to Tamar Christina from comment #29)
> > (In reply to Tamar Christina from comment #27)
> > > > >
> > > > > We don't already impose any order on them, but the other operand is
> > > > > oddodd, so
> > > > > the overall order ends up being oddodd because any known permute
> > > > > overrides
> > > > > unknown ones.
> > > >
> > > > So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's
> > > > the "other operand" of an add? What's the (bad) effect of classifying
> > > > it as ODDODD (optimistically)?
> > > >
> > > > > So the question is, can we not follow externals in a constructor to
> > > > > figure
> > > > > out if how they are used they all read from the same base and in
> > > > > which order?
> > > >
> > > > I don't see how it makes sense to do this. For the above example,
> > > > what's
> > > > the testcase exhibiting this (and on which arch)?
> > >
> > > I've been working on a fix from a different angle for this, which also
> > > covers another GCC 14 regression that went unnoticed. I'll post after
> > > regressions finish.
> >
> > So I've formalized the handling of TOP a bit better. Which gets it to
> > recognize it again, however, it will be dropped as it's not profitable.
> >
> > The reason it's not profitable is the canonicalization issue mentioned
> > above. This has split the imaginary and real nodes into different
> > computations.
> >
> > So no matter what you do in the SLP tree, the attached digraph won't make
> > the loads of _5 linear. Are you ok with me trying that Richi?
>
> I can't make sense of that graph - the node feeding the store seems to have
> wrong scalar stmts?
>
> What's the testcase for this (and on what arch?).
>

void fms_elemconjsnd(_Complex TYPE a[restrict N], _Complex TYPE b,
                     _Complex TYPE c[restrict N]) {
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * ~b;
}

compiled with -Ofast -march=armv8.3-a.

#define TYPE double
#define I 1.0i
#define N 200

void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180snd_1 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                  _Complex TYPE c[restrict N])
{
  _Complex TYPE t = I;
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * t * t);
}

is another one: the two functions compute the same thing, but the first one is
matched and the second one isn't.
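
To make the "same thing" claim concrete: multiplying by I twice is a negation, whether I is spelled as a literal or read from a variable. Below is a minimal sketch, assuming ISO C <complex.h> and its I macro instead of the testcase's 1.0i GNU literal. The helper names mul_literal, mul_variable, mul_negated and check_equivalence are hypothetical and not part of the testcase.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper mirroring the loop body of fms180snd.  */
double complex mul_literal(double complex a, double complex b)
{
  return a * (b * I * I);
}

/* Hypothetical helper mirroring the loop body of fms180snd_1.  */
double complex mul_variable(double complex a, double complex b)
{
  double complex t = I;
  return a * (b * t * t);
}

/* What both forms reduce to: I * I == -1, so b * I * I == -b.  */
double complex mul_negated(double complex a, double complex b)
{
  return a * -b;
}

/* Checks that all three spellings agree for one pair of inputs.  */
void check_equivalence(double complex a, double complex b)
{
  assert(mul_literal(a, b) == mul_negated(a, b));
  assert(mul_variable(a, b) == mul_negated(a, b));
}
```

Calling check_equivalence with finite inputs whose products are exactly representable (for example 1.5 + 2.0i and -0.5 + 4.0i) should pass. With arbitrary inputs and -Ofast the comparison may not be bit-exact, so treat it as an illustration only.
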
> But yes, the loads of _5 won't get linear here, but at least the
> permute node feeding the complex-add-rot270 can be elided, eventually
> even making the external _53, b$real_11 match the other with different
> order (though we don't model that, cost-wise).

But without the loads getting linearized the match will never work, as
multi-lane SLP will be cancelled immediately because it assumes load-lanes is
cheaper (it's not, but load-lanes doesn't get costed). That's why there's a
load permute optimization step after complex pattern matching.

The point, however, is that no permute is needed, *not even for the loads*.
GCC 13 generated:

fms_elemconjsnd:
        fneg    d1, d1
        mov     x2, 0
        dup     v4.2d, v0.d[0]
        dup     v3.2d, v1.d[0]
.L2:
        ldr     q1, [x0, x2]
        ldr     q0, [x1, x2]
        fmul    v2.2d, v3.2d, v1.2d
        fcadd   v0.2d, v0.2d, v2.2d, #270
        fmls    v0.2d, v4.2d, v1.2d
        str     q0, [x1, x2]
        add     x2, x2, 16
        cmp     x2, 3200
        bne     .L2
        ret

which was the optimal sequence.
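
To show why that sequence needs no permute, here is a scalar model of one loop iteration, checked against the source-level expression c[i] -= a[i] * ~b. It is a sketch under two assumptions: that b arrives with its real part in d0 and its imaginary part in d1 (as I'd expect for a _Complex double argument on AArch64), and that fcadd #270 computes re = op1.re + op2.im, im = op1.im - op2.re. The function names are hypothetical. Each vector register is modelled as a (re, im) pair, which is exactly the layout a linear ldr q of one complex element gives.

```c
#include <assert.h>
#include <complex.h>

/* Source-level meaning of the loop body; ~b in GNU C is conj(b).  */
double complex reference_step(double complex a, double complex b,
                              double complex c)
{
  return c - a * conj(b);
}

/* Scalar model of one iteration of the GCC 13 loop.  */
double complex gcc13_step(double complex a, double complex b,
                          double complex c)
{
  double neg_bi = -cimag(b);   /* fneg d1, d1; broadcast by dup v3.2d  */
  double br = creal(b);        /* broadcast by dup v4.2d               */

  double v1_re = creal(a), v1_im = cimag(a);   /* ldr q1, [x0, x2]     */
  double v0_re = creal(c), v0_im = cimag(c);   /* ldr q0, [x1, x2]     */

  /* fmul v2.2d, v3.2d, v1.2d: lane-wise multiply by -imag(b).  */
  double v2_re = neg_bi * v1_re;
  double v2_im = neg_bi * v1_im;

  /* fcadd v0.2d, v0.2d, v2.2d, #270: add v2 rotated by 270 degrees.  */
  double t_re = v0_re + v2_im;
  double t_im = v0_im - v2_re;

  /* fmls v0.2d, v4.2d, v1.2d: lane-wise v0 -= real(b) * a.  The
     hardware instruction is fused; this model rounds the product.  */
  t_re -= br * v1_re;
  t_im -= br * v1_im;

  return t_re + t_im * I;
}

/* Self-check on inputs whose products are exactly representable.  */
void check_gcc13_step(void)
{
  double complex a = 1.5 + 2.0 * I;
  double complex b = 0.5 - 4.0 * I;
  double complex c = 3.0 + 1.0 * I;
  assert(gcc13_step(a, b, c) == reference_step(a, b, c));
}
```

Every step works on a[i] and c[i] in their natural (real, imaginary) order, and the only reordering is the one fcadd does internally. That is why neither the loads nor the store need a permute.
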