https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108608
--- Comment #12 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
(In reply to fengfei.xi from comment #11)
> could you please explain under what specific circumstances this change might
> lead to slower performance?
> Also, is there a more complete fix or any plans for further optimization?
The log message was a bit cryptic, sorry. The problem isn't that the patch
makes things slower. Instead, it's the feature that the patch is fixing that
makes things slower.
If the vectoriser vectorises something like:
#include <stdint.h>

/* Widening sum of products: 32-bit products accumulated into a 64-bit total.  */
int64_t f(int32_t *x, int32_t *y) {
  int64_t res = 0;
  for (int i = 0; i < 100; ++i)
    res += x[i] * y[i];
  return res;
}
one option is to have one vector of 32-bit integers for each of x and y and two
vectors of 64-bit integers for res (so that the total number of elements is the
same). With this approach, the vectoriser can do the two additions, one to each
res vector, in parallel.
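At the source level, that parallel scheme corresponds roughly to keeping two
partial sums and only combining them after the loop. The sketch below is purely
illustrative (the name f_two_accs and the even/odd split are mine, not what the
vectoriser literally emits); the point is that res0 and res1 carry independent
dependence chains:

  #include <stdint.h>

  /* Illustrative scalar analogue of the two-accumulator scheme: res0 and res1
     each carry their own loop-carried dependence chain, and the chains are
     combined only once, after the loop.  */
  int64_t f_two_accs(int32_t *x, int32_t *y) {
    int64_t res0 = 0, res1 = 0;
    for (int i = 0; i < 100; i += 2) {
      res0 += x[i] * y[i];
      res1 += x[i + 1] * y[i + 1];
    }
    return res0 + res1;
  }

(Whether the vectoriser preserves that structure depends on its own costing;
this is only meant to show the dependence-chain difference.)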
In contrast, single def-use cycles replace the two res vectors with one res
vector but add to it twice. You can see the effect in
https://godbolt.org/z/o11zrMbWs . The main loop is:
ldr q30, [x1, x2]
ldr q29, [x0, x2]
add x2, x2, 16
mul v29.4s, v30.4s, v29.4s
saddw v31.2d, v31.2d, v29.2s
saddw2 v31.2d, v31.2d, v29.4s
cmp x2, 400
bne .L2
This adds to v31 twice, doubling the loop-carried latency. Ideally we would
do:
ldr q30, [x1, x2]
ldr q29, [x0, x2]
add x2, x2, 16
mul v29.4s, v30.4s, v29.4s
saddw v31.2d, v31.2d, v29.2s
saddw2 v28.2d, v28.2d, v29.4s
cmp x2, 400
bne .L2
add v31.2d, v31.2d, v28.2d
instead.
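To put rough numbers on the difference (purely illustrative; real latencies
depend on the core): if saddw/saddw2 each have a latency of, say, 2 cycles,
then in the first loop the saddw2 into v31 has to wait for the saddw into v31,
so the loop-carried accumulation chain costs about 2 + 2 = 4 cycles per
iteration. In the second loop the v31 and v28 chains are independent, so each
costs about 2 cycles per iteration, and the extra add is paid only once,
outside the loop.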
The vectoriser specifically chooses the first (serial) version over the second
(parallel) one. The commit message was complaining about that. But the patch
doesn't change that decision. It just makes both versions work.