https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108608
--- Comment #13 from fengfei.xi at horizon dot auto ---
(In reply to Richard Sandiford from comment #12)
> (In reply to fengfei.xi from comment #11)
> > could you please explain under what specific circumstances this change might
> > lead to slower performance?
> > Also, is there a more complete fix or any plans for further optimization?
> The log message was a bit cryptic, sorry. The problem isn't that the patch
> makes things slower. Instead, it's the feature that the patch is fixing
> that makes things slower.
>
> If the vectoriser vectorises something like:
>
> int64_t f(int32_t *x, int32_t *y) {
>   int64_t res = 0;
>   for (int i = 0; i < 100; ++i)
>     res += x[i] * y[i];
>   return res;
> }
>
> one option is to have one vector of 32-bit integers for each of x and y and
> two vectors of 64-bit integers for res (so that the total number of elements
> is the same). With this approach, the vectoriser can do two parallel
> additions on each res vector.
>
> In contrast, single def-use cycles replace the two res vectors with one res
> vector but add to it twice. You can see the effect in
> https://godbolt.org/z/o11zrMbWs . The main loop is:
>
> ldr q30, [x1, x2]
> ldr q29, [x0, x2]
> add x2, x2, 16
> mul v29.4s, v30.4s, v29.4s
> saddw v31.2d, v31.2d, v29.2s
> saddw2 v31.2d, v31.2d, v29.4s
> cmp x2, 400
> bne .L2
>
> This adds to v31 twice, doubling the loop-carried latency. Ideally we would
> do:
>
> ldr q30, [x1, x2]
> ldr q29, [x0, x2]
> add x2, x2, 16
> mul v29.4s, v30.4s, v29.4s
> saddw v31.2d, v31.2d, v29.2s
> saddw2 v28.2d, v28.2d, v29.4s
> cmp x2, 400
> bne .L2
> add v31.2d, v31.2d, v28.2d
>
> instead.
>
> The vectoriser specifically chooses the first (serial) version over the
> second (parallel) one. The commit message was complaining about that. But
> the patch doesn't change that decision. It just makes both versions work.
OK. Thank you very much for your detailed explanation. I understand.
Best regards,
Fengfei.Xi