https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108608
--- Comment #13 from fengfei.xi at horizon dot auto ---
(In reply to Richard Sandiford from comment #12)
> (In reply to fengfei.xi from comment #11)
> > could you please explain under what specific circumstances this change might
> > lead to slower performance?
> > Also, is there a more complete fix or any plans for further optimization?
> The log message was a bit cryptic, sorry. The problem isn't that the patch
> makes things slower. Instead, it's the feature that the patch is fixing
> that makes things slower.
>
> If the vectoriser vectorises something like:
>
> int64_t f(int32_t *x, int32_t *y) {
>   int64_t res = 0;
>   for (int i = 0; i < 100; ++i)
>     res += x[i] * y[i];
>   return res;
> }
>
> one option is to have one vector of 32-bit integers for each of x and y and
> two vectors of 64-bit integers for res (so that the total number of elements
> is the same). With this approach, the vectoriser can do two parallel
> additions on each res vector.
>
> In contrast, single def-use cycles replace the two res vectors with one res
> vector but add to it twice. You can see the effect in
> https://godbolt.org/z/o11zrMbWs . The main loop is:
>
> ldr q30, [x1, x2]
> ldr q29, [x0, x2]
> add x2, x2, 16
> mul v29.4s, v30.4s, v29.4s
> saddw v31.2d, v31.2d, v29.2s
> saddw2 v31.2d, v31.2d, v29.4s
> cmp x2, 400
> bne .L2
>
> This adds to v31 twice, doubling the loop-carried latency. Ideally we would
> do:
>
> ldr q30, [x1, x2]
> ldr q29, [x0, x2]
> add x2, x2, 16
> mul v29.4s, v30.4s, v29.4s
> saddw v31.2d, v31.2d, v29.2s
> saddw2 v28.2d, v28.2d, v29.4s
> cmp x2, 400
> bne .L2
> add v31.2d, v31.2d, v28.2d
>
> instead.
>
> The vectoriser specifically chooses the first (serial) version over the
> second (parallel) one. The commit message was complaining about that. But
> the patch doesn't change that decision. It just makes both versions work.
OK. Thank you very much for your detailed explanation. I understand.
Best regards,
Fengfei.Xi