https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118505
--- Comment #5 from Dhruv Chawla <dhruvc at nvidia dot com> ---
(In reply to Andrew Pinski from comment #3)
> Note there is also a fma forming missing:
> _69 = s_64 + 1.0e+0;
> ...
> _71 = _69 * _70;
>
> which is:
> `(s_64 + 1.0) * _70` which can be rewritten as `s_64 * _70 + _70`
>
> That alone might get the performance back up. I should note that LLVM also
> does the fcsel, but changes the two-instruction `(a+1) * b` into one
> fma instruction `a*b + b`.
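In source terms, the suggested rewrite is roughly the following (a minimal
sketch with invented names; note that the two forms can round differently, so
the transform would only be valid under relaxed floating-point semantics):

#include <math.h>

/* Original shape: the add feeds the multiply, so no fma can be formed
   directly from (s + 1.0f) * t.  */
float mul_plus_one (float s, float t)
{
  return (s + 1.0f) * t;
}

/* Rewritten shape: distributing the multiply exposes s * t + t, which
   maps to a single fmadd on AArch64.  */
float mul_plus_one_fma (float s, float t)
{
  return fmaf (s, t, t);
}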
I tried doing this; the resulting codegen is:
fcmpe s2, #0.0               // compare s2 against 0.0
fmul s1, s30, s30            // s1 = s30 * s30
fcsel s31, s1, s31, gt       // s31 = (s2 > 0.0) ? s1 : s31
fmadd s0, s31, s0, s30       // s0 = s31 * s0 + s30
str s0, [x21, x0]
ldr s29, [x19, x0]
fmadd s29, s31, s29, s29     // s29 = s31 * s29 + s29, i.e. (s31 + 1.0) * s29
str s29, [x20, x0]
With this I don't really see a performance impact. Also, clang's codegen still
seems to be a bit slower than the split paths.
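For context, here is a rough C-level sketch of what the hand-edited sequence
above computes (all names are invented; this is not the actual benchmark
source):

#include <math.h>

void kernel_sketch (float *out1, float *out2, const float *in2,
                    float cond, float v, float x, float prev_scale, long i)
{
  /* fcmpe + fmul + fcsel: select the squared value when cond > 0.0,
     otherwise keep the previous scale (the value carried in s31).  */
  float scale = (cond > 0.0f) ? v * v : prev_scale;

  /* fmadd: scale * x + v  */
  out1[i] = fmaf (scale, x, v);

  /* fmadd: scale * in2[i] + in2[i], i.e. the rewritten
     (scale + 1.0f) * in2[i] discussed above.  */
  out2[i] = fmaf (scale, in2[i], in2[i]);
}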