https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> One difference between SLP and non-SLP is that with SLP we use the
> initial value as the initial reduction value, while with non-SLP we
> use zero as the initial reduction value and compensate in the epilogue:
>
> _1615 = {tmp_111, 0.0, 0.0, 0.0};
> # _1619 = PHI <_1618(116), _1615(119)>
> ...
> _1623 = .REDUC_PLUS (vect_tmp_1505.835_1621);
>
> vs.
>
> # _1346 = PHI <_1345(98), { 0.0, 0.0, 0.0, 0.0 }(94)>
> ...
> _1385 = .REDUC_PLUS (vect_tmp_1268.744_1383);
> _1386 = tmp_710 + _1385;
>
> So while the profile clearly shows a difference between GCC 14.2 and trunk,
> I can't yet pinpoint what makes the difference.
>
> The same can be seen for the other similar loops in this function.
So I can confirm the above difference is one reason for the slowdown.
When imposing the SLP behavior onto the non-SLP code on the 14 branch I see
(base is the original branch, peak is the branch with SLP-style initial
value handling for non-SLP):
Overhead  Samples  Command          Shared Object           Symbol
  28.30%   252838  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
  28.14%   252817  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
   9.01%    81068  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   8.73%    77406  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   8.68%    77601  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   6.03%    53902  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_
I can't really explain why the different accumulator init handling makes
a difference. This is STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT, which is never
used with SLP. The optimization by itself shouldn't really be responsible
for a difference this large.
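
For reference, the two accumulator init schemes can be modelled in scalar C
roughly as below. This is only a sketch, not vectorizer output: the function
names, the 4-lane VF and the assumption that n is a multiple of VF are all
made up for illustration, and the final lane sum merely stands in for
.REDUC_PLUS.

```c
#define VF 4  /* assumed number of vector lanes, as in the dumps above */

/* SLP-style: the initial value sits in lane 0 of the accumulator,
   like _1615 = {tmp_111, 0.0, 0.0, 0.0}.  Assumes n % VF == 0.  */
double reduc_init_in_lane(const double *a, int n, double init)
{
    double acc[VF] = { init, 0.0, 0.0, 0.0 };
    for (int i = 0; i + VF <= n; i += VF)
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l];
    /* Stand-in for .REDUC_PLUS: sum the lanes.  */
    return ((acc[0] + acc[1]) + acc[2]) + acc[3];
}

/* non-SLP-style: zero accumulator, the initial value is added back
   after the lane reduction (the epilogue adjustment).  Assumes
   n % VF == 0.  */
double reduc_epilogue_adjust(const double *a, int n, double init)
{
    double acc[VF] = { 0.0, 0.0, 0.0, 0.0 };
    for (int i = 0; i + VF <= n; i += VF)
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l];
    double sum = ((acc[0] + acc[1]) + acc[2]) + acc[3];
    return init + sum;  /* like _1386 = tmp_710 + _1385 */
}
```

Both variants compute the same sum up to floating-point rounding. The only
structural difference is whether the add of the initial value happens inside
the vector accumulator or as one extra scalar add after the loop.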
When comparing that PEAK to trunk there's a remaining difference in
mat_times_vec_:
Overhead  Samples  Command          Shared Object           Symbol
  30.06%   287313  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
  26.58%   253234  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
   8.46%    80752  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_
   8.41%    80455  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   8.15%    77204  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   7.93%    75602  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   3.31%    31853  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] jacobian_
   3.28%    31783  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] jacobian_
Here I can't spot the difference either (the GIMPLE IL is identical).