[Bug tree-optimization/115438] [15 Regression] 503.bwaves_r regressed 5-11% on different x86_64 machines at -Ofast -march=native since r15-1006-gd93353e6423eca

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 26 Nov 2024 05:16:59 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438


--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> One difference between SLP and non-SLP is that with SLP we use the scalar
> initial value as the initial reduction value, while with non-SLP we
> use zero as the initial reduction value and compensate in the epilogue:
> 
>   _1615 = {tmp_111, 0.0, 0.0, 0.0};
>   # _1619 = PHI <_1618(116), _1615(119)>
> ...
>   _1623 = .REDUC_PLUS (vect_tmp_1505.835_1621);
> 
> vs.
> 
>   # _1346 = PHI <_1345(98), { 0.0, 0.0, 0.0, 0.0 }(94)>
> ...
>   _1385 = .REDUC_PLUS (vect_tmp_1268.744_1383);
>   _1386 = tmp_710 + _1385;
> 
> So while the profile clearly shows a difference between GCC 14.2 and trunk,
> I can't yet pinpoint what causes it.
> 
> The same can be seen for the other similar loops in this function.

So I can confirm the above difference is one reason for the slowdown.
When imposing this onto the non-SLP code on the 14 branch I see
(base is original branch, peak is SLP behavior for non-SLP on branch):

Overhead       Samples  Command          Shared Object           Symbol
  28.30%        252838  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
  28.14%        252817  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
   9.01%         81068  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   8.73%         77406  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   8.68%         77601  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   6.03%         53902  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_

I can't really explain why the different accumulator init handling makes
a difference.  This is STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT, which is never
used with SLP.  The optimization by itself shouldn't really be responsible
for a difference this large.
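
For illustration only, the two accumulator init schemes can be sketched in
plain C roughly as below.  This is a hand-written model, not vectorizer
output: the function names, the fixed LANES of 4 and the scalar tail loop
are assumptions made for the sketch, and the pairwise lane sum merely
stands in for .REDUC_PLUS.

```c
#include <stddef.h>

#define LANES 4  /* assumed vector width, matching the 4-element vectors above */

/* SLP-style: the scalar initial value goes into lane 0 of the vector
   accumulator, so no adjustment is needed after the loop.  */
double
reduce_init_in_lane (const double *x, size_t n, double init)
{
  double acc[LANES] = { init, 0.0, 0.0, 0.0 };
  size_t i;

  for (i = 0; i + LANES <= n; i += LANES)
    for (size_t l = 0; l < LANES; l++)
      acc[l] += x[i + l];

  /* Horizontal add, standing in for .REDUC_PLUS.  */
  double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

  /* Scalar tail for the elements that do not fill a whole vector.  */
  for (; i < n; i++)
    sum += x[i];
  return sum;
}

/* non-SLP-style: the accumulator starts at zero and the initial value
   is added once in the epilogue (the "epilogue adjustment").  */
double
reduce_adjust_in_epilogue (const double *x, size_t n, double init)
{
  double acc[LANES] = { 0.0, 0.0, 0.0, 0.0 };
  size_t i;

  for (i = 0; i + LANES <= n; i += LANES)
    for (size_t l = 0; l < LANES; l++)
      acc[l] += x[i + l];

  double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
  sum = init + sum;  /* epilogue adjustment */

  for (; i < n; i++)
    sum += x[i];
  return sum;
}
```

Both variants execute the same loop body; they differ only in the
accumulator setup before the loop and in one scalar add after it.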

When comparing that PEAK to trunk there's a remaining difference in
mat_times_vec_:

Overhead       Samples  Command          Shared Object           Symbol
  30.06%        287313  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
  26.58%        253234  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
   8.46%         80752  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_
   8.41%         80455  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   8.15%         77204  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   7.93%         75602  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   3.31%         31853  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] jacobian_
   3.28%         31783  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] jacobian_

Here I can't spot the difference either (the GIMPLE IL is identical).
