[Bug tree-optimization/49955] Fails to do partial basic-block SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED
   Target Milestone|---                         |14.0

--- Comment #7 from Richard Biener ---
This is fixed now.
[Bug tree-optimization/49955] Fails to do partial basic-block SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

--- Comment #6 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:d9f3ea61fe36e2de3354b90b65ff8245099114c9

commit r14-3078-gd9f3ea61fe36e2de3354b90b65ff8245099114c9
Author: Richard Biener
Date:   Mon Aug 7 14:44:20 2023 +0200

    tree-optimization/49955 - BB reduction with odd number of lanes

    The following enhances BB reduction vectorization to support
    vectorizing only a subset of the lanes, keeping the rest as
    scalar ops.  For now we try to make the number of lanes even
    by leaving alone the "last" lane.  That's because SLP discovery
    with all lanes will fail too soon to get us any hint on which
    lane to strip and likewise we don't know what vector modes the
    target supports so restricting ourselves to power-of-two or other
    cases isn't easy.

    This is enough to get at the vectorization opportunity for the
    testcase in the PR - albeit with the chosen lanes not optimal but
    at least vectorizable.

            PR tree-optimization/49955
            * tree-vectorizer.h (_slp_instance::remain_stmts): New.
            (SLP_INSTANCE_REMAIN_STMTS): Likewise.
            * tree-vect-slp.cc (vect_free_slp_instance): Release
            SLP_INSTANCE_REMAIN_STMTS.
            (vect_build_slp_instance): Make the number of lanes of
            a BB reduction even.
            (vectorize_slp_instance_root_stmt): Handle unvectorized
            defs of a BB reduction.

            * gfortran.dg/vect/pr49955.f: New testcase.
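To make the effect of the commit above concrete, here is a minimal hand-written C sketch of a three-lane reduction in which two lanes are computed with one vector multiply and the last lane stays scalar. It is an illustration only, not what GCC emits: the function names are hypothetical, the `v2df` type relies on the GCC/Clang `vector_size` extension, and the choice of which lane to leave out simply mirrors the "last lane" rule from the commit message.

```c
/* Two doubles packed into one 16-byte vector (GCC/Clang extension).  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Scalar form: a reduction over three lanes, one multiply per lane.  */
double
sum_sq_scalar (double a, double b, double c)
{
  return (a * a + b * b) + c * c;
}

/* Hypothetical partial form: lanes 0 and 1 share one vector multiply,
   the remaining lane is kept as a scalar operation and added last.  */
double
sum_sq_partial (double a, double b, double c)
{
  v2df v = { a, b };
  v2df sq = v * v;              /* vectorized lanes */
  double rem = c * c;           /* unvectorized "last" lane */
  return (sq[0] + sq[1]) + rem; /* reduce the vector, then add the rest */
}
```

Both functions add the terms in the same order here, so they return identical values. In general, regrouping a floating-point reduction can change rounding, which is why this kind of transformation depends on the floating-point options in effect.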
[Bug tree-optimization/49955] Fails to do partial basic-block SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
The loop in comment#1 isn't vectorized because we do not have interleaving
support for a group size of 5:

t.f:18:17: missed: the size of the group of accesses is not a power of 2 or not equal to 3
t.f:18:17: missed: not falling back to elementwise accesses
t.f:19:72: missed: not vectorized: relevant stmt not supported: t1_83 = (*q_82(D))[_21];
t.f:18:17: missed: bad operation or unsupported loop bound.

we don't try to SLP this because there's just a single lane reduction.

There's not really a loop vectorization opportunity and as comment#3 says
there's at most a BB reduction opportunity.  We try to analyze that now:

  _58 = powmult_9 + powmult_107;
  t7_108 = _58 + powmult_88;
  t7_109 = __builtin_sqrt (t7_108);
  M.7_110 = MAX_EXPR ;

and

t.f:28:72: note: Starting SLP discovery for
t.f:28:72: note:   powmult_88 = _106 * _106;
t.f:28:72: note:   powmult_9 = _101 * _101;
t.f:28:72: note:   powmult_107 = _96 * _96;
t.f:28:72: note: starting SLP discovery for node 0x50ef8a0
t.f:28:72: note: Build SLP for powmult_88 = _106 * _106;
t.f:28:72: note: get vectype for scalar type (group size 3): real(kind=8)
t.f:28:72: note: vectype: vector(2) real(kind=8)
t.f:28:72: note: nunits = 2
t.f:28:72: missed: Build SLP failed: unrolling required in basic block SLP

we do not yet have code to limit a BB reduction vectorization to a subset
of lanes (in this case it's uniform so choosing any power-of-two elements
would work but ideally we'd let SLP discovery figure out the "best" lane
combination to vectorize - there's more missing support for BB reduction
vectorization).
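The "unrolling required" failure in the dump above comes down to simple arithmetic: a group of 3 statements does not fill a whole number of 2-lane vectors, and a basic block has no loop to unroll. The C sketch below works through that arithmetic under stated assumptions. All helper names are hypothetical, and the formula (least common multiple divided by group size) is a simplified model of the vectorizer's reasoning, not a copy of GCC's code.

```c
/* Greatest common divisor by Euclid's algorithm.  */
static unsigned
gcd_u (unsigned a, unsigned b)
{
  while (b != 0)
    {
      unsigned t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Copies of a GROUP_SIZE-statement group needed to fill whole vectors
   of NUNITS lanes: lcm (nunits, group_size) / group_size.  */
unsigned
unrolling_factor (unsigned nunits, unsigned group_size)
{
  return nunits / gcd_u (nunits, group_size);
}

/* Basic-block SLP cannot unroll, so only a factor of 1 is usable.  */
int
bb_slp_group_fits (unsigned nunits, unsigned group_size)
{
  return unrolling_factor (nunits, group_size) == 1;
}

/* Lanes kept when an odd-sized group leaves its last lane scalar.  */
unsigned
even_lanes (unsigned group_size)
{
  return group_size & ~1u;
}
```

With the values from the dump, `unrolling_factor (2, 3)` is 2, so the group is rejected. After dropping one lane, `even_lanes (3)` is 2 and `bb_slp_group_fits (2, 2)` holds.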
[Bug tree-optimization/49955] Fails to do partial basic-block SLP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955 --- Comment #4 from Andrew Pinski --- The testcase in comment #0 started to be vectorized in GCC 13
[Bug tree-optimization/49955] Fails to do partial basic-block SLP
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

--- Comment #3 from Ira Rosen 2011-08-05 10:50:27 UTC ---
(In reply to comment #1)
> The loop that remains after fixing PR49957 in 410.bwaves is the following,
> which loop SLP does not handle (well, I'm not exactly sure) because
>
> t.f:18: note: ==> examining statement: t1_62 = *q_61(D)[D.1645_60];
>
> t.f:18: note: num. args = 4 (not unary/binary/ternary op).
> t.f:18: note: vect_is_simple_use: operand *q_61(D)[D.1645_60]
> t.f:18: note: not ssa-name.
> t.f:18: note: use not simple.
> t.f:18: note: no array mode for V2DF[5]
> t.f:18: note: the size of the group of strided accesses is not a power of 2
> t.f:18: note: not vectorized: relevant stmt not supported: t1_62 =
> *q_61(D)[D.1645_60];
>
> t.f:18: note: bad operation or unsupported loop bound.
> t.f:1: note: vectorized 0 loops in function.
>
> probably the issue that we can't handle this kind of "invariants" in the
> SLP group?  Thus, the SLP group should be q(2,..), q(3,...) ... q(5, ...)
> which is size 4, q(1,..) should be treated as invariant.
>

This loop is not SLPed because there is no SLP opportunity here besides the
loads.  The only isomorphism after that is

      t2=q(2,i,j,k)/t1
      t3=q(3,i,j,k)/t1
      t4=q(4,i,j,k)/t1

and somewhat here

      t7=((dabs(t2)+t6)/dx+mu/dx**2)**2 +
     1   ((dabs(t3)+t6)/dy+mu/dy**2)**2 +
     2   ((dabs(t4)+t6)/dz+mu/dz**2)**2

but these are groups of 3.

Moreover, the current implementation starts building SLP tree from a group of
strided stores, or a group of reductions, or a reduction chain.  None of these
exist here.  But, again, even if we could start from a group of loads, it
wouldn't help us much here anyway.

Ira
[Bug tree-optimization/49955] Fails to do partial basic-block SLP
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Ira Rosen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm.com

--- Comment #2 from Ira Rosen 2011-08-05 10:38:53 UTC ---
(In reply to comment #0)
> but of course we could simply vectorize with an interleaving size of 4
> leaving the excess operations unvectorized (with optimization opportunity
> if we can pick a properly sized and aligned set of accesses).

Right.  I even had a patch for this some time ago.  I can try to bring it to
life.

Ira
[Bug tree-optimization/49955] Fails to do partial basic-block SLP
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Richard Guenther changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.08.03 15:12:42
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther 2011-08-03 15:12:42 UTC ---
The loop that remains after fixing PR49957 in 410.bwaves is the following,
which loop SLP does not handle (well, I'm not exactly sure) because

t.f:18: note: ==> examining statement: t1_62 = *q_61(D)[D.1645_60];

t.f:18: note: num. args = 4 (not unary/binary/ternary op).
t.f:18: note: vect_is_simple_use: operand *q_61(D)[D.1645_60]
t.f:18: note: not ssa-name.
t.f:18: note: use not simple.
t.f:18: note: no array mode for V2DF[5]
t.f:18: note: the size of the group of strided accesses is not a power of 2
t.f:18: note: not vectorized: relevant stmt not supported: t1_62 =
*q_61(D)[D.1645_60];

t.f:18: note: bad operation or unsupported loop bound.
t.f:1: note: vectorized 0 loops in function.

probably the issue that we can't handle this kind of "invariants" in the
SLP group?  Thus, the SLP group should be q(2,..), q(3,...) ... q(5, ...)
which is size 4, q(1,..) should be treated as invariant.

      subroutine shell(nx,ny,nz,q,dt,cfl,dx,dy,dz)
      implicit none
      integer nx,ny,nz,n,i,j,k
      real*8 cfl,dx,dy,dz,dt
      real*8 gm,Re,Pr,cfll,t1,t2,t3,t4,t5,t6,t7,t8,mu
      real*8 q(5,nx,ny,nz)
C This particular problem is periodic only
      cfll=0.1d0+(n-1.0d0)*cfl/20.0d0
      if (cfll.ge.cfl) cfll=cfl
      t8=0.0d0
      do k=1,nz
         do j=1,ny
            do i=1,nx
               t1=q(1,i,j,k)
               t2=q(2,i,j,k)/t1
               t3=q(3,i,j,k)/t1
               t4=q(4,i,j,k)/t1
               t5=(gm-1.0d0)*(q(5,i,j,k)-0.5d0*t1*(t2*t2+t3*t3+t4*t4))
               t6=dSQRT(gm*t5/t1)
               mu=gm*Pr*(gm*t5/t1)**0.75d0*2.0d0/Re/t1
               t7=((dabs(t2)+t6)/dx+mu/dx**2)**2 +
     1            ((dabs(t3)+t6)/dy+mu/dy**2)**2 +
     2            ((dabs(t4)+t6)/dz+mu/dz**2)**2
               t7=DSQRT(t7)
               t8=max(t8,t7)
            enddo
         enddo
      enddo
      dt=cfll / t8
      return
      end
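The grouping suggested in comment #1 (treat q(1,...) as an invariant and q(2,...) through q(5,...) as one group of four) can be pictured with a small C sketch. This is a hand-written illustration under assumptions, not compiler output: `scale_cell` is a hypothetical name, `v4df` uses the GCC/Clang `vector_size` extension, and the fourth lane (q(5,...)/t1) has no counterpart in the original loop. It is included only to show what a full group of four would look like.

```c
#include <string.h>

/* Four doubles packed into one 32-byte vector (GCC/Clang extension).  */
typedef double v4df __attribute__ ((vector_size (32)));

/* Q points at five contiguous doubles, laid out like q(1:5,i,j,k) in
   the Fortran testcase.  Component 1 is read once as a scalar and
   broadcast; components 2..5 are loaded as a single group of four.  */
void
scale_cell (const double *q, double out[4])
{
  double t1 = q[0];                    /* q(1,...): the invariant */
  v4df tail;
  memcpy (&tail, q + 1, sizeof tail);  /* q(2..5,...): unaligned load */
  v4df div = { t1, t1, t1, t1 };       /* broadcast of the invariant */
  v4df r = tail / div;                 /* lanes 0..2 match t2, t3, t4 */
  memcpy (out, &r, sizeof r);
}
```

Calling `scale_cell` on a cell holding 2, 4, 6, 8, 10 stores 2, 3, 4, 5 in `out`; the first three results correspond to t2, t3 and t4 in the loop body.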