[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

amker at gcc dot gnu.org Thu, 18 Jan 2018 02:14:28 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604


--- Comment #12 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #11)
> On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> 
> 
> I think the zeroing stmt can be distributed into a separate loop nest
> (up to whatever level we choose) and in the then non-parallelized nest
> the memset can stay at the current level.  So distribute
> 
> >          do j=1,ny
> >             jm1=mod(j+ny-2,ny)+1
> >             jp1=mod(j,ny)+1
> >             do i=1,nx
> >                im1=mod(i+nx-2,nx)+1
> >                ip1=mod(i,nx)+1
> >                do l=1,nb
> >                   y(l,i,j,k)=0.0d0
> >                   do m=1,nb
> >                      y(l,i,j,k)=y(l,i,j,k)+
> >                      ;; ....
> >                   enddo
> >                enddo
> >             enddo
> >          enddo
> 
> to
> 
> >          do j=1,ny
> >             jm1=mod(j+ny-2,ny)+1
> >             jp1=mod(j,ny)+1
> >             do i=1,nx
> >                im1=mod(i+nx-2,nx)+1
> >                ip1=mod(i,nx)+1
> >                do l=1,nb
> >                   y(l,i,j,k)=0.0d0
> >                enddo
> >             enddo
> >          enddo
> >          do j=1,ny
> >             jm1=mod(j+ny-2,ny)+1
> >             jp1=mod(j,ny)+1
> >             do i=1,nx
> >                im1=mod(i+nx-2,nx)+1
> >                ip1=mod(i,nx)+1
> >                do l=1,nb
> >                   do m=1,nb
> >                      y(l,i,j,k)=y(l,i,j,k)+
> >                      ;; ....
> >                   enddo
> >                enddo
> >             enddo
> >          enddo
> 
Yes, this can be done.  For now it's disabled: unless the zeroing stmt is
classified as a builtin partition, it gets fused with the computation because
both share the memory reference y(l,i,j,k).  This step can be done by cost
model changes.  The only problem is that such a cost model change doesn't make
sense on its own (without considering the builtin partition stuff, the two
should be fused, right?)
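
For illustration only (this is not code from the bug report or from GCC), here
is a minimal C sketch of the fused and distributed forms of the nest quoted
above.  The sizes NB/NX/NY, the names kernel_fused and kernel_distributed, and
the body y += a*x standing in for the elided computation are all assumptions.
In both forms every element of y is zeroed before anything is accumulated into
it, so the two produce the same values.

```c
#include <stddef.h>

/* Hypothetical small extents; l is the fastest-varying index, as in y(l,i,j,k). */
enum { NB = 3, NX = 4, NY = 5 };

/* Fused form: the zeroing stmt and the accumulation share the l loop. */
void kernel_fused(double y[NY][NX][NB], double a[NB][NB], double x[NY][NX][NB])
{
    for (size_t j = 0; j < NY; j++)
        for (size_t i = 0; i < NX; i++)
            for (size_t l = 0; l < NB; l++) {
                y[j][i][l] = 0.0;
                for (size_t m = 0; m < NB; m++)
                    y[j][i][l] += a[l][m] * x[j][i][m];  /* assumed body */
            }
}

/* Distributed form: the zeroing stmt gets its own loop nest. */
void kernel_distributed(double y[NY][NX][NB], double a[NB][NB], double x[NY][NX][NB])
{
    /* Nest 1: zeroing only; this is the candidate for memset replacement. */
    for (size_t j = 0; j < NY; j++)
        for (size_t i = 0; i < NX; i++)
            for (size_t l = 0; l < NB; l++)
                y[j][i][l] = 0.0;

    /* Nest 2: accumulation only. */
    for (size_t j = 0; j < NY; j++)
        for (size_t i = 0; i < NX; i++)
            for (size_t l = 0; l < NB; l++)
                for (size_t m = 0; m < NB; m++)
                    y[j][i][l] += a[l][m] * x[j][i][m];  /* assumed body */
}
```

The sketch also shows why fusion looks attractive when builtin partitions are
ignored: the distributed form walks y twice.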

> And then do memset replacement in the first loop.
I guess this step is just as hard as what I mentioned?  We still need to prove
that the loops around the zeroing statement don't leave bubbles (gaps) in memory.
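
To make the "bubble" concern concrete, a rough C sketch under stated
assumptions (this is not GCC's actual analysis): zeroing_is_contiguous is a
hypothetical helper, extent[d] is the declared size of dimension d, trip[d] is
the number of elements the loop over dimension d writes starting from the
first index, and d = 0 is the fastest-varying dimension.  The zeroing loops
collapse into one memset over `depth` levels only if every level below the
outermost collapsed one covers its dimension completely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when the stores made by the `depth`
   innermost zeroing loops form one gap-free block, and then writes the
   number of elements in that block to *count.  Assumes trip[d] <= extent[d]. */
bool zeroing_is_contiguous(const size_t *extent, const size_t *trip,
                           size_t depth, size_t *count)
{
    size_t n = 1;
    for (size_t d = 0; d < depth; d++) {
        /* A partially covered inner dimension leaves a bubble between
           consecutive iterations of the next outer loop. */
        if (d + 1 < depth && trip[d] != extent[d])
            return false;
        n *= trip[d];
    }
    *count = n;
    return true;
}
```

For example, with extents {4,4,5} and trip counts {3,4,5}, only the innermost
loop (depth 1) can become a memset; collapsing further would skip one element
per row.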
> 
> I think the current cost modeling doesn't consider this because
> of the re-use of y.  IIRC this is what my original nest distribution
> patches did.
> 
> This might be doable by just cost model changes?
