[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

rguenther at suse dot de Thu, 18 Jan 2018 02:22:35 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604


--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #12 from amker at gcc dot gnu.org ---
> (In reply to rguent...@suse.de from comment #11)
> > On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> > 
> > 
> > I think the zeroing stmt can be distributed into a separate loop nest
> > > (up to whatever level we choose) and in the then non-parallelized nest
> > the memset can stay at the current level.  So distribute
> > 
> > >          do j=1,ny
> > >             jm1=mod(j+ny-2,ny)+1
> > >             jp1=mod(j,ny)+1
> > >             do i=1,nx
> > >                im1=mod(i+nx-2,nx)+1
> > >                ip1=mod(i,nx)+1
> > >                do l=1,nb
> > >                   y(l,i,j,k)=0.0d0
> > >                   do m=1,nb
> > >                      y(l,i,j,k)=y(l,i,j,k)+
> > >                      ;; ....
> > >                   enddo
> > >                enddo
> > >             enddo
> > >          enddo
> > 
> > to
> > 
> > >          do j=1,ny
> > >             jm1=mod(j+ny-2,ny)+1
> > >             jp1=mod(j,ny)+1
> > >             do i=1,nx
> > >                im1=mod(i+nx-2,nx)+1
> > >                ip1=mod(i,nx)+1
> > >                do l=1,nb
> > >                   y(l,i,j,k)=0.0d0
> > >                enddo
> > >             enddo
> > >          enddo
> > >          do j=1,ny
> > >             jm1=mod(j+ny-2,ny)+1
> > >             jp1=mod(j,ny)+1
> > >             do i=1,nx
> > >                im1=mod(i+nx-2,nx)+1
> > >                ip1=mod(i,nx)+1
> > >                do l=1,nb
> > >                   do m=1,nb
> > >                      y(l,i,j,k)=y(l,i,j,k)+
> > >                      ;; ....
> > >                   enddo
> > >                enddo
> > >             enddo
> > >          enddo
> > 
> Yes, this can be done.  For now it's disabled: unless the zeroing stmt is
> classified as a builtin partition, it is fused because of the shared memory
> reference to y(l,i,j,k).  This step can be enabled by cost model changes.  The
> only problem is that the cost model change doesn't make sense here (without
> considering builtin partition stuff, it should be fused, right?)

It might be profitable to distribute away stores that have no dependent
stmts (thus stores from invariants).

Another heuristic would be to never merge builtin partitions with
other partitions, because loop optimizations do not handle memory
builtins (the data dependence limitation).  That might also be a reason
not to handle those as builtins at all and to revert to a non-builtin
classification.

I suppose implementing both, then looking at which distributions
change because of them on, say, SPEC CPU 2006, and classifying each
change as either good or bad, is the only way we'd know whether such
a cost model change is wanted.
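
To make the two heuristics concrete, here is a toy sketch in C.  It is
not GCC's loop distribution code: struct toy_partition, its fields, and
both functions are hypothetical names invented for illustration, and
share_memory_ref merely stands in for the "shared memory reference"
reason for fusing mentioned above.

```c
#include <stdbool.h>

/* Hypothetical, simplified description of a partition.
   This is NOT the data structure GCC uses. */
struct toy_partition {
    bool is_builtin;             /* classified as a memset/memcpy-like builtin */
    bool stores_invariant_only;  /* only stores loop-invariant values,
                                    i.e. has no dependent stmts */
};

/* Heuristic 1: a store of an invariant with no dependent stmts
   is a candidate for being distributed away. */
static bool distribute_away(const struct toy_partition *p)
{
    return p->stores_invariant_only;
}

/* Heuristic 2: never merge a builtin partition with another partition.
   Otherwise fall back to fusing only when the partitions share a
   memory reference. */
static bool may_fuse(const struct toy_partition *a,
                     const struct toy_partition *b,
                     bool share_memory_ref)
{
    if (a->is_builtin || b->is_builtin)
        return false;
    if (distribute_away(a) || distribute_away(b))
        return false;
    return share_memory_ref;
}
```

Under either heuristic the zeroing partition would stay separate from
the reduction partition even though both reference y(l,i,j,k).  Whether
that is a win in general is exactly what the SPEC comparison would have
to show.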

> > And then do memset replacement in the first loop.
> I guess this step is as hard as what I mentioned?  We still need to prove
> that the loops of the zeroing statement don't leave bubbles in memory.

No, you'd simply have the i and j loops containing a memset...
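
A minimal sketch of that shape, written in C rather than Fortran so it
can be compiled standalone.  Everything here is an assumption made for
illustration: the extents, the padded leading dimension LD (which
models the "bubbles"), and the a*x reduction body, which only stands in
for the part elided in the testcase above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical extents; LD > NB leaves padding after each row of y. */
enum { NB = 5, LD = 8, NX = 4, NY = 3 };

/* Original shape: the zeroing store sits in the same nest as the reduction. */
static void fused(double y[NY][NX][LD], double a[NB][NB], double x[NY][NX][LD])
{
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int l = 0; l < NB; l++) {
                y[j][i][l] = 0.0;
                for (int m = 0; m < NB; m++)
                    y[j][i][l] += a[l][m] * x[j][i][m]; /* stand-in body */
            }
}

/* Distributed shape: a zeroing nest followed by the reduction nest. */
static void distributed(double y[NY][NX][LD], double a[NB][NB], double x[NY][NX][LD])
{
    /* Only the innermost l loop becomes a memset; it stays inside the
       i and j loops, so nothing needs to be proven about the rows
       being contiguous.  All-bits-zero is 0.0 for IEEE 754 doubles. */
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            memset(&y[j][i][0], 0, NB * sizeof(double));

    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int l = 0; l < NB; l++)
                for (int m = 0; m < NB; m++)
                    y[j][i][l] += a[l][m] * x[j][i][m];
}

/* Runs both variants on the same input and checks they agree and that
   the padding elements were left untouched.  Returns 1 on success. */
int check_equivalence(void)
{
    static double a[NB][NB], x[NY][NX][LD], y1[NY][NX][LD], y2[NY][NX][LD];

    for (int l = 0; l < NB; l++)
        for (int m = 0; m < NB; m++)
            a[l][m] = 1.0 + l + 0.5 * m;
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int l = 0; l < LD; l++) {
                x[j][i][l] = j + 2.0 * i + 0.25 * l;
                y1[j][i][l] = y2[j][i][l] = -1.0; /* sentinel */
            }

    fused(y1, a, x);
    distributed(y2, a, x);

    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int l = 0; l < LD; l++) {
                assert(y1[j][i][l] == y2[j][i][l]);
                if (l >= NB)
                    assert(y2[j][i][l] == -1.0); /* padding untouched */
            }
    return 1;
}
```

Calling check_equivalence() from a small main is enough to try it.  The
point is only that a per-row memset is valid whether or not the rows
abut in memory; collapsing the i and j loops into one larger memset
would be the step that needs the contiguity proof.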
