[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

rguenther at suse dot de Thu, 18 Jan 2018 02:03:53 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604


--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #10 from amker at gcc dot gnu.org ---
> For the record, there is another possible fix.  Quoted loop nest from
> gcc/testsuite/gfortran.dg/pr81303.f:
> 
>          do j=1,ny
>             jm1=mod(j+ny-2,ny)+1
>             jp1=mod(j,ny)+1
>             do i=1,nx
>                im1=mod(i+nx-2,nx)+1
>                ip1=mod(i,nx)+1
>                do l=1,nb
>                   y(l,i,j,k)=0.0d0
>                   do m=1,nb
>                      y(l,i,j,k)=y(l,i,j,k)+
>                      ;; ....
>                   enddo
>                enddo
>             enddo
>          enddo
> 
> Originally GCC could parallelize the loop nest at the i loop, but now GCC only
> parallelizes it at the l loop because the stmt "y(l,i,j,k)=0.0d0" is
> distributed into a memset in the i loop.  As a result the distributed memset
> call can't be analyzed by the data reference analyzer.
> An idea is to distribute the stmt to outer loop j, so at least we can
> parallelize at loop level i as before.
> 
> Unfortunately this is not easy.  To distribute it into memset at loop level j,
> we have to prove that memory range set to ZERO at each loop level doesn't leave
> any bubble in it.
> Given the array bound and loop niters are not constant, we need to prove
> a non-trivial equality for difference expressions.  This needs to be done in
> function tree-loop-distribution.c:compute_access_range.  Specifically in this
> function we have:
> 
> <bb 2>:
>   _1 = *nb_113(D);
>   ubound.86_114 = (integer(kind=8)) _1;
>   stride.88_115 = MAX_EXPR <ubound.86_114, 0>;
> 
> ...
> 
> <bb 34>:              // thus in loop nest we have _1 > 0
>   if (_1 <= 0)
>     goto <bb 24>; [15.00%]
>   else
>     goto <bb 35>; [85.00%]
> 
> ...
> 
> And in the end, we need to prove:
> 
> ((sizetype) ((unsigned int) _1 + 4294967295) + 1) * 8
>  == (sizetype) stride.88_115 * 8
> 
> We first need to prove:
> ((sizetype) ((unsigned int) _1 + 4294967295) + 1)
>   == (sizetype) _1
> using pre-condition "_1 > 0"
> 
> Then need to prove: MAX_EXPR <ubound.86_114, 0> == ubound.86_114 also because
> of "_1 > 0".
> 
> I doubt this can be done (without heavy messy code) in GCC now.  Or there might
> be another way out of this?
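
The equality in the quoted comment can be spot-checked outside GCC.  The
following C sketch is not GCC code: the function names are hypothetical,
and it assumes _1 is a 32-bit integer, ubound is 64-bit (integer(kind=8))
and sizetype behaves like a 64-bit size_t.  It only evaluates both sides
for a given value; it does not prove anything symbolically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Left-hand side: ((sizetype) ((unsigned int) _1 + 4294967295) + 1) * 8.
   Adding 4294967295 in 32-bit unsigned arithmetic is the same as
   subtracting 1 modulo 2^32, so this is "niters + 1" elements of 8 bytes. */
size_t memset_bytes(int32_t n)
{
    uint32_t niter = (uint32_t)n + 4294967295u;
    return ((size_t)niter + 1) * 8;
}

/* Right-hand side: (sizetype) MAX_EXPR <ubound, 0> * 8, with ubound being
   _1 widened to 64 bits. */
size_t stride_bytes(int32_t n)
{
    int64_t ubound = (int64_t)n;
    int64_t stride = ubound > 0 ? ubound : 0;
    return (size_t)stride * 8;
}

/* Returns 1 when both sides agree for this value of _1. */
int sizes_match(int32_t n)
{
    return memset_bytes(n) == stride_bytes(n);
}
```

With a 64-bit size_t, sizes_match(n) holds for n > 0 but fails for n == 0,
where the unsigned subtraction wraps; that is why the "_1 > 0"
pre-condition is needed for both steps.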

I think the zeroing stmt can be distributed into a separate loop nest
(up to whatever level we choose), and in that nest, which is then not
parallelized, the memset can stay at the current level.  So distribute

>          do j=1,ny
>             jm1=mod(j+ny-2,ny)+1
>             jp1=mod(j,ny)+1
>             do i=1,nx
>                im1=mod(i+nx-2,nx)+1
>                ip1=mod(i,nx)+1
>                do l=1,nb
>                   y(l,i,j,k)=0.0d0
>                   do m=1,nb
>                      y(l,i,j,k)=y(l,i,j,k)+
>                      ;; ....
>                   enddo
>                enddo
>             enddo
>          enddo

to

>          do j=1,ny
>             jm1=mod(j+ny-2,ny)+1
>             jp1=mod(j,ny)+1
>             do i=1,nx
>                im1=mod(i+nx-2,nx)+1
>                ip1=mod(i,nx)+1
>                do l=1,nb
>                   y(l,i,j,k)=0.0d0
>                enddo
>             enddo
>          enddo
>          do j=1,ny
>             jm1=mod(j+ny-2,ny)+1
>             jp1=mod(j,ny)+1
>             do i=1,nx
>                im1=mod(i+nx-2,nx)+1
>                ip1=mod(i,nx)+1
>                do l=1,nb
>                   do m=1,nb
>                      y(l,i,j,k)=y(l,i,j,k)+
>                      ;; ....
>                   enddo
>                enddo
>             enddo
>          enddo

And then do memset replacement in the first loop.
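
A minimal C sketch of that transformation, to show the two forms compute
the same thing.  The extents, the term() function standing in for the
elided "y = y + ..." expression, and all other names are hypothetical,
and it assumes all-zero bytes represent 0.0 (true for IEEE 754 doubles).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NB = 3, NX = 4, NY = 5 };   /* hypothetical small extents */

/* Hypothetical stand-in for the elided accumulation term. */
double term(int l, int m, int i, int j)
{
    return (double)(l + 1) * (m + 1) + i - j;
}

/* Fused form: zeroing and accumulation share one nest. */
void fused(double y[NY][NX][NB])
{
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int l = 0; l < NB; l++) {
                y[j][i][l] = 0.0;
                for (int m = 0; m < NB; m++)
                    y[j][i][l] += term(l, m, i, j);
            }
}

/* Distributed form: the zeroing nest comes first, with its l loop
   replaced by a memset at the i level; the accumulation nest follows
   and keeps the full j/i/l/m structure. */
void distributed(double y[NY][NX][NB])
{
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            memset(y[j][i], 0, sizeof y[j][i]);

    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            for (int l = 0; l < NB; l++)
                for (int m = 0; m < NB; m++)
                    y[j][i][l] += term(l, m, i, j);
}

/* Runs both forms on differently pre-filled arrays and compares them. */
int same_result(void)
{
    double a[NY][NX][NB], b[NY][NX][NB];
    memset(a, 0xff, sizeof a);
    memset(b, 0x7f, sizeof b);
    fused(a);
    distributed(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The second nest in distributed() contains no memset call, so nothing in
it hides the accesses to y behind a library call.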

I think the current cost modeling doesn't consider this because
of the re-use of y.  IIRC this is what my original nest distribution
patches did.

This might be doable by just cost model changes?
