[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

amker at gcc dot gnu.org Thu, 18 Jan 2018 01:52:12 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604


--- Comment #10 from amker at gcc dot gnu.org ---
For the record, there is another possible fix.  Here is the loop nest quoted
from gcc/testsuite/gfortran.dg/pr81303.f:

         do j=1,ny
            jm1=mod(j+ny-2,ny)+1
            jp1=mod(j,ny)+1
            do i=1,nx
               im1=mod(i+nx-2,nx)+1
               ip1=mod(i,nx)+1
               do l=1,nb
                  y(l,i,j,k)=0.0d0
                  do m=1,nb
                     y(l,i,j,k)=y(l,i,j,k)+
                     ;; ....
                  enddo
               enddo
            enddo
         enddo

Originally GCC could parallelize the loop nest at the i loop, but now GCC can
only parallelize it at the l loop, because the stmt "y(l,i,j,k)=0.0d0" is
distributed into a memset inside the i loop, and the distributed memset call
can't be analyzed by the data reference analyzer.
One idea is to distribute the stmt at the outer loop j instead, so that at
least we can parallelize at loop level i as before.
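
To make the loop levels concrete, here is a rough C analogue (not GCC code) of
the zeroing stmt for fixed j and k.  The function names, the flattened array,
and the extent_l/stride parameters are assumptions made for illustration:
extent_l stands for the number of l iterations and stride for the distance in
elements between consecutive i rows.

```c
#include <stddef.h>
#include <string.h>

/* Reference semantics: the original i and l loops, one store per element. */
void zero_elementwise(double *y, size_t nx, size_t extent_l, size_t stride)
{
    for (size_t i = 0; i < nx; i++)
        for (size_t l = 0; l < extent_l; l++)
            y[i * stride + l] = 0.0;
}

/* memset inside the i loop: replaces only the l loop, always equivalent. */
void zero_memset_in_i(double *y, size_t nx, size_t extent_l, size_t stride)
{
    for (size_t i = 0; i < nx; i++)
        memset(y + i * stride, 0, extent_l * sizeof(double));
}

/* memset at loop level j: replaces the i and l loops with one call.
   Only equivalent when each row is fully covered, so this sketch
   refuses (returns 0) unless extent_l == stride. */
int zero_memset_in_j(double *y, size_t nx, size_t extent_l, size_t stride)
{
    if (extent_l != stride)
        return 0;
    memset(y, 0, nx * stride * sizeof(double));
    return 1;
}
```

When extent_l is smaller than stride, a single memset would also clear the
elements between rows that the original loops never touch, which is why the
last variant checks the two values first.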

Unfortunately this is not easy.  To distribute it into a memset at loop level j,
we have to prove that the memory range set to zero at each loop level doesn't
leave any bubble (gap) in it.
Given that the array bounds and loop niters are not constant, we need to prove
a non-trivial equality between different expressions.  This needs to be done in
function tree-loop-distribution.c:compute_access_range.  Specifically, in this
function we have:

<bb 2>:
  _1 = *nb_113(D);
  ubound.86_114 = (integer(kind=8)) _1;
  stride.88_115 = MAX_EXPR <ubound.86_114, 0>;

...

<bb 34>:              // thus in loop nest we have _1 > 0
  if (_1 <= 0)
    goto <bb 24>; [15.00%]
  else
    goto <bb 35>; [85.00%]

...

And in the end, we need to prove:

((sizetype) ((unsigned int) _1 + 4294967295) + 1) * 8
 == (sizetype) stride.88_115 * 8

We first need to prove:
((sizetype) ((unsigned int) _1 + 4294967295) + 1)
  == (sizetype) _1
using the pre-condition "_1 > 0".

Then we need to prove: MAX_EXPR <ubound.86_114, 0> == ubound.86_114, again
because of "_1 > 0".
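
The two sub-goals can be checked numerically with a small C model.  This is
only a sketch: it assumes _1 is a 32-bit signed integer and sizetype is a
64-bit unsigned type, and the function names are made up for illustration.  It
proves nothing inside GCC; it only shows that the equality holds for _1 > 0
and breaks without that pre-condition.

```c
#include <stdint.h>

/* Left-hand side without the "* 8":
   (sizetype) ((unsigned int) _1 + 4294967295) + 1
   The unsigned addition wraps modulo 2^32, so it computes _1 - 1. */
uint64_t lhs_elems(int32_t n)
{
    uint32_t niters = (uint32_t) n + 4294967295u;
    return (uint64_t) niters + 1;
}

/* Right-hand side without the "* 8":
   (sizetype) MAX_EXPR <(integer(kind=8)) _1, 0> */
uint64_t rhs_stride(int32_t n)
{
    int64_t ubound = (int64_t) n;
    int64_t stride = ubound > 0 ? ubound : 0;
    return (uint64_t) stride;
}

/* The full equality from the comment, in bytes (8 bytes per element). */
int sizes_match(int32_t n)
{
    return lhs_elems(n) * 8 == rhs_stride(n) * 8;
}
```

For n = 0 the unsigned addition yields 4294967295 rather than -1, so the left
side becomes 2^32 elements while the right side is 0; this is the case that
the "_1 > 0" guard in <bb 34> has to rule out.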

I doubt this can be done (without heavy, messy code) in GCC now.  Or might
there be another way out of this?

Thanks,
