[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2021-05-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-03-19 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

Ramana Radhakrishnan  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-03-16
 CC||ramana at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #16 from Ramana Radhakrishnan  ---
Confirmed then.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-03-07 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #15 from Doug Gilmore  ---
> I had a patch too, will send it for review in GCC7 if it's still needed.
Sorry, I got sidetracked last week and didn't make much progress.

Please go ahead and submit if you have something you feel comfortable with;
I'll assist in testing.

Thanks,

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-03-07 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #14 from amker at gcc dot gnu.org ---
(In reply to Doug Gilmore from comment #13)
> I think this should be fairly straightforward to fix in the
> autovectorization pass.  Hopefully I should be able to post a patch
> in the next few days.

I had a patch too, will send it for review in GCC7 if it's still needed.

Thanks.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-24 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #13 from Doug Gilmore  ---
I think this should be fairly straightforward to fix in the
autovectorization pass.  Hopefully I should be able to post a patch
in the next few days.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-16 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #12 from Doug Gilmore  ---
> Yes, I proposed some cleanup passes after vectorization but richi
> thinks it's generally expensive.  So what's the implementation complexity
> of pass_dominator?
One thing we might consider is only enabling it when vectorization is
run on architectures where cleanup is needed.

I plan to send an RFC comment for my patch to see what objections
there are to that approach, though beforehand I'd like to investigate
what could be done to the vectorizer so that it doesn't generate code
that contains false dependencies.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-14 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #11 from amker at gcc dot gnu.org ---
(In reply to Doug Gilmore from comment #10)
> Created attachment 37681 [details]
> prototype fix
> 
> > 1) we failed to recognize that uses 0 and 2 are identical to each other.
> > This is because the vectorizer generates redundant setup code in the loop
> > pre-header.  There are two possible fixes here.  One is to make
> > expand_simple_operations more aggressive in expanding (used by
> > ivopts) in tree-ssa-loop-niter.c.  But I don't think this is a good
> > idea in all cases, because the expanded, complicated expression makes the
> > ivo transform and niter analysis harder.
> Or something along the lines of the attached patch, tested only on
> the problem at hand.  As it stands it is probably too heavy-handed
> to consider as a possible review candidate.
Yes, I proposed some cleanup passes after vectorization but richi thinks it's
generally expensive.  So what's the implementation complexity of pass_dominator?


> > The other is to fix the vectorizer
> > to generate clean code.  Richard's suggestion is to use gimple_build
> > for that.
> ISTM to be the reasonable approach but I haven't yet investigated
> what's involved.
> > Also the problem exists only for arm because it doesn't support
> > [base+index] addressing mode for vect load/store.  I guess mips
> > doesn't either.
> > 
> Right, MIPS MSA doesn't support [base+index] mode.
> 
> BTW, the reason why IVOPTS works for DP but not SP on MIPS MSA is
> that the code in the pre-header is simpler for DP:
> 
>   :
>   vect_cst__52 = {da_6(D), da_6(D)};
> 
>   :
>   # vectp_dy.8_46 = PHI 
>   # vectp_dx.11_49 = PHI 
>   # vectp_dy.16_55 = PHI 
>   # ivtmp_58 = PHI <0(6), ivtmp_59(12)>
> ...
> which IVOPTS can handle.
Ah, so IMHO the code should be refined before IVO; we shouldn't put too much
pressure that is not directly ivo-related on the ivo transform.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-13 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #10 from Doug Gilmore  ---
Created attachment 37681
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37681&action=edit
prototype fix

> 1) we failed to recognize that uses 0 and 2 are identical to each other.
> This is because the vectorizer generates redundant setup code in the loop
> pre-header.  There are two possible fixes here.  One is to make
> expand_simple_operations more aggressive in expanding (used by
> ivopts) in tree-ssa-loop-niter.c.  But I don't think this is a good
> idea in all cases, because the expanded, complicated expression makes the
> ivo transform and niter analysis harder.
Or something along the lines of the attached patch, tested only on
the problem at hand.  As it stands it is probably too heavy-handed
to consider as a possible review candidate.
> The other is to fix the vectorizer
> to generate clean code.  Richard's suggestion is to use gimple_build
> for that.
ISTM to be the reasonable approach but I haven't yet investigated
what's involved.
> Also the problem exists only for arm because it doesn't support
> [base+index] addressing mode for vect load/store.  I guess mips
> doesn't either.
> 
Right, MIPS MSA doesn't support [base+index] mode.

BTW, the reason why IVOPTS works for DP but not SP on MIPS MSA is
that the code in the pre-header is simpler for DP:

  :
  vect_cst__52 = {da_6(D), da_6(D)};

  :
  # vectp_dy.8_46 = PHI 
  # vectp_dx.11_49 = PHI 
  # vectp_dy.16_55 = PHI 
  # ivtmp_58 = PHI <0(6), ivtmp_59(12)>
...
which IVOPTS can handle.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-10 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #9 from amker at gcc dot gnu.org ---
Also the problem exists only for arm because it doesn't support [base+index]
addressing mode for vect load/store.  I guess mips doesn't either.
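
To make the addressing-mode point concrete, here is a minimal C sketch (not
the attached saxpy.c; all names are illustrative).  With a [base+index] mode
one shared index IV can serve every access; without it, each access stream
ends up carrying its own bumped pointer IV, which is the redundant-IV pattern
discussed in this bug.

/* Minimal sketch, assuming a typical single-precision saxpy kernel.  */

/* Shape IVOPTS can pick when a [base+index] vector addressing mode
   exists: one shared counter, addresses formed as base + i.  */
void saxpy_indexed (int n, float da, const float *dx, float *dy)
{
  for (int i = 0; i < n; i++)
    dy[i] = dy[i] + da * dx[i];
}

/* Roughly what the vectorized loop degenerates to when only plain
   [base] or post-increment addressing is available: one bumped
   pointer IV per access stream, even though the dy load and the dy
   store walk the same array.  */
void saxpy_pointer_ivs (int n, float da, const float *dx, float *dy)
{
  const float *px = dx;       /* IV for the dx load  */
  const float *pyl = dy;      /* IV for the dy load  */
  float *pys = dy;            /* IV for the dy store */
  for (int i = 0; i < n; i++)
    {
      *pys = *pyl + da * *px;
      px++;
      pyl++;
      pys++;
    }
}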

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-10 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #8 from amker at gcc dot gnu.org ---
Reproduced on arm with saxpy.c.  The slp dump is as follows:

  :
  _82 = prologue_after_cost_adjust.7_43 * 4;
  vectp_dy.13_81 = dy_9(D) + _82;
  _87 = prologue_after_cost_adjust.7_43 * 4;
  vectp_dx.16_86 = dx_13(D) + _87;
  vect_cst__91 = {da_6(D), da_6(D), da_6(D), da_6(D)};
  _95 = prologue_after_cost_adjust.7_43 * 4;
  vectp_dy.21_94 = dy_9(D) + _95;

  :
  # vectp_dy.12_83 = PHI 
  # vectp_dx.15_88 = PHI 
  # vectp_dy.20_96 = PHI 
  # ivtmp_99 = PHI <0(13), ivtmp_100(21)>
  vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  vectp_dy.12_84 = vectp_dy.12_83 + 16;
  vectp_dx.15_89 = vectp_dx.15_88 + 16;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp_100 = ivtmp_99 + 1;
  if (ivtmp_100 < bnd.9_53)
goto ;
  else
goto ;

  :
goto ;

IVO recognized below uses:

use 0
  address
  in statement vect__12.14_85 = MEM[(float *)vectp_dy.12_83];

  at position MEM[(float *)vectp_dy.12_83]
  type vector(4) float *
  base vectp_dy.13_81
  step 16
  base object (void *) vectp_dy.13_81
  related candidates 

use 1
  generic
  in statement vectp_dx.15_88 = PHI 

  at position 
  type vector(4) float *
  base vectp_dx.16_86
  step 16
  base object (void *) vectp_dx.16_86
  is a biv
  related candidates 

use 2
  address
  in statement MEM[(float *)vectp_dy.20_96] = vect__17.19_93;

  at position MEM[(float *)vectp_dy.20_96]
  type vector(4) float *
  base vectp_dy.21_94
  step 16
  base object (void *) vectp_dy.21_94
  related candidates 

use 3
  compare
  in statement if (ivtmp_100 < bnd.9_53)

  at position 
  type unsigned int
  base 1
  step 1
  is a biv
  related candidates 

There are two problems:
1) we failed to recognize that uses 0 and 2 are identical to each other.  This
is because the vectorizer generates redundant setup code in the loop
pre-header.  There are two possible fixes here.  One is to make
expand_simple_operations more aggressive in expanding (used by ivopts) in
tree-ssa-loop-niter.c.  But I don't think this is a good idea in all cases,
because the expanded, complicated expression makes the ivo transform and niter
analysis harder.  The other is to fix the vectorizer to generate clean code.
Richard's suggestion is to use gimple_build for that.

2) use 1 is not recognized as an address iv because of the alignment of that
memory reference.
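
As a purely illustrative C analogy for problem 1 (all names here are invented,
and the GIMPLE offsets are in bytes while this sketch works in elements): the
pre-header computes the same offset several times and derives a separate
pointer from each copy, so the dy load (use 0) and the dy store (use 2) end up
with textually different bases even though they address the same memory.
CSE-ing or re-expanding that setup code, or having the vectorizer emit it only
once, would let IVOPTS see a single shared base.

#include <stddef.h>

/* What the vectorizer's pre-header effectively emits today.  */
void setup_as_emitted (float *dy, ptrdiff_t prologue_iters,
                       float **load_base, float **store_base)
{
  ptrdiff_t off_a = prologue_iters;   /* plays the role of _82 */
  *load_base = dy + off_a;            /* plays the role of vectp_dy.13_81 */
  ptrdiff_t off_b = prologue_iters;   /* plays the role of _95, same value */
  *store_base = dy + off_b;           /* plays the role of vectp_dy.21_94 */
  /* Comparing the two base names sees distinct pointers here.  */
}

/* The cleaned-up form the proposed fixes aim for.  */
void setup_cleaned (float *dy, ptrdiff_t prologue_iters,
                    float **load_base, float **store_base)
{
  float *base = dy + prologue_iters;  /* the offset is computed once */
  *load_base = base;
  *store_base = base;
  /* Now uses 0 and 2 trivially share one base and step.  */
}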

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-09 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #7 from amker at gcc dot gnu.org ---
Hmm, the first problem is that the two iv uses from the dy load/store are not
recognized as having the same base address/object.  This may be caused by my
patch disabling expansion of the iv base, or it may have existed all along.
This isn't critical; we can transform it on aarch64 anyway.
A more important problem is that the dx mem ref is not recognized as an
address-type use.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-09 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

amker at gcc dot gnu.org changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #6 from amker at gcc dot gnu.org ---
I will have a look at this one.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-06 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #5 from Doug Gilmore  ---
Thanks for checking on AArch64, Andrew.

BTW, I made my (incorrect) hunch by running a test on gcc113, where
the installed 4.8 compiler showed problems for both DP and SP.  (I
assumed that the problem was addressed for DP since we don't see it on
MIPS for DP at ToT with the MSA patch applied.)

For Neon after ivopts I see:

  :
  # vectp_dy.20_96 = PHI 
  # ivtmp.22_78 = PHI <0(13), ivtmp.22_77(21)>
  # ivtmp.26_112 = PHI 
  # ivtmp.31_153 = PHI 
  vectp_dx.15_88 = (vector(4) float *) ivtmp.26_112;
  _156 = (void *) ivtmp.31_153;
  vect__12.14_85 = MEM[base: _156, offset: 0B];
  ivtmp.31_154 = ivtmp.31_153 + 16;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  MEM[base: vectp_dy.20_96, offset: 0B] = vect__17.19_93;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp.22_77 = ivtmp.22_78 + 1;
  ivtmp.26_111 = ivtmp.26_112 + 16;
  if (ivtmp.22_77 < bnd.9_53)
goto ;
  else
goto ;
...
  :
  goto ;

So the problem is indeed exposed on Neon.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-06 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization
  Component|tree-optimization   |rtl-optimization

--- Comment #4 from Andrew Pinski  ---
For me, the problem is only at the RTL level, for the scheduling issue.