[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-05-20 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

--- Comment #9 from amker at gcc dot gnu.org ---
I have a patch for this and will send it for review.

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-05-11 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

--- Comment #8 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> On May 10, 2016 6:25:57 PM GMT+02:00, "amker at gcc dot gnu.org" wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030
> >
> >--- Comment #6 from amker at gcc dot gnu.org ---
> >It's not only the vectorizer that generates CSE-suboptimal code; PRE and
> >LIM also do this kind of transform.
> 
> In another PR I suggested swapping LIM and PRE to clean up after LIM.  IIRC
> that had some testsuite regressions.
> 
Not really; PRE and LIM do the same transform here.  What we need is to
transform the code below:
  _970 = iy_186 + -2;
  _971 = _970 * 516;
  _979 = iy_186 + -1;
  _980 = _979 * 516;
  _985 = iy_186 * 516;
  _990 = iy_186 + 1;
  _991 = _990 * 516;
  _996 = iy_186 + 2;
  _997 = _996 * 516;
into:
  _x = iy_186 * 516
  _971 = _x - 516 * 2
  _980 = _x - 516
  _985 = _x
  _990 = _x + 516
  _997 = _x + 516 * 2

I remember one way to handle reassociation is to assign different ranks to
constants/SSA_NAMEs and re-associate expressions with respect to those ranks.
GCC creates new expressions from time to time (in this case, in cunroll); we
may be able to do the same re-association when creating those new expressions,
so that they can be handled easily by CSE (or something else).
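
For concreteness, here is a hand-written C sketch of the intended transform
(the 516-element row stride and the five-row access pattern are read off the
dump above; the names W, in and base are illustrative only, not from the PR's
attachment):

  /* A row-major float array with stride 516, accessed at rows iy-2..iy+2.
     The first function computes five independent (iy + k) * W products,
     matching the dump above.  */
  #define W 516

  float naive (const float *in, int ix, int iy)
  {
    return in[(iy - 2) * W + ix] + in[(iy - 1) * W + ix]
         + in[iy * W + ix]
         + in[(iy + 1) * W + ix] + in[(iy + 2) * W + ix];
  }

  /* Re-associated form: a single multiplication iy * W, plus constant
     multiples of W that CSE can share across all five rows.  */
  float reassociated (const float *in, int ix, int iy)
  {
    int base = iy * W + ix;
    return in[base - 2 * W] + in[base - W]
         + in[base]
         + in[base + W] + in[base + 2 * W];
  }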

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-05-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

--- Comment #7 from rguenther at suse dot de ---
On May 10, 2016 6:25:57 PM GMT+02:00, "amker at gcc dot gnu.org" wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030
>
>--- Comment #6 from amker at gcc dot gnu.org ---
>It's not only the vectorizer that generates CSE-suboptimal code; PRE and
>LIM also do this kind of transform.

In another PR I suggested swapping LIM and PRE to clean up after LIM.  IIRC
that had some testsuite regressions.

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-05-10 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

--- Comment #6 from amker at gcc dot gnu.org ---
It's not only the vectorizer that generates CSE-suboptimal code; PRE and LIM
also do this kind of transform.
Compiling the attached example with the command line below

$ ./gcc -S -Ofast -march=haswell pr68030.c -o pr68030.S \
    -fdump-tree-vect-details -fdump-tree-slp -fdump-tree-ivopts-details \
    -fdump-tree-all -fno-tree-vectorize

gives the following dump before IVOPT:

  :
  local_Filter_33 = global_Filters;
  pretmp_887 = global_Output;
  pretmp_889 = global_Input;
  goto ;

  :

  :
  # ix_187 = PHI <_202(3), 2(7)>
  # ivtmp_1065 = PHI 
  _154 = ix_187 + -2;
  _157 = _154 + _971;
  _158 = (long unsigned int) _157;
  _159 = _158 * 4;
  _160 = pretmp_889 + _159;
  _161 = *_160;
  _165 = *local_Filter_33;
  _166 = _161 * _165;
  _170 = ix_187 + -1;
  _173 = _170 + _971;
  _174 = (long unsigned int) _173;
  _175 = _174 * 4;
  _176 = pretmp_889 + _175;
  _177 = *_176;
  _181 = MEM[(float *)local_Filter_33 + 4B];
  _182 = _177 * _181;
  _81 = _166 + _182;
  _189 = ix_187 + _971;
  _190 = (long unsigned int) _189;
  _191 = _190 * 4;
  _192 = pretmp_889 + _191;
  _193 = *_192;
  _197 = MEM[(float *)local_Filter_33 + 8B];
  _198 = _193 * _197;
  _202 = ix_187 + 1;
  _205 = _202 + _971;
  _206 = (long unsigned int) _205;
  _207 = _206 * 4;
  _208 = pretmp_889 + _207;
  _209 = *_208;
  _213 = MEM[(float *)local_Filter_33 + 12B];
  _214 = _209 * _213;
  _218 = ix_187 + 2;
  _221 = _218 + _971;
  _222 = (long unsigned int) _221;
  _223 = _222 * 4;
  _224 = pretmp_889 + _223;
  _225 = *_224;
  _229 = MEM[(float *)local_Filter_33 + 16B];
  _230 = _225 * _229;
  _82 = _214 + _230;
  _67 = _81 + _82;
  _243 = _154 + _980;
  _244 = (long unsigned int) _243;
  _245 = _244 * 4;
  _246 = pretmp_889 + _245;
  _247 = *_246;
  _251 = MEM[(float *)local_Filter_33 + 20B];
  _252 = _247 * _251;
  _259 = _170 + _980;
  _260 = (long unsigned int) _259;
  _261 = _260 * 4;
  _262 = pretmp_889 + _261;
  _263 = *_262;
  _267 = MEM[(float *)local_Filter_33 + 24B];
  _268 = _263 * _267;
  _78 = _252 + _268;
  _275 = ix_187 + _980;
  _276 = (long unsigned int) _275;
  _277 = _276 * 4;
  _278 = pretmp_889 + _277;
  _279 = *_278;
  _283 = MEM[(float *)local_Filter_33 + 28B];
  _284 = _279 * _283;
  _72 = _198 + _284;
  _291 = _202 + _980;
  _292 = (long unsigned int) _291;
  _293 = _292 * 4;
  _294 = pretmp_889 + _293;
  _295 = *_294;
  _299 = MEM[(float *)local_Filter_33 + 32B];
  _300 = _295 * _299;
  _307 = _218 + _980;
  _308 = (long unsigned int) _307;
  _309 = _308 * 4;
  _310 = pretmp_889 + _309;
  _311 = *_310;
  _315 = MEM[(float *)local_Filter_33 + 36B];
  _316 = _311 * _315;
  _79 = _300 + _316;
  _56 = _78 + _79;
  _329 = _154 + _985;
  _330 = (long unsigned int) _329;
  _331 = _330 * 4;
  _332 = pretmp_889 + _331;
  _333 = *_332;
  _337 = MEM[(float *)local_Filter_33 + 40B];
  _338 = _333 * _337;
  _345 = _170 + _985;
  _346 = (long unsigned int) _345;
  _347 = _346 * 4;
  _348 = pretmp_889 + _347;
  _349 = *_348;
  _353 = MEM[(float *)local_Filter_33 + 44B];
  _354 = _349 * _353;
  _75 = _338 + _354;
  _361 = ix_187 + _985;
  _362 = (long unsigned int) _361;
  _363 = _362 * 4;
  _364 = pretmp_889 + _363;
  _365 = *_364;
  _369 = MEM[(float *)local_Filter_33 + 48B];
  _370 = _365 * _369;
  _377 = _202 + _985;
  _378 = (long unsigned int) _377;
  _379 = _378 * 4;
  _380 = pretmp_889 + _379;
  _381 = *_380;
  _385 = MEM[(float *)local_Filter_33 + 52B];
  _386 = _381 * _385;
  _393 = _218 + _985;
  _394 = (long unsigned int) _393;
  _395 = _394 * 4;
  _396 = pretmp_889 + _395;
  _397 = *_396;
  _401 = MEM[(float *)local_Filter_33 + 56B];
  _402 = _397 * _401;
  _76 = _386 + _402;
  _495 = _75 + _76;
  _415 = _154 + _991;
  _416 = (long unsigned int) _415;
  _417 = _416 * 4;
  _418 = pretmp_889 + _417;
  _419 = *_418;
  _423 = MEM[(float *)local_Filter_33 + 60B];
  _424 = _419 * _423;
  _431 = _170 + _991;
  _432 = (long unsigned int) _431;
  _433 = _432 * 4;
  _434 = pretmp_889 + _433;
  _435 = *_434;
  _439 = MEM[(float *)local_Filter_33 + 64B];
  _440 = _435 * _439;
  _572 = _424 + _440;
  _447 = ix_187 + _991;
  _448 = (long unsigned int) _447;
  _449 = _448 * 4;
  _450 = pretmp_889 + _449;
  _451 = *_450;
  _455 = MEM[(float *)local_Filter_33 + 68B];
  _456 = _451 * _455;
  _73 = _370 + _456;
  _65 = _72 + _73;
  _55 = _65 + _67;
  _25 = _55 + _56;
  _19 = _25 + _495;
  _463 = _202 + _991;
  _464 = (long unsigned int) _463;
  _465 = _464 * 4;
  _466 = pretmp_889 + _465;
  _467 = *_466;
  _471 = MEM[(float *)local_Filter_33 + 72B];
  _472 = _467 * _471;
  _479 = _218 + _991;
  _480 = (long unsigned int) _479;
  _481 = _480 * 4;
  _482 = pretmp_889 + _481;
  _483 = *_482;
  _487 = MEM[(float *)local_Filter_33 + 76B];
  _488 = _483 * _487;
  _556 = _472 + _488;
  _20 = _556 + _572;
  _429 = _19 + _20;
  _501 = _154 + _997;
  _502 = (long unsigned int) _501;
  _503 = _502 * 4;
  _504 = pretmp_889 + _503;
  _505 = 

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-04-25 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

amker at gcc dot gnu.org changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #5 from amker at gcc dot gnu.org ---
(In reply to Kirill Yukhin from comment #4)
> Hello Bin,
> Is it possible to handle the issue using the current ivopt?

Not yet, I think.  IIUC, this is the same issue as PR69710.  We need to teach
the vectorizer about CSE opportunities.  I had a scratch patch fixing it just
as mentioned in comment #2; I will revisit it after the work on if-conversion.
According to Richard's comment, I may need to make it a general fix, for
example by sharing this facility among the different optimizers that need to
insert code in the pre-header.

Thanks.

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-04-25 Thread kyukhin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

Kirill Yukhin changed:

   What|Removed |Added

 CC||amker.cheng at gmail dot com

--- Comment #4 from Kirill Yukhin ---
Hello Bin,
Is it possible to handle the issue using the current ivopt?

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-01-28 Thread ienkovich at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

Ilya Enkovich changed:

   What|Removed |Added

 CC||ienkovich at gcc dot gnu.org

--- Comment #2 from Ilya Enkovich ---
(In reply to Richard Biener from comment #1)
> Induction variable optimization is responsible here but it needs some help
> from a CSE.  I proposed adding a late FRE for that some time ago.  The issue
> is that
> the vectorizer creates some redundancies when creating address IVs for the
> vectorized accesses.

Would it be reasonable to cache the results of vect_create_data_ref_ptr in
some way and not create a new pointer each time it is called with the same set
of arguments (except the STMT one)?  This should help make the vector code use
a set of IVs similar to what the scalar code uses.
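
A minimal sketch of the caching idea in plain C (the key fields and the
create_data_ref_ptr stand-in are hypothetical simplifications of
vect_create_data_ref_ptr's real argument list, not GCC internals):

  #include <stdlib.h>

  /* Key: the argument set that should identify an existing pointer IV.  */
  struct dr_key { void *base; long step; int loop_id; };
  struct dr_entry { struct dr_key key; void *ptr; };

  static struct dr_entry cache[64];
  static int n_cached;

  /* Stand-in for the real pointer-IV creation.  */
  static void *create_data_ref_ptr (struct dr_key k)
  {
    (void) k;
    return malloc (1);
  }

  /* Create a pointer IV once per distinct key; reuse it afterwards, so
     repeated calls with the same arguments add no redundant IV.  */
  void *cached_create_data_ref_ptr (struct dr_key k)
  {
    for (int i = 0; i < n_cached; i++)
      if (cache[i].key.base == k.base
          && cache[i].key.step == k.step
          && cache[i].key.loop_id == k.loop_id)
        return cache[i].ptr;
    void *p = create_data_ref_ptr (k);
    if (n_cached < 64)
      cache[n_cached++] = (struct dr_entry) { k, p };
    return p;
  }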

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2016-01-28 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

--- Comment #3 from rguenther at suse dot de ---
On Thu, 28 Jan 2016, ienkovich at gcc dot gnu.org wrote:

> --- Comment #2 from Ilya Enkovich  ---
> (In reply to Richard Biener from comment #1)
> > Induction variable optimization is responsible here but it needs some help
> > from a CSE.  I proposed adding a late FRE for that some time ago.  The issue
> > is that
> > the vectorizer creates some redundancies when creating address IVs for the
> > vectorized accesses.
> 
> Would it be reasonable to cache the results of vect_create_data_ref_ptr in
> some way and not create a new pointer each time it is called with the same
> set of arguments (except the STMT one)?  This should help make the vector
> code use a set of IVs similar to what the scalar code uses.

Yes, I had a prototype patch doing this at some point but we really
need better infrastructure here, not trying to work around it after
the fact.

[Bug tree-optimization/68030] Redundant address calculations in vectorized loop

2015-10-20 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

Richard Biener changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2015-10-20
 CC||rguenth at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener ---
Induction variable optimization is responsible here but it needs some help from
a CSE.  I proposed adding a late FRE for that some time ago.  The issue is that
the vectorizer creates some redundancies when creating address IVs for the
vectorized accesses.
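
To illustrate the kind of redundancy meant here, a hand-written C analogy
(not actual vectorizer output): one pointer IV is materialized per data
reference even when the references share base and step, where a single IV
plus constant offsets would do:

  #include <stddef.h>

  /* Two IVs with identical step: both must be maintained in the loop.  */
  void two_ivs (float *out, const float *in, size_t n)
  {
    const float *p0 = in;      /* pointer IV for the first access */
    const float *p1 = in + 1;  /* redundant IV: same step, base + 4B */
    for (size_t i = 0; i < n; i++)
      {
        out[i] = *p0 + *p1;
        p0++;
        p1++;
      }
  }

  /* One shared IV; the second address is just IV + 4B.  */
  void one_iv (float *out, const float *in, size_t n)
  {
    const float *p = in;
    for (size_t i = 0; i < n; i++)
      {
        out[i] = p[0] + p[1];
        p++;
      }
  }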