[Bug tree-optimization/68030] Redundant address calculations in vectorized loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68030

--- Comment #9 from amker at gcc dot gnu.org ---
I have a patch for this; I will send it for review.
--- Comment #8 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> In another PR I suggested swapping LIM and PRE to clean up after LIM.
> IIRC that had some testsuite regressions.

Not really; PRE and LIM do the same transform here.  What we need is to
transform:

  _970 = iy_186 + -2;
  _971 = _970 * 516;
  _979 = iy_186 + -1;
  _980 = _979 * 516;
  _985 = iy_186 * 516;
  _990 = iy_186 + 1;
  _991 = _990 * 516;
  _996 = iy_186 + 2;
  _997 = _996 * 516;

into:

  _x   = iy_186 * 516;
  _971 = _x - 516 * 2;
  _980 = _x - 516;
  _985 = _x;
  _991 = _x + 516;
  _997 = _x + 516 * 2;

I remember one way to handle re-association is to assign different ranks to
constants/SSA names and re-associate expressions with respect to those
ranks.  GCC creates new expressions from time to time (in this case, in
cunroll); we may be able to do the same re-association when creating the
new expressions, so they can be easily handled by CSE (or something else).
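The transform comment #8 asks for can be sketched at the C level.  This is
a minimal sketch under assumptions: the helper names row_offset_naive and
row_offset_reassoc are illustrative and do not exist in GCC; 516 is the row
stride taken from the dump.

```c
#include <assert.h>

/* Naive form: after unrolling, each neighbor row carries its own
   multiply (iy + c) * 516, so nothing is shared between rows. */
long row_offset_naive(long iy, long c)
{
    return (iy + c) * 516;
}

/* Re-associated form: hoist the single multiply base = iy * 516 and
   derive each row with a cheap add of a constant.  A CSE pass can then
   commonize `base` across all five rows. */
long row_offset_reassoc(long base, long c)
{
    return base + c * 516;
}
```

Both forms compute the same offsets for every row, which is exactly the
equivalence the re-association relies on.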
--- Comment #7 from rguenther at suse dot de ---
On May 10, 2016 6:25:57 PM GMT+02:00, "amker at gcc dot gnu.org" wrote:
> It's not only the vectorizer generating CSE-suboptimal code; PRE and
> LIM also do this kind of transform.

In another PR I suggested swapping LIM and PRE to clean up after LIM.
IIRC that had some testsuite regressions.
--- Comment #6 from amker at gcc dot gnu.org ---
It's not only the vectorizer generating CSE-suboptimal code; PRE and LIM
also do this kind of transform.

Compiling the attached example with the command line below:

$ ./gcc -S -Ofast -march=haswell pr68030.c -o pr68030.S \
    -fdump-tree-vect-details -fdump-tree-slp -fdump-tree-ivopts-details \
    -fdump-tree-all -fno-tree-vectorize

gives the following dump just before IVOPTs:

  :
  local_Filter_33 = global_Filters;
  pretmp_887 = global_Output;
  pretmp_889 = global_Input;
  goto ;

  :

  :
  # ix_187 = PHI <_202(3), 2(7)>
  # ivtmp_1065 = PHI
  _154 = ix_187 + -2;
  _157 = _154 + _971;
  _158 = (long unsigned int) _157;
  _159 = _158 * 4;
  _160 = pretmp_889 + _159;
  _161 = *_160;
  _165 = *local_Filter_33;
  _166 = _161 * _165;
  _170 = ix_187 + -1;
  _173 = _170 + _971;
  _174 = (long unsigned int) _173;
  _175 = _174 * 4;
  _176 = pretmp_889 + _175;
  _177 = *_176;
  _181 = MEM[(float *)local_Filter_33 + 4B];
  _182 = _177 * _181;
  _81 = _166 + _182;
  _189 = ix_187 + _971;
  _190 = (long unsigned int) _189;
  _191 = _190 * 4;
  _192 = pretmp_889 + _191;
  _193 = *_192;
  _197 = MEM[(float *)local_Filter_33 + 8B];
  _198 = _193 * _197;
  _202 = ix_187 + 1;
  _205 = _202 + _971;
  _206 = (long unsigned int) _205;
  _207 = _206 * 4;
  _208 = pretmp_889 + _207;
  _209 = *_208;
  _213 = MEM[(float *)local_Filter_33 + 12B];
  _214 = _209 * _213;
  _218 = ix_187 + 2;
  _221 = _218 + _971;
  _222 = (long unsigned int) _221;
  _223 = _222 * 4;
  _224 = pretmp_889 + _223;
  _225 = *_224;
  _229 = MEM[(float *)local_Filter_33 + 16B];
  _230 = _225 * _229;
  _82 = _214 + _230;
  _67 = _81 + _82;
  _243 = _154 + _980;
  _244 = (long unsigned int) _243;
  _245 = _244 * 4;
  _246 = pretmp_889 + _245;
  _247 = *_246;
  _251 = MEM[(float *)local_Filter_33 + 20B];
  _252 = _247 * _251;
  _259 = _170 + _980;
  _260 = (long unsigned int) _259;
  _261 = _260 * 4;
  _262 = pretmp_889 + _261;
  _263 = *_262;
  _267 = MEM[(float *)local_Filter_33 + 24B];
  _268 = _263 * _267;
  _78 = _252 + _268;
  _275 = ix_187 + _980;
  _276 = (long unsigned int) _275;
  _277 = _276 * 4;
  _278 = pretmp_889 + _277;
  _279 = *_278;
  _283 = MEM[(float *)local_Filter_33 + 28B];
  _284 = _279 * _283;
  _72 = _198 + _284;
  _291 = _202 + _980;
  _292 = (long unsigned int) _291;
  _293 = _292 * 4;
  _294 = pretmp_889 + _293;
  _295 = *_294;
  _299 = MEM[(float *)local_Filter_33 + 32B];
  _300 = _295 * _299;
  _307 = _218 + _980;
  _308 = (long unsigned int) _307;
  _309 = _308 * 4;
  _310 = pretmp_889 + _309;
  _311 = *_310;
  _315 = MEM[(float *)local_Filter_33 + 36B];
  _316 = _311 * _315;
  _79 = _300 + _316;
  _56 = _78 + _79;
  _329 = _154 + _985;
  _330 = (long unsigned int) _329;
  _331 = _330 * 4;
  _332 = pretmp_889 + _331;
  _333 = *_332;
  _337 = MEM[(float *)local_Filter_33 + 40B];
  _338 = _333 * _337;
  _345 = _170 + _985;
  _346 = (long unsigned int) _345;
  _347 = _346 * 4;
  _348 = pretmp_889 + _347;
  _349 = *_348;
  _353 = MEM[(float *)local_Filter_33 + 44B];
  _354 = _349 * _353;
  _75 = _338 + _354;
  _361 = ix_187 + _985;
  _362 = (long unsigned int) _361;
  _363 = _362 * 4;
  _364 = pretmp_889 + _363;
  _365 = *_364;
  _369 = MEM[(float *)local_Filter_33 + 48B];
  _370 = _365 * _369;
  _377 = _202 + _985;
  _378 = (long unsigned int) _377;
  _379 = _378 * 4;
  _380 = pretmp_889 + _379;
  _381 = *_380;
  _385 = MEM[(float *)local_Filter_33 + 52B];
  _386 = _381 * _385;
  _393 = _218 + _985;
  _394 = (long unsigned int) _393;
  _395 = _394 * 4;
  _396 = pretmp_889 + _395;
  _397 = *_396;
  _401 = MEM[(float *)local_Filter_33 + 56B];
  _402 = _397 * _401;
  _76 = _386 + _402;
  _495 = _75 + _76;
  _415 = _154 + _991;
  _416 = (long unsigned int) _415;
  _417 = _416 * 4;
  _418 = pretmp_889 + _417;
  _419 = *_418;
  _423 = MEM[(float *)local_Filter_33 + 60B];
  _424 = _419 * _423;
  _431 = _170 + _991;
  _432 = (long unsigned int) _431;
  _433 = _432 * 4;
  _434 = pretmp_889 + _433;
  _435 = *_434;
  _439 = MEM[(float *)local_Filter_33 + 64B];
  _440 = _435 * _439;
  _572 = _424 + _440;
  _447 = ix_187 + _991;
  _448 = (long unsigned int) _447;
  _449 = _448 * 4;
  _450 = pretmp_889 + _449;
  _451 = *_450;
  _455 = MEM[(float *)local_Filter_33 + 68B];
  _456 = _451 * _455;
  _73 = _370 + _456;
  _65 = _72 + _73;
  _55 = _65 + _67;
  _25 = _55 + _56;
  _19 = _25 + _495;
  _463 = _202 + _991;
  _464 = (long unsigned int) _463;
  _465 = _464 * 4;
  _466 = pretmp_889 + _465;
  _467 = *_466;
  _471 = MEM[(float *)local_Filter_33 + 72B];
  _472 = _467 * _471;
  _479 = _218 + _991;
  _480 = (long unsigned int) _479;
  _481 = _480 * 4;
  _482 = pretmp_889 + _481;
  _483 = *_482;
  _487 = MEM[(float *)local_Filter_33 + 76B];
  _488 = _483 * _487;
  _556 = _472 + _488;
  _20 = _556 + _572;
  _429 = _19 + _20;
  _501 = _154 + _997;
  _502 = (long unsigned int) _501;
  _503 = _502 * 4;
  _504 = pretmp_889 + _503;
  _505 =
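For reference, the kind of source that produces this GIMPLE is a 5x5 filter
loop.  The sketch below is an assumption and not a reproduction of the
attached pr68030.c: the function name filter_rows is made up, and the row
stride of 516 is inferred from the `* 516` multiplies in the dump.  After
cunroll fully unrolls the two inner loops, each of the 25 taps carries its
own (iy + fy) * W address arithmetic, which is the redundancy shown above.

```c
#define W 516  /* row stride, inferred from the `* 516` multiplies */

float global_Input[W * W];
float global_Output[W * W];
float global_Filters[25];

/* Apply a 5x5 filter over the first `rows` rows (with a 2-element halo).
   Each tap recomputes (iy + fy) * W + (ix + fx); after full unrolling of
   the fy/fx loops these become the _971.._997-based address chains. */
void filter_rows(int rows)
{
    for (int iy = 2; iy < rows - 2; iy++)
        for (int ix = 2; ix < W - 2; ix++) {
            float sum = 0.0f;
            for (int fy = -2; fy <= 2; fy++)
                for (int fx = -2; fx <= 2; fx++)
                    sum += global_Input[(iy + fy) * W + (ix + fx)]
                         * global_Filters[(fy + 2) * 5 + (fx + 2)];
            global_Output[iy * W + ix] = sum;
        }
}
```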
amker at gcc dot gnu.org changed:

           What    |Removed    |Added
  ----------------------------------------------------------
                 CC|           |amker at gcc dot gnu.org

--- Comment #5 from amker at gcc dot gnu.org ---
(In reply to Kirill Yukhin from comment #4)
> Hello Bin,
> Is it possible to handle the issue using the current ivopt?

Not yet, I think.  IIUC, this is the same issue as PR69710.  We need to
teach the vectorizer about CSE opportunities.  I had a scratch patch fixing
it just as mentioned in comment #2; I will revisit it after the work on
if-conversion.  Per Richard's comment, I may need to make it a general fix,
for example sharing this facility among the different optimizers that need
to insert code in the pre-header.  Thanks.
Kirill Yukhin changed:

           What    |Removed    |Added
  ----------------------------------------------------------
                 CC|           |amker.cheng at gmail dot com

--- Comment #4 from Kirill Yukhin ---
Hello Bin,
Is it possible to handle the issue using the current ivopt?
Ilya Enkovich changed:

           What    |Removed    |Added
  ----------------------------------------------------------
                 CC|           |ienkovich at gcc dot gnu.org

--- Comment #2 from Ilya Enkovich ---
(In reply to Richard Biener from comment #1)
> Induction variable optimization is responsible here, but it needs some
> help from a CSE.  I proposed adding a late FRE for that some time ago.
> The issue is that the vectorizer creates some redundancies when creating
> address IVs for the vectorized accesses.

Would it be reasonable to cache the results of vect_create_data_ref_ptr in
some way and not create a new pointer each time it is called with the same
set of arguments (except the STMT one)?  That should help vector code use a
set of IVs similar to what the scalar code uses.
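The caching idea in comment #2 can be illustrated with a toy memo table:
key the pointer-IV request on everything except the statement, and hand
back the previously created IV on a repeat request.  Everything here --
get_addr_iv, the string key, the iv_id stand-in for an SSA name, the
fixed-size cache -- is hypothetical and not the vectorizer's actual
interface.

```c
#include <string.h>

/* One remembered pointer IV, identified by the base object's name and
   the per-iteration step; iv_id stands in for the created SSA name. */
struct iv_entry {
    const char *base;
    long        step;
    int         iv_id;
};

static struct iv_entry iv_cache[32];
static int n_cached;
static int next_iv_id;

/* Return an IV for (base, step), creating a new one only on the first
   request; repeat calls with the same key reuse the existing IV.
   (Sketch only: no overflow handling for more than 32 entries.) */
int get_addr_iv(const char *base, long step)
{
    for (int i = 0; i < n_cached; i++)
        if (strcmp(iv_cache[i].base, base) == 0 && iv_cache[i].step == step)
            return iv_cache[i].iv_id;            /* cache hit: reuse */
    iv_cache[n_cached].base  = base;
    iv_cache[n_cached].step  = step;
    iv_cache[n_cached].iv_id = next_iv_id;
    n_cached++;
    return next_iv_id++;
}
```

With such a cache, the many loads through pretmp_889 in comment #6's dump
would share one pointer IV instead of getting independently created ones.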
--- Comment #3 from rguenther at suse dot de ---
On Thu, 28 Jan 2016, ienkovich at gcc dot gnu.org wrote:
> Would it be reasonable to cache in some way results of
> vect_create_data_ref_ptr and not create a new pointer each time it is
> called with the same set of arguments (except the STMT one)?

Yes, I had a prototype patch doing this at some point, but we really need
better infrastructure here, not to try to work around it after the fact.
Richard Biener changed:

           What    |Removed    |Added
  ----------------------------------------------------------
             Status|UNCONFIRMED|NEW
   Last reconfirmed|           |2015-10-20
                 CC|           |rguenth at gcc dot gnu.org
     Ever confirmed|0          |1

--- Comment #1 from Richard Biener ---
Induction variable optimization is responsible here, but it needs some help
from a CSE.  I proposed adding a late FRE for that some time ago.  The
issue is that the vectorizer creates some redundancies when creating
address IVs for the vectorized accesses.