[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 Andrew Pinski changed: What|Removed |Added Severity|normal |enhancement
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #12 from Jiangning Liu --- MGO RFC is at https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #11 from Jiangning Liu --- (In reply to rguent...@suse.de from comment #8) > On Sat, 9 Jan 2021, jiangning.liu at amperecomputing dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > > --- Comment #7 from Jiangning Liu > com> --- > > (In reply to rguent...@suse.de from comment #6) > > > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing > > > dot com" wrote: > > > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > > > > > >--- Comment #5 from Jiangning Liu > > >com> --- > > > >> It has to be done with care of course, cost modeling is difficult > > > >> (we need to have a good estimate of n and m or need to version > > > >> the whole nest). That said, usually we attempt the reverse > > > >transform. > > > > > > > >Before tuning the cost model good enough, we may implement this > > > >optimization by > > > >adding a new optimization command line option. This won't hurt gcc, > > > >right? > > > > > > New options not enabled by default tend to bitrot, be broken from the > > > start > > > and won't be used by the lazy user. So I see no point in doing that. > > > > > > > Understand. I mean we can enable it by default eventually, but we need to > > implement and tune it step by step. It is unrealistic to work out the best > > cost > > model at the very beginning. > > Sure. The "easiest" thing is to rely on a profile from PGO, we did > have some transforms only enabled by -fprofile-use by default. That is, > the cost model needs to be conservative, esp. if you introduce dynamic > allocation for this. In the end I guess only a variant that versions > the nest on the size of the temporary will be good enough to not trigger > OOM or excessive overhead for small sizes anyway. People usually don't use PGO unless they can't find any better static compiler switches. This optimization should always benefit performance if we can tune the cost model good enough. It is true that the temp memory size needs to be checked to avoid OOM, which is one of the runtime overheads.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #10 from Jiangning Liu --- (In reply to Hongtao.liu from comment #9) > It looks like a SOA/AOC opt opportunity which is discussed in > https://gcc.gnu.org/wiki/ > cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layou > t+Optimizations+of+Structures+and+Objects.pdf > > And i remember there's someone working on enabling SOA/AOS opt in GCC. No. The key difference is the optimization opportunity here doesn't rely on LTO at all. It is purely a local optimization within a function instead.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #9 from Hongtao.liu --- It looks like a SOA/AOC opt opportunity which is discussed in https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layout+Optimizations+of+Structures+and+Objects.pdf And i remember there's someone working on enabling SOA/AOS opt in GCC.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #8 from rguenther at suse dot de --- On Sat, 9 Jan 2021, jiangning.liu at amperecomputing dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > --- Comment #7 from Jiangning Liu > --- > (In reply to rguent...@suse.de from comment #6) > > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing > > dot com" wrote: > > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > > > >--- Comment #5 from Jiangning Liu > >com> --- > > >> It has to be done with care of course, cost modeling is difficult > > >> (we need to have a good estimate of n and m or need to version > > >> the whole nest). That said, usually we attempt the reverse > > >transform. > > > > > >Before tuning the cost model good enough, we may implement this > > >optimization by > > >adding a new optimization command line option. This won't hurt gcc, > > >right? > > > > New options not enabled by default tend to bitrot, be broken from the start > > and won't be used by the lazy user. So I see no point in doing that. > > > > Understand. I mean we can enable it by default eventually, but we need to > implement and tune it step by step. It is unrealistic to work out the best > cost > model at the very beginning. Sure. The "easiest" thing is to rely on a profile from PGO, we did have some transforms only enabled by -fprofile-use by default. That is, the cost model needs to be conservative, esp. if you introduce dynamic allocation for this. In the end I guess only a variant that versions the nest on the size of the temporary will be good enough to not trigger OOM or excessive overhead for small sizes anyway.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #7 from Jiangning Liu --- (In reply to rguent...@suse.de from comment #6) > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing > dot com" wrote: > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > >--- Comment #5 from Jiangning Liu >com> --- > >> It has to be done with care of course, cost modeling is difficult > >> (we need to have a good estimate of n and m or need to version > >> the whole nest). That said, usually we attempt the reverse > >transform. > > > >Before tuning the cost model good enough, we may implement this > >optimization by > >adding a new optimization command line option. This won't hurt gcc, > >right? > > New options not enabled by default tend to bitrot, be broken from the start > and won't be used by the lazy user. So I see no point in doing that. > Understand. I mean we can enable it by default eventually, but we need to implement and tune it step by step. It is unrealistic to work out the best cost model at the very beginning.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #6 from rguenther at suse dot de --- On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing dot com" wrote: >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > >--- Comment #5 from Jiangning Liu com> --- >> It has to be done with care of course, cost modeling is difficult >> (we need to have a good estimate of n and m or need to version >> the whole nest). That said, usually we attempt the reverse >transform. > >Before tuning the cost model good enough, we may implement this >optimization by >adding a new optimization command line option. This won't hurt gcc, >right? New options not enabled by default tend to bitrot, be broken from the start and won't be used by the lazy user. So I see no point in doing that. >> >> My personal opinion is that hinting the user to possibly refactor >> his code (guided by profiling to be not too noisy) is much >> prefered to the idea that the compiler can ever apply such transform >> to the loops where it matters and not to the loops where it is >> harmful. > >Sometimes, it is not always easy for the user to modify the code, and >even the >user may be lazy and reluctant to change the code. This kind of Memory >Gathering Optimization can make end-user's life easier.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #5 from Jiangning Liu --- > It has to be done with care of course, cost modeling is difficult > (we need to have a good estimate of n and m or need to version > the whole nest). That said, usually we attempt the reverse transform. Before tuning the cost model good enough, we may implement this optimization by adding a new optimization command line option. This won't hurt gcc, right? > > My personal opinion is that hinting the user to possibly refactor > his code (guided by profiling to be not too noisy) is much > prefered to the idea that the compiler can ever apply such transform > to the loops where it matters and not to the loops where it is > harmful. Sometimes, it is not always easy for the user to modify the code, and even the user may be lazy and reluctant to change the code. This kind of Memory Gathering Optimization can make end-user's life easier.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #4 from bin cheng --- Didn't go deep into the case. For simple cases taken as examples, it's possible to interchange the two loops thus enables loop invariant code motion. Though loop interchange may fail because of complicated data dependences, we may take some useful points from it, for example, the cost model checking new loop invariants wrto the outer loop.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #3 from rguenther at suse dot de --- On Fri, 8 Jan 2021, jiangning.liu at amperecomputing dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > --- Comment #2 from Jiangning Liu > --- > Loop distribution can only handle very simple case. If the inner loop has > complicated control flow and other memory accesses with loop-carried > dependence, it would be hard to handle it. For example, > > int foo (int n, int m, A *pa) { > int sum; > > for (int i = 0; i < n; i++) { > for (int j = 0; j < m; j++) { > sum += pa[j].pb->pc->val; // each value is repeatedly loaded "n" times > sum = sum % 7; > } > sum = sum % 13; > } > > return sum; > } > > Alternatively, we can detect "invariant" dependent memory loads for the nested > loops with alias conflict checked. If the outer loop is hot enough, we could > have a chance to "hoist" them to create cache. > > As for temp storage, is it a gcc's rule of thumb not to introduce temp storage > on heap, or it is just gcc doesn't have it yet and we want to have it? It has to be done with care of course, cost modeling is difficult (we need to have a good estimate of n and m or need to version the whole nest). That said, usually we attempt the reverse transform. My personal opinion is that hinting the user to possibly refactor his code (guided by profiling to be not too noisy) is much prefered to the idea that the compiler can ever apply such transform to the loops where it matters and not to the loops where it is harmful.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #2 from Jiangning Liu --- Loop distribution can only handle very simple case. If the inner loop has complicated control flow and other memory accesses with loop-carried dependence, it would be hard to handle it. For example, int foo (int n, int m, A *pa) { int sum; for (int i = 0; i < n; i++) { for (int j = 0; j < m; j++) { sum += pa[j].pb->pc->val; // each value is repeatedly loaded "n" times sum = sum % 7; } sum = sum % 13; } return sum; } Alternatively, we can detect "invariant" dependent memory loads for the nested loops with alias conflict checked. If the outer loop is hot enough, we could have a chance to "hoist" them to create cache. As for temp storage, is it a gcc's rule of thumb not to introduce temp storage on heap, or it is just gcc doesn't have it yet and we want to have it?
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 Richard Biener changed: What|Removed |Added Version|tree-ssa|11.0 Status|UNCONFIRMED |NEW Keywords||missed-optimization Last reconfirmed||2021-01-08 CC||amker at gcc dot gnu.org, ||matz at gcc dot gnu.org, ||rguenth at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from Richard Biener --- GCC doesn't have any such capability (introducing temporary storage to allow loop distribution or other transforms). But the transform would fit into loop distribution (or rather after introducing the array loop distribution could distribute the loop - even though it's cost model likely says that's not wanted).