[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-12-23 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

Andrew Pinski  changed:

   What|Removed |Added

   Severity|normal  |enhancement

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-14 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #12 from Jiangning Liu  
---
MGO RFC is at https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-11 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #11 from Jiangning Liu  
---
(In reply to rguent...@suse.de from comment #8)
> On Sat, 9 Jan 2021, jiangning.liu at amperecomputing dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> > 
> > --- Comment #7 from Jiangning Liu  > com> ---
> > (In reply to rguent...@suse.de from comment #6)
> > > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing
> > > dot com"  wrote:
> > > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> > > >
> > > >--- Comment #5 from Jiangning Liu  > > >com> ---
> > > >> It has to be done with care of course, cost modeling is difficult
> > > >> (we need to have a good estimate of n and m or need to version
> > > >> the whole nest).  That said, usually we attempt the reverse
> > > >transform.
> > > >
> > > >Before tuning the cost model good enough, we may implement this
> > > >optimization by
> > > >adding a new optimization command line option. This won't hurt gcc,
> > > >right?
> > > 
> > > New options not enabled by default tend to bitrot, be broken from the 
> > > start
> > > and won't be used by the lazy user. So I see no point in doing that. 
> > > 
> > 
> > Understand. I mean we can enable it by default eventually, but we need to
> > implement and tune it step by step. It is unrealistic to work out the best 
> > cost
> > model at the very beginning.
> 
> Sure.  The "easiest" thing is to rely on a profile from PGO, we did
> have some transforms only enabled by -fprofile-use by default.  That is,
> the cost model needs to be conservative, esp. if you introduce dynamic
> allocation for this.  In the end I guess only a variant that versions
> the nest on the size of the temporary will be good enough to not trigger
> OOM or excessive overhead for small sizes anyway.

People usually don't use PGO unless they can't find any better static compiler
switches. This optimization should always benefit performance if we can tune
the cost model good enough. It is true that the temp memory size needs to be
checked to avoid OOM, which is one of the runtime overheads.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-11 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #10 from Jiangning Liu  
---
(In reply to Hongtao.liu from comment #9)
> It looks like a SOA/AOC opt opportunity which is discussed in
> https://gcc.gnu.org/wiki/
> cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layou
> t+Optimizations+of+Structures+and+Objects.pdf
> 
> And i remember there's someone working on enabling SOA/AOS opt in GCC.

No. The key difference is the optimization opportunity here doesn't rely on LTO
at all. It is purely a local optimization within a function instead.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-11 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #9 from Hongtao.liu  ---
It looks like a SOA/AOC opt opportunity which is discussed in
https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layout+Optimizations+of+Structures+and+Objects.pdf

And i remember there's someone working on enabling SOA/AOS opt in GCC.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-10 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #8 from rguenther at suse dot de  ---
On Sat, 9 Jan 2021, jiangning.liu at amperecomputing dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> 
> --- Comment #7 from Jiangning Liu  
> ---
> (In reply to rguent...@suse.de from comment #6)
> > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing
> > dot com"  wrote:
> > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> > >
> > >--- Comment #5 from Jiangning Liu  > >com> ---
> > >> It has to be done with care of course, cost modeling is difficult
> > >> (we need to have a good estimate of n and m or need to version
> > >> the whole nest).  That said, usually we attempt the reverse
> > >transform.
> > >
> > >Before tuning the cost model good enough, we may implement this
> > >optimization by
> > >adding a new optimization command line option. This won't hurt gcc,
> > >right?
> > 
> > New options not enabled by default tend to bitrot, be broken from the start
> > and won't be used by the lazy user. So I see no point in doing that. 
> > 
> 
> Understand. I mean we can enable it by default eventually, but we need to
> implement and tune it step by step. It is unrealistic to work out the best 
> cost
> model at the very beginning.

Sure.  The "easiest" thing is to rely on a profile from PGO, we did
have some transforms only enabled by -fprofile-use by default.  That is,
the cost model needs to be conservative, esp. if you introduce dynamic
allocation for this.  In the end I guess only a variant that versions
the nest on the size of the temporary will be good enough to not trigger
OOM or excessive overhead for small sizes anyway.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-09 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #7 from Jiangning Liu  ---
(In reply to rguent...@suse.de from comment #6)
> On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing
> dot com"  wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> >
> >--- Comment #5 from Jiangning Liu  >com> ---
> >> It has to be done with care of course, cost modeling is difficult
> >> (we need to have a good estimate of n and m or need to version
> >> the whole nest).  That said, usually we attempt the reverse
> >transform.
> >
> >Before tuning the cost model good enough, we may implement this
> >optimization by
> >adding a new optimization command line option. This won't hurt gcc,
> >right?
> 
> New options not enabled by default tend to bitrot, be broken from the start
> and won't be used by the lazy user. So I see no point in doing that. 
> 

Understand. I mean we can enable it by default eventually, but we need to
implement and tune it step by step. It is unrealistic to work out the best cost
model at the very beginning.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-09 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #6 from rguenther at suse dot de  ---
On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing dot
com"  wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
>
>--- Comment #5 from Jiangning Liu com> ---
>> It has to be done with care of course, cost modeling is difficult
>> (we need to have a good estimate of n and m or need to version
>> the whole nest).  That said, usually we attempt the reverse
>transform.
>
>Before tuning the cost model good enough, we may implement this
>optimization by
>adding a new optimization command line option. This won't hurt gcc,
>right?

New options not enabled by default tend to bitrot, be broken from the start and
won't be used by the lazy user. So I see no point in doing that. 

>> 
>> My personal opinion is that hinting the user to possibly refactor
>> his code (guided by profiling to be not too noisy) is much
>> prefered to the idea that the compiler can ever apply such transform
>> to the loops where it matters and not to the loops where it is
>> harmful.
>
>Sometimes, it is not always easy for the user to modify the code, and
>even the
>user may be lazy and reluctant to change the code. This kind of Memory
>Gathering Optimization can make end-user's life easier.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #5 from Jiangning Liu  ---
> It has to be done with care of course, cost modeling is difficult
> (we need to have a good estimate of n and m or need to version
> the whole nest).  That said, usually we attempt the reverse transform.

Before tuning the cost model good enough, we may implement this optimization by
adding a new optimization command line option. This won't hurt gcc, right?

> 
> My personal opinion is that hinting the user to possibly refactor
> his code (guided by profiling to be not too noisy) is much
> prefered to the idea that the compiler can ever apply such transform
> to the loops where it matters and not to the loops where it is
> harmful.

Sometimes, it is not always easy for the user to modify the code, and even the
user may be lazy and reluctant to change the code. This kind of Memory
Gathering Optimization can make end-user's life easier.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #4 from bin cheng  ---
Didn't go deep into the case.

For simple cases taken as examples, it's possible to interchange the two loops
thus enables loop invariant code motion.  Though loop interchange may fail
because of complicated data dependences, we may take some useful points from
it, for example, the cost model checking new loop invariants wrto the outer
loop.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #3 from rguenther at suse dot de  ---
On Fri, 8 Jan 2021, jiangning.liu at amperecomputing dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> 
> --- Comment #2 from Jiangning Liu  
> ---
> Loop distribution can only handle very simple case. If the inner loop has
> complicated control flow and other memory accesses with loop-carried
> dependence, it would be hard to handle it. For example,
> 
> int foo (int n, int m, A *pa) {
>   int sum;
> 
>   for (int i = 0; i < n; i++) {
> for (int j = 0; j < m; j++) {
>   sum += pa[j].pb->pc->val;  // each value is repeatedly loaded "n" times
>   sum = sum % 7;
> }
> sum = sum % 13;
>   }
> 
>   return sum;
> }
> 
> Alternatively, we can detect "invariant" dependent memory loads for the nested
> loops with alias conflict checked. If the outer loop is hot enough, we could
> have a chance to "hoist" them to create cache.
> 
> As for temp storage, is it a gcc's rule of thumb not to introduce temp storage
> on heap, or it is just gcc doesn't have it yet and we want to have it?

It has to be done with care of course, cost modeling is difficult
(we need to have a good estimate of n and m or need to version
the whole nest).  That said, usually we attempt the reverse transform.

My personal opinion is that hinting the user to possibly refactor
his code (guided by profiling to be not too noisy) is much
prefered to the idea that the compiler can ever apply such transform
to the loops where it matters and not to the loops where it is
harmful.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #2 from Jiangning Liu  ---
Loop distribution can only handle very simple case. If the inner loop has
complicated control flow and other memory accesses with loop-carried
dependence, it would be hard to handle it. For example,

int foo (int n, int m, A *pa) {
  int sum;

  for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
  sum += pa[j].pb->pc->val;  // each value is repeatedly loaded "n" times
  sum = sum % 7;
}
sum = sum % 13;
  }

  return sum;
}

Alternatively, we can detect "invariant" dependent memory loads for the nested
loops with alias conflict checked. If the outer loop is hot enough, we could
have a chance to "hoist" them to create cache.

As for temp storage, is it a gcc's rule of thumb not to introduce temp storage
on heap, or it is just gcc doesn't have it yet and we want to have it?

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

Richard Biener  changed:

   What|Removed |Added

Version|tree-ssa|11.0
 Status|UNCONFIRMED |NEW
   Keywords||missed-optimization
   Last reconfirmed||2021-01-08
 CC||amker at gcc dot gnu.org,
   ||matz at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
GCC doesn't have any such capability (introducing temporary storage to allow
loop distribution or other transforms).  But the transform would fit into loop
distribution (or rather after introducing the array loop distribution could
distribute the loop - even though it's cost model likely says that's not
wanted).