>> This patch implements a new loop optimization according to the proposal
>> in the RFC given at
>> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
>> so the idea is not repeated in this mail. I hope for your comments on it.
> 
> With the caveat that I'm not an optimization expert (but no one else
> seems to have replied), here are some thoughts.
> 
> [...snip...]
> 
>> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
>>  [PR98598]
> 
> BTW, did you mean to also post patches 2 and 3?
>

Not yet, but they are ready. Since this is a special kind of optimization that
uses the heap as temporary storage, which is not a common practice in gcc,
we do not know the community's basic attitude towards it. So only the first
patch was sent out for initial comments: it implements a generic MGO
framework, and is complete and self-contained. The other two patches just
add some enhancements for a specific code pattern and a dynamic alias
check. If the proposal is accepted in principle, we will submit the other two
for review.

> 
>> In nested loops, if scattered memory accesses inside the inner loop remain
>> unchanged in the outer loop, we can sequentialize these loads by caching
>> their values in a temporary memory region the first time around, and
>> reuse the cached data in the following iterations. This can improve the
>> efficiency of the cpu cache subsystem by reducing its unpredictable activity.
> 
> I don't think you've cited any performance numbers so far.  Does the
> optimization show a measurable gain on some benchmark(s)?  e.g. is this
> ready to run SPEC yet, and how does it do?

Yes, we have done that. A minor improvement of several percentage points
can be gained for some real applications. To be specific, we also get a major
improvement of more than 30% for a certain benchmark in SPEC2017.

> 
>> To illustrate what the optimization will do, two pieces of pseudo code,
>> before and after the transformation, are given. Suppose all loads and
>> "iter_count" are invariant in the outer loop.
>>
>> From:
>>
>>   outer-loop ()
>>     {
>>       inner-loop (iter, iter_count)
>>         {
>>           Type1 v1 = LOAD (iter);
>>           Type2 v2 = LOAD (v1);
>>           Type3 v3 = LOAD (v2);
>>           ...
>>           iter = NEXT (iter);
>>         }
>>     }
>>
>> To:
>>
>>   typedef struct cache_elem
>>     {
>>       bool   init;
>>       Type1  c_v1;
>>       Type2  c_v2;
>>       Type3  c_v3;
> 
> Putting the "bool init;" at the front made me think "what about
> packing?" but presumably the idea is that every element is accessed in
> order, so it presumably benefits speed to have "init" at the top of the
> element, right?

Yes, the layout of the struct could be optimized in terms of size by
some means, such as:
  o. packing "init" into a padding hole after a certain field
  o. if a certain field is of pointer type, that field can take the role of "init"
      (non-NULL implies "initialized")
For now this simple scheme is straightforward, and it will be enhanced
in various aspects later.

>>     } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

> What if the allocation fails at runtime?  Do we keep an unoptimized
> copy of the nested loops around as a fallback and have an unlikely
> branch to that copy?

Yes, we should. But we do it in a different way: a flag is added to the
original nested loop to switch at runtime between the optimized and
unoptimized execution paths. This definitely incurs a runtime cost, but
avoids possible code-size bloat. A better handling, as a TODO, is to
apply the dynamic switch for large loops and loop cloning for small ones.

> I notice that you're using calloc, presumably to clear all of the
> "init" flags (and the whole buffer).
> 
> FWIW, this feels like a case where it would be nice to have a thread-
> local heap allocation, perhaps something like an obstack implemented in
> the standard library - but that's obviously scope creep for this.

Yes, that would be good, especially for heavily multithreaded applications.

> Could it make sense to use alloca for small allocations?  (or is that
> scope creep?)

We did consider using alloca as you said.  But if we cannot determine an
upper limit for a non-constant size, we have to place the alloca inside a
loop that encloses the nested loop. Without a corresponding free operation,
this kind of alloca-in-loop might cause stack overflow. So it has become
another TODO.

>>   outer-loop ()
>>     {
>>       size_t cache_idx = 0;
>>
>>       inner-loop (iter, iter_count)
>>         {
>>           if (!cache_arr[cache_idx].init)
>>             {
>>               v1 = LOAD (iter);
>>               v2 = LOAD (v1);
>>               v3 = LOAD (v2);
>>
>>               cache_arr[cache_idx].init = true;
>>               cache_arr[cache_idx].c_v1 = v1;
>>               cache_arr[cache_idx].c_v2 = v2;
>>               cache_arr[cache_idx].c_v3 = v3;
>>             }
>>           else
>>             {
>>               v1 = cache_arr[cache_idx].c_v1;
>>               v2 = cache_arr[cache_idx].c_v2;
>>               v3 = cache_arr[cache_idx].c_v3;
>>             }
>>           ...
>>           cache_idx++;
>>           iter = NEXT (iter);
>>         }
>>     }
>>
>>   free (cache_arr);
> 
> I see that the pass puts a "free" on every CFG exit from the loop.
> 
> What about "longjmp" and C++ exceptions?  Are we guaranteed that "free"
> will happen for those ways of exiting the loop?  (should have test
> cases for this, I think)
> 

We simply disable the optimization if a loop contains an abnormal or
EH exit.

> [...]
> 
> 
>> 2020-12-25  Feng Xue  <f...@os.amperecomputing.com>
>>
>> gcc/
>>       PR tree-optimization/98598
>>       * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>>       * common.opt (-ftree-loop-mgo): New option.
>>       * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
>>       * params.opt (mgo-max-dep-load-level): New parameter.
>>       (mgo-max-cache-elem-size): Likewise.
>>       (mgo-min-cache-array-length): Likewise.
>>       (mgo-max-cache-array-length): Likewise.
>>       * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
>>       (mgo-max-cache-elem-size): Likewise.
>>       (mgo-min-cache-array-length): Likewise.
>>       (mgo-max-cache-array-length): Likewise.
> 
> Nit: also need to document "-ftree-loop-mgo".

OK.

> 
>>       * passes.def: Add new pass_loop_mgo pass.
>>       * timevar.def (TV_LOOP_MGO): New timevar.
>>       * tree-pass.h (make_pass_loop_mgo): New declaration.
>>       * tree-ssa-loop-mgo.c: New file.
> 
> Nit: new C++ source files should have a .cc extension, rather than .c

OK.

> 
>> gcc/testsuite/
>>       PR tree-optimization/98598
>>       * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
>>       * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
>>       * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
>>       * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
>>       * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.
> 
> [...]
> 
> Presumably the optimization can't happen if the underlying data gets
> modified during the iteration.

Yes. Any data to be gathered should be read-only in the loop.

> I briefly looked through the test cases, but I didn't see any/(much?)
> test coverage for loops that write back to the data, or call functions
> that might do that.  Sorry if I missed some, but clearly we need to
> test that we don't incorrectly optimize those cases.

If we cannot statically deduce whether the data might be written, we
add a dynamic alias check to detect write hazards; this has been
implemented in another patch. Test cases covering data writes and
function calls are also included in it.

> What happens in a multithreaded program in which another thread might
> be writing to the data?  What are we allowed to optimize?

For data accesses in a multi-threaded program, we assume the programmer
knows how to avoid data races, by using synchronization primitive calls or
the "volatile" specifier to annotate the related data. If a non-pure call or a
"volatile" memory access is found, the optimization will be suppressed.

> Are there some C++ examples?  Fortran?

OK. Will add some.

> It would be great to be able to inject calloc failures when testing
> this, but obviously that's out of scope for such a patch.

OK. This is a TODO.

> Hope this is constructive

It definitely is, thanks for your comments.

Feng
