>> gcc/
>>         PR tree-optimization/98598
>>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>>         * common.opt (-ftree-loop-mgo): New option.
> 
> Just a quick comment - -ftree-loop-mgo is user-facing and it isn't really a 
> good
> name.  -floop-mgo would be better but still I'd have no idea what this would 
> do.
> 
> I don't have a good suggestion here other than to expand it to
> -floop-gather-memory (?!).

OK, that is better than "mgo"; the abbreviation was only ever meant as a term for development use.

> The option documentation isn't informative either.
> 
> From:
> 
>   outer-loop ()
>     {
>       inner-loop (iter, iter_count)
>         {
>           Type1 v1 = LOAD (iter);
>           Type2 v2 = LOAD (v1);
>           Type3 v3 = LOAD (v2);
>           ...
>           iter = NEXT (iter);
>         }
>     }
> 
> To:
> 
>   typedef struct cache_elem
>     {
>       bool   init;
>       Type1  c_v1;
>       Type2  c_v2;
>       Type3  c_v3;
>     } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> 
>   outer-loop ()
>     {
>       size_t cache_idx = 0;
> 
>       inner-loop (iter, iter_count)
>         {
>           if (!cache_arr[cache_idx].init)
>             {
>               v1 = LOAD (iter);
>               v2 = LOAD (v1);
>               v3 = LOAD (v2);
> 
>               cache_arr[cache_idx].init = true;
>               cache_arr[cache_idx].c_v1 = v1;
>               cache_arr[cache_idx].c_v2 = v2;
>               cache_arr[cache_idx].c_v3 = v3;
>             }
>           else
>             {
>               v1 = cache_arr[cache_idx].c_v1;
>               v2 = cache_arr[cache_idx].c_v2;
>               v3 = cache_arr[cache_idx].c_v3;
>             }
>           ...
>           cache_idx++;
>           iter = NEXT (iter);
>         }
>     }
> 
>   free (cache_arr);
> 
> This is a _very_ special transform.  What it seems to do is
> optimize the dependent loads for outer loop iteration n > 1
> by caching the result(s).  If that's possible then you should
> be able to distribute the outer loop to one doing the caching
> and one using the cache.  Then this transform would be more
> like a traditional array expansion of scalars?  In some cases
> also loop interchange could remove the need for the caching.
> 
> Doing MGO as the very first loop pass thus looks bad, I think
> MGO should be much later, for example after interchange.
> I also think that MGO should work in concert with loop
> distribution (which could have an improved cost model)
> rather than being a separate pass.
> 
> Your analysis phase looks quite expensive, building sth
> like a on-the side representation very closely matching SSA.
> It seems to work from PHI defs to uses, which looks backwards.

I didn't quite catch this point. Would you please explain it in more detail?

> You seem to roll your own dependence analysis code :/  Please
> have a look at loop distribution.
> 
> Also you build an actual structure type for reasons that escape
> me rather than simply accessing the allocated storage at
> appropriate offsets.
> 
> I think simply calling 'calloc' isn't OK because you might need
> aligned storage and because calloc might not be available.
> Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> is large enough for the data you want to place (or perform
> dynamic re-alignment yourself).  We probably want some generic
> middle-end utility to obtain aligned allocated storage at some
> point.
> 
> As said above I think you want to re-do this transform as
> a loop distribution transform.  I think if caching works then
> the loads should be distributable and the loop distribution
> transform should be enhanced to expand the scalars to arrays.

I checked the code of loop distribution, and its trigger strategy seems
very conservative: it currently targets only simple, regular index-based
loops, and cannot handle linked-list traversal, which consists of a
series of discrete memory accesses and is exactly where MGO would matter
a lot. Additionally, for some complicated cases, we could not completely
decompose MGO into two separate loops for "do caching" and "use caching"
respectively. An example:

for (i = 0; i < N; i++)
  {
    for (j = 0; j < i; j++)
      {
        Type1 v1 = LOAD_FN1 (j);
        Type2 v2 = LOAD_FN2 (v1);
        Type3 v3 = LOAD_FN3 (v2);

        ...

        condition = ...
      }

    if (condition)
      break;
  }

We should not cache all N loads in one step, since some of them might be
invalid after "condition" breaks the loops. We have to mix "do caching"
and "use caching", and switch between them dynamically based on the
"init" flag. But loop distribution does have some overlap with MGO in
both analysis and transformation, so we will try to see whether there is
a way to unify them.

Thanks,
Feng
