On 2021/12/6 13:09, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/12/1 18:09, Richard Biener wrote:
>> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2021/11/4 21:00, Richard Biener wrote:
>>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luo...@linux.ibm.com> wrote:
>>>>>
>>>>>
>>>>>> +  while (outmost_loop != loop)
>>>>>> +    {
>>>>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
>>>>>> (outmost_loop)->src,
>>>>>> +                                        loop_preheader_edge 
>>>>>> (cold_loop)->src))
>>>>>> +       cold_loop = outmost_loop;
>>>>>> +      outmost_loop = superloop_at_depth (loop, loop_depth 
>>>>>> (outmost_loop) + 1);
>>>>>> +    }
>>>>>>
>>>>>> could be instead written as
>>>>>>
>>>>>>   coldest_loop = coldest_outermost_loop[loop->num];
>>>>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>>>>     return outermost_loop;
>>>>>>   return coldest_loop;
>>>>>>
>>>>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop 
>>>>>> tree root.
>>>>>> It should be possible to compute such cache in a DFS walk of the loop 
>>>>>> tree
>>>>>> (the loop iterator by default visits in such order).
>>>>>
>>>>>
>>>>> Thanks.  Updated the patch with your suggestion.  Not sure whether it 
>>>>> strictly
>>>>> conforms to your comments.  Though the patch passed all my added 
>>>>> tests(coverage not enough),
>>>>> I am still a bit worried if pre-computed coldest_loop is outside of 
>>>>> outermost_loop, but
>>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>>>>>
>>>>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, 
>>>>> ..., loop],
>>>>>
>>>>> then function find_coldest_out_loop will return a loop NOT accord with our
>>>>> expectation, that should return second_coldest_loop instead of 
>>>>> outermost_loop?
>>>> Hmm, interesting - yes.  I guess the common case will be that the 
>>>> pre-computed
>>>> outermost loop will be the loop at depth 1 since outer loops tend to
>>>> be colder than
>>>> inner loops?  That would then defeat the whole exercise.
>>>
>>> It is not easy to construct such cases, But finally I got below results,
>>>
>>> 1) many cases inner loop is hotter than outer loop, for example:
>>>
>>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
>>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
>>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>>
>>>
>>> 2) But there are also cases inner loop is colder than outer loop, like:
>>>
>>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
>>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
>>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
>>>
>>>
>>>>
>>>> To optimize the common case but not avoiding iteration in the cases we care
>>>> about we could instead cache the next outermost loop that is _not_ colder
>>>> than loop.  So for your [ ... ] example above we'd have> 
>>>> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
>>>> candidate would then be 'second_coldest_loop' and we'd then iterate
>>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
>>>> cold candidate we can compare against?  For the common case we'd
>>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
>>>> simply pick 'outermost_loop'.
>>>
>>> Thanks.  It was difficult to understand, but finally I got to know what you
>>> want to express :)
>>>
>>> We should cache the next loop that is *colder* than loop instead of '_not_ 
>>> colder
>>> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
>>> then it makes sense if the coldest loop is outside of outermost loop, 
>>> continue to
>>> find a colder loop between outermost loop and current loop in
>>> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
>>
>> Heh, looking at the patch - I don't know.
>>
>> To make the calls to bb_colder_than_loop_preheader more obvious can you
>> change that function to
>>
>> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
>> +{
>> +  return bb->count < loop_preheader_edge (loop)->src->count;
>>
>> ?
>>
>> +      if (colder_than_inner_loop[loop->num] != NULL
>> +         && loop_depth (outermost_loop)
>> +              < loop_depth (colder_than_inner_loop[loop->num]))
>> +       return colder_than_inner_loop[loop->num];
>>
>> that is
>>
>> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
>> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
>> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
>>
>> since we index colder_than_inner_loop by loop->num as well this doesn't find 
>> the
>> coldest outer loop of 'loop' within the nest up to outermost_loop,
>> correct?  For the
>> "common" case of each successive outer loop being colder than the immediate
>> inner loop doesn't that result in only hoisting one level in case the
>> coldest outer loop
>> is outside of the outermost loop we can legally hoist to?
>>
>> My suggestion to cache the next outer "hotter" loop was to make the search 
>> for
>> the optimal hoisting position O(1) for the common case (there is no
>> hotter outer loop),
>> allowing to do
>>
>>    hot_loop = next_hot_loop[loop->num];
>>    if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
>>      return outermost_loop;
>>
>> and do the linear search in the other uncommon case (eventually the cache
>> can be exploited to optimize that search as well).
>>
>> The cache should guarantee that for loops between next_hot_loop[loop->num]
>> and loop each loop is colder than its inner loop, so the immediate inner loop
>> of 'hot_loop' is the coldest place in the sub-nest.
>>
>> Basically the cache should make the common case O(1) (well, I _do_ hope
>> that is the common case ;))
>>
>> Btw, can you make the cache array a vec<class loop> * please so we get
>> index checking?
>>
> 
> Thanks, updated with your comments to use hotter_than_inner_loop instead of
> colder_than_inner_loop, it works.
> 
> For typical case like below:
> 
> for (int j = 0; j < m; j++) // Loop 1
>   if (__builtin_expect (y, 0))
>    for (int j2 = 0; j2 < 256; j2++) // Loop 2
>       for (int i = 0; i < m; i++) // Loop 3
>         for (int s = 0; s < m; s++) // Loop 4
>           for (int s1 = 0; s1 < m; s1++) // Loop 5
>             if (__builtin_expect (y, 0))
>               for (int s2 = 0; s2 < m; s2++) // Loop 6
> 
> 
> The output is:
> 
> loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL
> loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1
> loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5
> 
> 
> 
> v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch
> 
> 
> From: Xiong Hu Luo <luo...@linux.ibm.com>
> 
> v8 changes:
> 1. Use hotter_than_inner_loop instead of colder to store a hotter loop
> nearest to loop.
> 2. Update the logic in fill_coldest_and_hotter_out_loop and
> get_coldest_out_loop to make common case O(1).
> 3. Update function argument bb_colder_than_loop_preheader.
> 4. Make cached array to vec<class *loop> for index checking.
> 
> v7 changes:
> 1. Refine get_coldest_out_loop to replace loop with checking
> pre-computed coldest_outermost_loop and colder_than_inner_loop.
> 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> colder_than_inner_loop recursively without loop.
> 
> v6 changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
> 
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
> 
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
> 
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of 
> outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
> 
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
> 
> There was a patch trying to avoid move cold block out of loop:
> 
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
> 
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
> 
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
> 
> gcc/ChangeLog:
> 
>       * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
>       function.
>       (get_coldest_out_loop): New function.
>       (determine_max_movement): Use get_coldest_out_loop.
>       (move_computations_worker): Adjust and fix iteration udpate.
>       (class ref_in_loop_hot_body): New functor.
>       (ref_in_loop_hot_body::operator): New.
>       (can_sm_ref_p): Use for_all_locs_in_loop.
>       (fill_coldest_and_hotter_out_loop): New.
>       (tree_ssa_lim_finalize): Free coldest_outermost_loop and
>       hotter_than_inner_loop.
>       (loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.dg/tree-ssa/recip-3.c: Adjust.
>       * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>       * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>       * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
>       * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
>       * gcc.dg/tree-ssa/ssa-lim-23.c: New test.
> ---
>  gcc/tree-ssa-loop-im.c                     | 150 ++++++++++++++++++++-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c |  21 +++
>  7 files changed, 291 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..0bcc3bca91b 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,11 @@ public:
>  enum dep_kind { lim_raw, sm_war, sm_waw };
>  enum dep_state { dep_unknown, dep_independent, dep_dependent };
> 
> +/* coldest outermost loop for given loop.  */
> +vec<class loop *> coldest_outermost_loop;
> +/* hotter outer loop nearest to given loop.  */
> +vec<class loop *> hotter_than_inner_loop;
> +
>  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
> 
>  static void
> @@ -417,6 +422,61 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
> 
> +/* Compare the profile count inequality of bb and loop's preheader, it is
> +   three-state as stated in profile-count.h, FALSE is returned if inequality
> +   cannot be decided.  */
> +bool
> +bb_colder_than_loop_preheader (basic_block bb, class loop *loop)
> +{
> +  gcc_assert (bb && loop);
> +  return bb->count < loop_preheader_edge (loop)->src->count;
> +}
> +
> +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> +   count.
> +  It does three steps check:
> +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, 
> just
> +  return NULL which means it should not be moved out at all;
> +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> +  hotter loop between OUTERMOST_LOOP and loop in pre-computed
> +  HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return
> +  OUTERMOST_LOOP.  */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> +                   basic_block curr_bb)
> +{
> +  gcc_assert (outermost_loop == loop
> +           || flow_loop_nested_p (outermost_loop, loop));
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> +  if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop))
> +    return NULL;
> +
> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> +    {
> +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
> +      if (!hotter_loop
> +       || loop_depth (hotter_loop) < loop_depth (outermost_loop))
> +     return outermost_loop;
> +
> +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
> +     [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
> +     hotter_loop, second_coldest_loop, ..., loop]
> +     return second_coldest_loop to be the hoist target.  */
> +      class loop *aloop;
> +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
> +     if (flow_loop_nested_p (aloop, loop))

should be:

        if (aloop == loop || flow_loop_nested_p (aloop, loop))


> +       return aloop;
> +    }
> +  return coldest_loop;
> +}
> +


-- 
Thanks,
Xionghu

Reply via email to