Re: [PATCH, GCC, AArch64] Fix PR88398 for AArch64

Richard Biener Fri, 15 Nov 2019 05:50:12 -0800

On Fri, Nov 15, 2019 at 1:07 PM Sudakshina Das <sudi....@arm.com> wrote:
>
> Hi Richard
>
> I apologise I should have given more explanation on my cover letter.
> Although the bug was filed for vectorization, the conversation on it talked
> about loops with two exits not being supported in the vectorizer and being not
> being possible without lto and peeling causing more harm than benefit.
>
> There was also no clear consensus among the discussion about the best way to 
> do
> unrolling. So I looked at Wilco's suggestion of unrolling here
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398#c8
> Although unroll_stupid does not exactly unroll it as he shows but
> it gets closer than unroll_runtime_iterations. So I ran an experiment
> to see if unrolliong the loop with unroll_stupid gets any benefit. The code
> size benefit was easy to see with the small example but it also gave 
> performance
> benefit on Spec2017. The benefit comes because unroll_runtime_iteration adds
> a switch case at the beginning for iteration check. This is less efficient
> because it creates too many branch close together specially for a loop which
> has more than 1 exit.
>         beq     .L70
>         cmp     x12, 1
>         beq     .L55
>         cmp     x12, 2
>         beq     .L57
>         cmp     x12, 3
>         beq     .L59
>         cmp     x12, 4
>         beq     .L61
>         cmp     x12, 5
>         beq     .L63
>         cmp     x12, 6
>         bne     .L72
>
> Finally I agree that unroll_stupid by default did not touch loops with 
> multiple
> exists but that was marked as a "TODO" to change later so I assumed that check
> was not a hard requirement for the unrolling alghorithm.
> /* Do not unroll loops with branches inside -- it increases number
>      of mispredicts.
>      TODO: this heuristic needs tunning; call inside the loop body
>      is also relatively good reason to not unroll.  */
> unroll_stupid is also not touched unless there is -funroll-all-loops or a loop
> pragma incidcating that maybe this could be potentially harmful on certain 
> targets.
> Since my experiments on AArch64 showed otherwise, I thought the easiest 
> starting
> point would be to do this in a target hook and only for a specific case
> (multiple exits).


So what do we actually do unpatched with -funroll-loops here?  On x86
I see us choosing the runtime variant which actually creates above
"stupid" way of implementing the documented

   mod = n % 4;

   switch (mod)
     {
       case 3:
         body; i++;
       case 2:
         body; i++;
       case 1:
         body; i++;
       case 0: ;
     }

also unrolling 8 times (of course).  This is the part of the code-gen
that makes problems on arm? (also on x86 I think given the high
branch density)    When I force "stupid" unrolling the unrolled
part is clearly worse and I see no benefit from unrolling here
(eventually over the histogram of the number of iterations the
repeated exits are now better predicted because they are duplicated).

"stupid" was actually implemented for the case where the loop
wasn't simple, otherwise the runtime algorithm is supposed to be
the way to go (and I agree when looking at the generated code).
What needs to be improved there is probably the code genreration
for the "prologue" which ends up with too high branch density.

I wonder if structuring  it more like the "stupid" case would be
better, thus

  mod = n % 4;

  if (mod < 3) goto skip_first;
  body; i++;
skip_first:
  if (mod < 2) goto skip_second;
  body; i++;
skip_second:
  if  (mod < 1) goto skip_third;
  body; i++;
skip_third:
  body; i++;

   while (cond)
     {
       body;
       if (!cond) break;
       body;
       if (!cond) break;
       body;
       if (!cond) break;
       body;
     }

that would cater for your case, does not require -funroll-all-loops
or a new target hook.  It might be that all that is needed is
reordering of basic blocks with the current scheme to decrease
branch density.  Alternatively limit the unroll factor or pad
the prologue jumps to make the branch predictor happy.

But your patch is a hack.

Richard.

> Thanks
> Sudi
>
>
>
>
>
>
>
> From: Richard Biener <richard.guent...@gmail.com>
>
> Sent: Friday, November 15, 2019 9:32 AM
>
> To: Sudakshina Das <sudi....@arm.com>
>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; Kyrill Tkachov 
> <kyrylo.tkac...@foss.arm.com>; James Greenhalgh <james.greenha...@arm.com>; 
> Richard Earnshaw <richard.earns...@arm.com>; bin.ch...@linux.alibaba.com 
> <bin.ch...@linux.alibaba.com>;
>  o...@ucw.cz <o...@ucw.cz>
>
> Subject: Re: [PATCH, GCC, AArch64] Fix PR88398 for AArch64
>
>
> On Thu, Nov 14, 2019 at 4:41 PM Sudakshina Das <sudi....@arm.com> wrote:
>
> >
>
> > Hi
>
> >
>
> > This patch is trying to fix PR88398 for AArch64. As discussed in the PR,
>
> > loop unrolling is probably what we can do here. As an easy fix, the
>
> > existing unroll_stupid is unrolling the given example better than the
>
> > unroll_runtime_iterations since the the loop contains a break inside it.
>
>
> Hm, the bug reference doesn't help me at all in reviewing this - the bug
>
> is about vectorization.
>
>
> So why is unroll_stupid better than unroll_runtime_iterations for a loop
>
> with a break (or as your implementation, with multiple exists)?
>
>
> I don't like this target hook, it seems like general heuristics can be
>
> improved here, but it seems unroll-stupid doesn't consider loops
>
> with multiple exits at all?
>
>
> Richard.
>
>
> > So all I have done here is:
>
> > 1) Add a target hook so that this is AArch64 specific.
>
> > 2) We are not unrolling the loops that decide_unroll_runtime_iterations
>
> > would reject.
>
> > 3) Out of the ones that decide_unroll_runtime_iterations would accept,
>
> > check if the loop has more than 1 exit (this is done in the new target
>
> > hook) and if it does, try to unroll using unroll_stupid.
>
> >
>
> > Regression tested on AArch64 and added the test from the PR. This gives
>
> > an overall code size reduction of 2.35% and performance gain of 0.498%
>
> > on Spec2017 Intrate.
>
> >
>
> > Is this ok for trunk?
>
> >
>
> > Thanks
>
> > Sudi
>
> >
>
> > gcc/ChangeLog:
>
> >
>
> > 2019-xx-xx  Sudakshina Das  <sudi....@arm.com>
>
> >
>
> >         PR88398
>
> >         * cfgloop.h: Include target.h.
>
> >         (lpt_dec): Move to...
>
> >         * target.h (lpt_dec): ... Here.
>
> >         * target.def: Define TARGET_LOOP_DECISION_ADJUST.
>
> >         * loop-unroll.c (decide_unroll_runtime_iterations): Use new target 
> > hook.
>
> >         (decide_unroll_stupid): Likewise.
>
> >         * config/aarch64/aarch64.c (aarch64_loop_decision_adjust): New 
> > function.
>
> >         (TARGET_LOOP_DECISION_ADJUST): Define for AArch64.
>
> >         * doc/tm.texi: Regenerated.
>
> >         * doc/tm.texi.in: Document TARGET_LOOP_DECISION_ADJUST.
>
> >
>
> > gcc/testsuite/ChangeLog:
>
> >
>
> > 2019-xx-xx  Sudakshina Das  <sudi....@arm.com>
>
> >
>
> >         PR88398
>
> >         * gcc.target/aarch64/pr88398.c: New test.

Re: [PATCH, GCC, AArch64] Fix PR88398 for AArch64

Reply via email to