On Fri, Jun 21, 2019 at 1:13 AM Jeff Law <l...@redhat.com> wrote:

> On 6/20/19 3:53 AM, JiangNing OS wrote:
> > Hi Jeff,
> >
> > Appreciate your effort to review my patch! I've updated my patch as
> > attached. See my answers below.
> >
> >> in current function, so the store speculation can be avoided.
> >> So at a high level should we be doing this in gimple rather than RTL?
> >> We're going to have a lot more information about types, better
> >> infrastructure for looking at uses/defs, access to the alias oracle, we
> >> should be able to accurately distinguish between potentially shared
> >> objects vs those which are local to the thread, etc.  We lose the low
> >> level costing information though.
> >>
> >> I'm still going to go through the patch and do some level of review,
> >> but I do think we need to answer the higher level question though.
> >>
> >>
> > I have the following reasons,
> >
> > 1) Following the clue Richard B gave me before about parameter
> > --param allow-store-data-races, I did check the middle-end pass
> > tree-if-conv, but I think this pass at the moment doesn't work for
> > the issue I'm trying to solve. Tree-if-conv does if-conversion for
> > loops, and its final goal is to help loop vectorization, while my
> > case doesn't have a loop at all.
> I think the fact that it's focused so much on loops is a historical
> accident.  We certainly have a variety of places in the gimple
> optimizers that do if-conversion, and they're not all in tree-if-conv :(
> For example, some are done in tree-ssa-phiopt.
>
> In the gimple optimizers the testcase from 89430 is going to look
> something like this:
>
>
> > ;;   basic block 2, loop depth 0, count 1073741824 (estimated locally), maybe hot
> > ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       ENTRY [always]  count:1073741824 (estimated locally) (FALLTHRU,EXECUTABLE)
> >   a.0_1 = a;
> >   _2 = (long unsigned int) k_8(D);
> >   _3 = _2 * 4;
> >   _4 = a.0_1 + _3;
> >   _5 = *_4;
> >   if (_5 > b_9(D))
> >     goto <bb 3>; [50.00%]
> >   else
> >     goto <bb 4>; [50.00%]
> > ;;    succ:       3 [50.0% (guessed)]  count:536870912 (estimated locally) (TRUE_VALUE,EXECUTABLE)
> > ;;                4 [50.0% (guessed)]  count:536870912 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> >
> > ;;   basic block 3, loop depth 0, count 536870913 (estimated locally), maybe hot
> > ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       2 [50.0% (guessed)]  count:536870912 (estimated locally) (TRUE_VALUE,EXECUTABLE)
> >   *_4 = b_9(D);
> > ;;    succ:       4 [always]  count:536870913 (estimated locally) (FALLTHRU,EXECUTABLE)
> >
> > ;;   basic block 4, loop depth 0, count 1073741824 (estimated locally), maybe hot
> > ;;    prev block 3, next block 1, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       3 [always]  count:536870913 (estimated locally) (FALLTHRU,EXECUTABLE)
> > ;;                2 [50.0% (guessed)]  count:536870912 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> >   return;
>
> That looks like a pretty easy form to analyze.  I'd suggest looking
> through tree-ssa-phiopt.c closely.  There's several transformations in
> there that share similarities with yours.
>

More specifically there is the conditional store elimination (cselim) pass
inside this file.


> > 2) My current solution fits into the current back-end if-conversion
> > pass very well. I don't want to invent a new framework to solve this
> > relatively small issue. Besides, this back-end patch doesn't only
> > enhance store speculation detection, but also fixes a bug in the
> > original code.
> Understood, but I still wonder if we're better off addressing this in
> gimple.
>

Traditionally, if-conversion profitability heavily depends on the target,
and especially when memory is involved, costing on GIMPLE is hard.  That's
also one reason why we turned off loop if-conversion on GIMPLE for
non-vectorized code.

phiopt is really about follow-up optimization opportunities in passes
that do not handle conditionals well.

cselim is on the border...


> >> Just from a design standpoint, what are the consequences if this
> >> function returns true for something that isn't actually on the stack,
> >> or false for something that is on the stack?
> >>
> > If noce_mem_is_on_stack returns true for something that isn't actually
> > on the stack, it could potentially introduce store speculation, and
> > then the if-conversion optimization will be incorrect. If this
> > function returns false for something that is on the stack, it doesn't
> > matter, because the optimization will not be triggered.
> OK.  That's what I expected.
> >
> >>
> >>> +
> >>> +/* Always return true, if there is a dominating write.
> >>> +
> >>> +   When there is a dominating read from memory on stack,
> >>> +   1) if x = a is a memory read, return true.
> >>> +   2) if x = a is a memory write, return true if the memory is on stack.
> >>> +      This is the guarantee the memory is *not* readonly. */
> >>> +
> >>> +static bool
> >>> +noce_valid_for_dominating (basic_block bb, rtx_insn *a_insn,
> >>> +                           const_rtx x, bool is_store)
> >>> +{
> >>> +  rtx_insn *insn;
> >>> +  rtx set;
> >>> +
> >>> +  gcc_assert (MEM_P (x));
> >>> +
> >>> +  FOR_BB_INSNS (bb, insn)
> >>> +    {
> >>> +      set = single_set (insn);
> >>> +      if (!set)
> >>> +        continue;
> >>> +
> >>> +      /* Dominating store */
> >>> +      if (rtx_equal_p (x, SET_DEST (set)))
> >>> +        return true;
> >>> +
> >>> +      /* Dominating load */
> >>> +      if (rtx_equal_p (x, SET_SRC (set)))
> >>> +        if (is_store && noce_mem_is_on_stack (a_insn, x))
> >>> +          return true;
> >>> +    }
> >>> +
> >>> +  return false;
> >>> +}
> >> So what would be the consequences here of returning false when in fact
> >> there was a dominating read or write?  That could easily happen if the
> >> dominating read or write was not a single_set.
> > If we return false when in fact there is a dominating read or write,
> > the optimization will not be triggered, because noce_mem_maybe_invalid_p
> > will be true, which plays the same role as may_trap_or_fault_p.
> >
> >>
> >> I'm guessing that from a design standpoint you're trying to find cases
> >> where you know an object was written to within the block and does not
> >> escape.  So returning false when there was a dominating write is safe.
> >> Returning true when there was not would be disastrous.  Right?
> >>
> > YES! You've completely understood my code! 😊
> So in this context, only handling single_sets is safe.  Perfect.
>
>
> >
> >>
> >>
> >>
> >>> +
> >>> +  /* Iterate all insns until finding all stack pointers, and store
> >>> +     them into bba_sets_must_be_sfp. */
> >>> +  bba_sets_must_be_sfp = BITMAP_ALLOC (&reg_obstack);
> >>> +  do
> >>> +    {
> >>> +      sfp_found = false;
> >>> +      FOR_ALL_BB_FN (bb, cfun)
> >>> +        FOR_BB_INSNS (bb, a_insn)
> >>> +          {
> >>> +            rtx sset_a = single_set (a_insn);
> >>> +            if (!sset_a)
> >>> +              continue;
> >>> +            rtx src = SET_SRC (sset_a);
> >>> +            rtx dest = SET_DEST (sset_a);
> >> So again, we can get things that aren't a single_set here and they'll
> >> be ignored.  But I believe that's safe as you're trying to identify
> >> insns which store stack addresses into a REG.  Missing a more
> >> complicated case where a stack address is stored into a REG is safe.
> >> Right?
> >>
> > Yes. Missing some pointers to the stack is safe here, because we will
> > use it to check that a memory access in the if-conversion candidate
> > doesn't have store speculation. If we miss it, the optimization will
> > not be triggered, so there will be no risk.
> Thanks for the confirmation.
>
> >>
> >>> +
> >>> +            /* Handle case like "x2 = x1 + offset", in which x1 is
> >>> +               already identified as a stack pointer. */
> >>> +            if ((GET_CODE (src) == PLUS || GET_CODE (src) == MINUS)
> >>> +                && CONST_INT_P (XEXP (src, 1)))
> >>> +              {
> >>> +                rtx x1 = XEXP (src, 0);
> >>> +                if (!REG_P (x1))
> >>> +                  continue;
> >>> +
> >>> +                FOR_EACH_INSN_USE (use, a_insn)
> >>> +                  {
> >>> +                    if (!rtx_equal_p (x1, DF_REF_REG (use)))
> >>> +                      continue;
> >>> +
> >>> +                    if (bitmap_bit_p (bba_sets_must_be_sfp,
> >>> +                                      DF_REF_REGNO (use)))
> >>> +                      {
> >>> +                        bitmap_set_bit (bba_sets_must_be_sfp, dest_regno);
> >>> +                        sfp_found = true;
> >>> +                        break;
> >>> +                      }
> >>> +                  }
> >> So how much is there to be gained from going from this kind of
> >> localized computation to global?  It shouldn't be hard to turn this
> >> into a truly global algorithm, but I'm not sure it's worth the effort.
> >>
> >> I _do_ think it's worth visiting blocks in dominator order.  It's
> >> better than what you're doing, but nowhere near as expensive as a
> >> global propagation engine.
> >>
> > Fixed in attached new patch using dom_walk class. Please review again.
> >
> >> Is it worth handling simple copies in the code above?
> > I thought it less likely to have a copy of a pointer to the stack
> > before register allocation. Right?
> We certainly try to eliminate as many copies as possible, but experience
> would tell us that they often sneak through.  I guess some quick
> instrumentation would tell us if it's worth the effort.
>
> >
> >>
> >>> +/* Return true if the current function doesn't pass a stack address
> >>> +   out of the frame. */
> >>> +
> >>> +static bool
> >>> +no_stack_address_taken (void)
> >>> +{
> >>> +  basic_block bb;
> >>> +  rtx_insn *a_insn;
> >>> +  df_ref use;
> >>> +
> >>> +  FOR_ALL_BB_FN (bb, cfun)
> >>> +    FOR_BB_INSNS (bb, a_insn)
> >>> +      {
> >>> +        if (!INSN_P (a_insn))
> >>> +          continue;
> >>> +
> >>> +        rtx sset_a = single_set (a_insn);
> >>> +        if (!sset_a)
> >>> +          continue;
> >>> +        rtx src = SET_SRC (sset_a);
> >>> +        rtx dest = SET_DEST (sset_a);
> >>> +
> >>> +        /* Skip insns whose dest is already identified as a frame pointer. */
> >>> +        if (bitmap_bit_p (bba_sets_may_be_sfp, REGNO (dest)))
> >>> +          continue;
> >>> +
> >>> +        FOR_EACH_INSN_USE (use, a_insn)
> >>> +          {
> >>> +            /* Check if current insn is using any stack pointer.  Ignore
> >>> +               if it doesn't use frame pointers at all. */
> >>> +            if (!bitmap_bit_p (bba_sets_may_be_sfp, DF_REF_REGNO (use)))
> >>> +              continue;
> >>> +
> >>> +            /* Load cannot introduce address taken. */
> >>> +            if (MEM_P (src))
> >>> +              continue;
> >>> +
> >>> +            /* Store is safe only if the src doesn't contain a stack
> >>> +               pointer. */
> >>> +            if (MEM_P (dest)
> >>> +                && !reg_mentioned_p (DF_REF_REG (use), src))
> >>> +              continue;
> >>> +
> >>> +            /* All other cases potentially introduce address taken. */
> >>> +            return false;
> >>> +          }
> >> So what if you have a PARALLEL where one entry causes an escape of a
> >> pointer into the stack and another entry does some other operation?
> >> If so, don't you miss the fact that you've got an escaping pointer
> >> into the stack?  I don't think you can restrict yourself to
> >> single_sets here.
> > Yes. You are right. I have added code to handle the PARALLEL case,
> > though I handle it in a rather conservative way. Review again, please!
> I'll take another look at the patch.   I'll also do a little
> experimentation to see if we're seeing enough copies to make handling
> them worth the effort.
>
> While I'm doing that, if you could walk through tree-ssa-phiopt.c and
> ponder whether your transformation could fit into that framework, it'd
> be appreciated (note the comment at the start of the file only covers
> one class of transformations; there are others later).
>
> jeff
>
