Re: extend fwprop optimization

Wei Mi Tue, 12 Mar 2013 00:18:42 -0700

Thanks for the helpful comments! I have some replies inlined.

Regards,
Wei.


On Mon, Mar 11, 2013 at 12:52 PM, Steven Bosscher <stevenb....@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
>> This is the fwprop extension patch which is put in order. Regression
>> test and bootstrap pass. Please help to review its rationality. The
>> following is a brief description what I have done in the patch.
>>
>> In order to make fwprop more effective in rtl optimization, we extend
>> it to handle general expressions instead of the three cases listed in
>> the head comment in fwprop.c. The major changes include a) We need to
>> check propagation correctness for src exprs of def which contain mem
>> references. Previous fwprop for the three cases above doesn't have the
>> problem. b) We need a better cost model because the benefit is usually
>> not so apparent as the three cases above.
>>
>> For a general fwprop problem, there are two possible sources where
>> benefit comes from. The frist is the new use insn after propagation
>> and simplification may have lower cost than itself before propagation,
>> or propagation may create a new insn, that could be splitted or
>> peephole optimized later and get a lower cost. The second is that if
>> all the uses are replaced with the src of the def insn, the def insn
>> could be deleted.
>>
>> So instead of check each def-use pair independently, we use DU chain
>> to track all the uses for a def. For each def-use pair, we attempt the
>> propagation, record the change candidate in changes[] array, but we
>> wait to confirm the changes until all the pairs with the same def are
>> iterated. The changes confirmation is done in the func
>> confirm_change_group_by_cost. We only do this for fwprop. For
>> fwprop_addr, the benefit of each change is ensured by
>> propagation_rtx_1 using should_replace_address, so we just confirm all
>> the changes without checking benefit again.
>
> Hello Wei Mi,
>
> So IIUC, in essence you are doing:
>
> main:
>   FOR_EACH_BB:
>     FOR_BB_INSNS, non-debug insns only:
>       for each df_ref DEF operand on insn:
>         iterate_def_uses
>
> iterate_def_uses:
>   for each UD chain from DEF to USE(i):
>     forward_propagate_into
>   confirm changes by total benefit
>
> I still like the idea, but there are also still a few "design issues"
> to resolve.
>
> Some of the same comments as before apply: Do you really, really,
> reallyreally have to go so low-level as to insn splitting, peephole
> optimizations, and even register allocation, to get the cost right?
> That will almost certainly not be acceptable, and I for one would
> oppose such a change. It's IMHO a violation of proper engineering when
> your medium-to-high level code transformations have to do that. If you
> have strong reasons for your approach, it'd be helpful if you can
> explain them so that we can together look for a less intrusive
> solution (e.g. splitting earlier, adjusting the cost model, etc.).
>

For the motivational case, I need insn splitting to get the cost
right. insn splitting is not very intrusive. All I need is to call
split_insns func. I think split_insns is just a pattern matching func
just like recog(), which is called at many places. Peephole is not
necessary (I add it in order to find as many oppotunities as possible,
but from my trace analysis, it doesn't help much). To call
peephole2_insn() is indeed intrusive, because peephole assumes reg
allocation is completed, I have to insert the ugly workaround below.
peephole also requires setting DF_LR_RUN_DCE flag and some
initialization of peep2_insn_data array.

So how about keep split_insns and remove peephole in the cost estimation func?

> So things like:
>> +  /* we may call peephole2_insns in fwprop phase to estimate how
>> +     peephole will affect the cost of the insn transformed by fwprop.
>> +     fwprop is done before ira phase. In that case, we simply return
>> +     a new pseudo register.  */
>> +  if (!strncmp (current_pass->name, "fwprop", 6))
>> +    return gen_reg_rtx (mode);
>
> and
>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c        (revision 196270)
>> +++ config/i386/i386.c        (working copy)
>> @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
>>  {
>>    rtx tmp;
>>
>> -  /* We play register width games, which are only valid after reload.  */
>> -  gcc_assert (reload_completed);
>> +  /* We play register width games, which are only valid after reload.
>> +     An exception: fwprop call peephole to estimate the change benefit,
>> +     and peephole will call this func. That is before reload complete.
>> +     It will not bring any problem because the peephole2_insns call is
>> +     only used for cost estimation in fwprop, and its change will be
>> +     abandoned immediately after the cost estimation.  */
>> +  if (strncmp (current_pass->name, "fwprop", 6))
>> +    gcc_assert (reload_completed);
>
> are IMHO not OK.
>

They are intrusive and inserted for peephole.

> Note that your patch is a bit difficult to read at some points because
> you have included a bunch of non-changes (whitespaces fixes --
> necessary cleanups but not relevant for your patch), see e.g. the
> changed lines that contain "lra_in_progress". Also the changes like:
>>  static bool
>> -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
>> +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
>
> which are quite distracting, making it harder to see what has *really* 
> changed.
>

In the old fwprop, the flags is of enum type {PR_CAN_APPEAR,
PR_HANDLE_MEM, PR_OPTIMIZE_FOR_SPEED}.   PR_CAN_APPEAR is used to
restrict fwprop to only handle the three typical cases in the comment
at the head of fwprop.c. PR_HANDLE_MEM is used for mem addr.
PR_OPTIMIZE_FOR_SPEED indicates optimization is for speed or for size.
New fwprop will only use PR_OPTIMIZE_FOR_SPEED, and the other two are
useless. So I change the param from "int flags" to "bool speed".

> You should probably just a helper function apply_change_group_num()
> and avoid all the apply_change_group use fixups.
>
>

Yes, they are distracting. I can use apply_change_group_num for easy
code review, but I think to commit, extending apply_change_group and
avoid creating a very similar func is more welcomed?


> In fwprop.c:
>> +  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
>> +     estimation in estimate_split_and_peephole.  */
>> +  df_set_flags (DF_LR_RUN_DCE);
>>    df_md_add_problem ();
>>    df_note_add_problem ();
>> -  df_analyze ();
>> +  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
>>    df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
>> +  df_analyze ();
>
> you add DU and UD chains, and implicitly the RD problem, but you also
> already have the MD problem. I think my reaching-defs patches for GCC
> 4.8 make the MD problem less necessary, but you certainly don't need
> MD + RD + UD + DU.
>

MD problem is used to set use_def_ref and check whether a use has
multiple def in the old fwprop. I reuse that part in new fwprop. With
UD chain, we may remove MD problem and use_def_ref vector, and use UD
chain to do the check. I will try it.

> You've noticed so yourself:
>> +   We think the maintainance for use_def_ref vector is not necessary, so
>> +   we remove update_df/update_uses/update_df_init/register_active_defs.  */
>
> and it looks like you're simply avoiding the problem by queuing up
> changes and commit them all at the end. I don't believe that will
> work, you'll break the UD and DU chains and may end up with dangling
> pointers to free'd or removed df_refs.
>

I only remove the code related with use_def_ref vector. The code to
update df_refs is kept but moved to update_df in recog.c.

>
>> +  /* see when the insn is not a set  */
>> +  if (!set)
>> +    return false;
>
> fwprop.c was speciflcally developed to also handle multiple-set
> instructions, like the bits of cse.c that it tried to replace. Your
> patch should not change this.
>

ok, I will remove the limitation.

>
>> +static bool
>> +mem_may_be_modified (rtx from, rtx to)
>
> This has "potentially slow" written all over it :-) (You're punting on
> any MEM for now, but someone at some point will find a reason to use
> alias analysis, blowing up really bad test cases like PR39326...)
>

If someone wants to add alias analysis here, he must find out some
good reasons :-). That is what I expect.

>
>> +int
>> +reg_mentioned_num (const_rtx reg, const_rtx in
>
> Should use DF caches instead of deep-diving the pattern. Or if DF
> cache updates are deferred, use for_each_rtx on the pattern.
>

I will fix it.

>
>> +/* Find whether the set define a return reg.  */
>> +
>> +static bool
>> +def_return_reg (rtx set)
>> +{
>> +  edge eg;
>> +  edge_iterator ei;
>> +  rtx dest = SET_DEST (set);
>> +
>> +  if (!REG_P (dest))
>> +    return false;
>
> The return USE will also be a hard reg:
>  +  if (!REG_P (dest) || ! HARD_REGISTER_P (dest))
>  +    return false;
>
>
>> +  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
>> +    if (eg->flags & EDGE_FALLTHRU)
>> +      {
>> +     basic_block src_bb = eg->src;
>> +     rtx last_insn, ret_reg;
>> +     if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1
>
> single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is
> only one pred to begin with?

My mistake. I copy the chunk of code from mode-switching.
FOR_EACH_EDGE is unneeded.

>
>> +         && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
>> +         && GET_CODE (PATTERN (last_insn)) == USE
>> +         && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
>> +         && REGNO (ret_reg) == REGNO (dest))
>> +       return true;
>> +      }
>> +  return false;
>> +}
>
> Actually this whole change makes me nervous. I don't think you should
> propagate into any USE at all, for return value or otherwise.

Yes, reasonable. I will add the restriction.

>
>> +  if (def_insn)
>> +  {
>> +    rtx set = single_set (def_insn);
>> +    if (set)
>> +      def_insn_cost = set_src_cost (SET_SRC (set), speed)
>> +                   + set_src_cost (SET_DEST (set), speed) + 1;
>> +    else
>> +      return false;
>> +  }
>
> As before: You'll have to deal with non-single_set insns also.
>

I will remove the limitation.

>
>> +void
>> +dump_cfg (FILE *file)
>> +{
>
> You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg.
>

Oh, thanks.

>
>> +  if (fwprop_addr)
>> +     return confirm_change_group_by_cost (false,
>> +                                       0,
>> +                                       false);
>> +  else
>> +    {
>> +      all_uses_replaced = (use_num == reg_replaced_num);
>> +      return confirm_change_group_by_cost (all_uses_replaced,
>> +                                        def_insn_cost,
>> +                                        true);
>> +    }
>
>
> What happens if you propagate into an insn that uses the same register
> twice? Will the DU chains still be valid (I don't think that's
> guaranteed)?

I think the DU chains still be valid. If propagate into the insn that
uses the same register twice, the two uses will be replaced when the
first use is seen (propagate_rtx_1 will propagate all the occurrances
of the same reg in the use insn).  When the second use is seen, the
df_use and use insn in its insn_info are still available.
forward_propagate_into will early return after check reg_mentioned_p
(DF_REF_REG (use), parent) and find out no reg is used  any more.

A testcase in gcc regression testsuite shows the case:
gcc/testsuite/gcc.target/i386/387-10.c

>
> Is the extra_benefit flag always applicable if all USEs of a DEF have
> been propagated out? What if the DEF is in an insn that is inherently
> necessary?
>

If the DEF is an insn that is inherently necessary, we may mistakenly
assume it could be deleted in cost estimation. But it will not affect
correctness because fwprop depends on delete_trivially_dead_insns in
fwprop_done or DCE after fwprop to delete dead DEF.

> Have you measured what effect this pass has on combine?

I didn't do the measure. I only see a case that new fwprop does the
same optimization as what combine does, which old fwprop doesn't
optimize. New fwprop bring the optimization to an earlier stage. The
testcase is gcc/testsuite/gcc.target/i386/387-10.c.
Qualitatively, fwprop optimize the single-def multiple down uses case
which combine cannot handle. But I didn't do measurement about their
interactive potential.

>
> Ciao!
> Steven

Re: extend fwprop optimization

Reply via email to