Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

Richard Sandiford Tue, 15 Sep 2020 02:12:43 -0700

Qing Zhao <qing.z...@oracle.com> writes:
>> On Sep 14, 2020, at 2:20 PM, Richard Sandiford <richard.sandif...@arm.com> 
>> wrote:
>> 
>> Qing Zhao <qing.z...@oracle.com <mailto:qing.z...@oracle.com>> writes:
>>>> On Sep 14, 2020, at 11:33 AM, Richard Sandiford 
>>>> <richard.sandif...@arm.com> wrote:
>>>> 
>>>> Qing Zhao <qing.z...@oracle.com> writes:
>>>>>> Like I mentioned earlier though, passes that run after
>>>>>> pass_thread_prologue_and_epilogue can use call-clobbered registers that
>>>>>> weren't previously used.  For example, on x86_64, the function might
>>>>>> not use %r8 when the prologue, epilogue and returns are generated,
>>>>>> but pass_regrename might later introduce a new use of %r8.  AIUI,
>>>>>> the “used” version of the new command-line option is supposed to clear
>>>>>> %r8 in these circumstances, but it wouldn't do so if the data was
>>>>>> collected at the point that the return is generated.
>>>>> 
>>>>> Thanks for the information.
>>>>> 
>>>>>> 
>>>>>> That's why I think it's more robust to do this later (at the beginning
>>>>>> of pass_late_compilation) and insert the zeroing before returns that
>>>>>> already exist.
>>>>> 
>>>>> Yes, looks like it’s not correct to insert the zeroing at the time when 
>>>>> prologue, epilogue and return are generated.
>>>>> As I also checked, “return” might be also generated as late as pass 
>>>>> “pass_delay_slots”,  So, shall we move the
>>>>> New pass as late as possible?
>>>> 
>>>> If we insert the zeroing before pass_delay_slots and describe the
>>>> result correctly, pass_delay_slots should do the right thing.
>>>> 
>>>> Describing the result correctly includes ensuring that the cleared
>>>> registers are treated as live on exit from the function, so that the
>>>> zeroing doesn't get deleted again, or skipped by pass_delay_slots.
>>> 
>>> In the current implementation for x86, when we generating a zeroing insn as 
>>> the following:
>>> 
>>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>>        (const_int 0 [0])) "t10.c":11:1 -1
>>>     (nil))
>>> (insn 19 18 20 2 (unspec_volatile [
>>>            (reg:SI 1 dx)
>>>        ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>>     (nil))
>>> 
>>> i.e, after each zeroing insn, the register that is zeroed is marked as 
>>> “UNSPECV_PRO_EPILOGUE_USE”, 
>>> By doing this, we can avoid this zeroing insn from being deleted or 
>>> skipped. 
>>> 
>>> Is doing this enough to describe the result correctly?
>>> Is there other thing we need to do in addition to this?
>> 
>> I guess that works, but I think it would be better to abstract
>> EPILOGUE_USES into a new target-independent wrapper function that
>> (a) returns true if EPILOGUE_USES itself returns true and (b) returns
>> true for registers that need to be zero on return, if the zeroing
>> instructions have already been inserted.  The places that currently
>> test EPILOGUE_USES should then test this new wrapper function instead.
>
> Okay, I see. 
> Looks like that EPILOGUE_USES is used in df-scan.c to compute the data flow 
> information. If EPILOUGE_USES return true
> for registers that need to be zeroed on return, those registers will be 
> included in the data flow information, as a result, later
> passes will not be able to delete them. 
>
> This sounds to be a cleaner approach than the current one that marks the 
> registers  “UNSPECV_PRO_EPILOGUE_USE”. 
>
> A more detailed implementation question on this: 
> Where should I put this new target-independent wrapper function in? Which 
> header file will be a proper place to hold this new function?


Not a strong opinion, but: maybe df.h and df-scan.c, since this is
really a DF query.

>> After inserting the zeroing instructions, the pass should recompute the
>> live-out sets based on this.

Sorry, I was wrong here.  It should *cause* the sets to be recomputed
where necessary (rather than recompute them directly), but see below.

> Is only computing the live-out sets of the block that including the return 
> insn enough? Or we should re-compute the whole procedure? 
>
> Which utility routine I should use to recompute the live-out sets?

Inserting the instructions will cause the containing block to be marked
dirty, via df_set_bb_dirty.  I think the pass should also call
df_set_bb_dirty on the exit block itself, to indicate that the
wrapper around EPILOGUE_USES has changed behaviour, but that might
not be necessary.

This gives the df machinery enough information to work out what has changed.
It will then propagate those changes throughout the function.  (I don't
think any propagation would be necessary here, but if I'm wrong about that,
then the df machinery will do whatever propagation is necessary.)

However, the convention is for a pass that uses the df machinery to call
df_analyze first.  This call to df_analyze updates any stale df information.

So unlike what I said yesterday, the pass itself doesn't need to make sure
that the df information is up-to-date.  It just needs to indicate what
has changed, as above.

In the case of pass_delay_slots, pass_free_cfg has:

  /* The resource.c machinery uses DF but the CFG isn't guaranteed to be
     valid at that point so it would be too late to call df_analyze.  */
  if (DELAY_SLOTS && optimize > 0 && flag_delayed_branch)
    {
      df_note_add_problem ();
      df_analyze ();
    }

Any other machine-specific passes that use df already need to call
df_analyze (if they use the df machinery).  So simply marking what
has changed is enough (by design).

> My understanding is:
>
> In our new pass that is put in the beginning of the pass_late_compilation, 
> I,e pass_zero_call_used_regs;
>
>       PUSH_INSERT_PASSES_WITHIN (pass_late_compilation)
> ++++  NEXT_PASS (pass_zero_call_used_regs);
>           NEXT_PASS (pass_compute_alignments);
>           NEXT_PASS (pass_variable_tracking);
>           NEXT_PASS (pass_free_cfg);
>           NEXT_PASS (pass_machine_reorg);
>           NEXT_PASS (pass_cleanup_barriers);
>           NEXT_PASS (pass_delay_slots);
>
> When we scan the EXIT BLOCK of the routine, all the return insns have already 
> been there.
> The later passes including “pass_delay_slots” will not generate additional 
> returns anymore,  they might just call “target.gen_return” or 
> “target.gen_simple_return() to replace 
> “ret_rtx” or “simple_ret_rtx” ?

Kind-of.  pass_delay_slots can also duplicate code, so it's not always a
straight replacement.  But the point is that returns don't appear out of
nowhere.  There has to be a semantic reason for them to exist.  The
behaviour of the function after pass_delay_slots has to be the same
as it was before the pass (disregarding undefined behaviour).  Once we've
added clearing of the zero registers to all return paths, that clearing
becomes part of the behaviour of the function, and so will be part of
the behaviour after pass_delay_slots as well.

So I don't think the problem is with passes generating new returns.
It's more whether they could use new registers that then need to be
cleared, which is the main justification for running the new pass
so late in the pipeline.

In principle, there's nothing stopping pass_delay_slots allocating
new registers (like pass_regrename does), and in principle that could
introduce the need to do more clearing.  But I don't think the current
pass does that.  The pass is also very much legacy code at this point,
so the chances of new optimisations being added to it are fairly low.
If that did happen, I think it would be reasonable to expect the pass
to work within the set of registers that have already been allocated,
at least when your new option is in effect.

Thanks,
Richard

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

Reply via email to