Re: [PATCH] lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))"

Takayuki 'January June' Suwa via Gcc-patches Wed, 03 Aug 2022 04:17:41 -0700

Thanks for your response.

On 2022/08/03 16:52, Richard Sandiford wrote:
> Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps
>> data flow consistent, but it also increases register allocation pressure
>> and thus often creates many unwanted register-to-register moves that
>> cannot be optimized away.
> 
> There are two things here:
> 
> - If emit_move_complex_parts emits a clobber of a hard register,
>   then that's probably a bug/misfeature.  The point of the clobber is
>   to indicate that the register has no useful contents.  That's useful
>   for wide pseudos that are written to in parts, since it avoids the
>   need to track the liveness of each part of the pseudo individually.
>   But it shouldn't be necessary for hard registers, since subregs of
>   hard registers are simplified to hard registers wherever possible
>   (which on most targets is "always").
> 
>   So I think the emit_move_complex_parts clobber should be restricted
>   to !HARD_REGISTER_P, like the lower-subreg clobber is.  If that helps
>   (if only partly) then it would be worth doing as its own patch.
> 
> - I think it'd be worth looking into more detail why a clobber makes
>   a difference to register pressure.  A clobber of a pseudo register R
>   shouldn't make R conflict with things that are live at the point of
>   the clobber.


I agree that this is worth investigating.
In fact, leaving other ports aside, on the xtensa port RA does very badly
on code that makes heavy use of D[FC]mode pseudos.
For example, for __muldc3 in libgcc2, the size of the reserved stack frame
differs by almost a factor of two depending on whether or not this patch
is applied.
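
To look at this outside libgcc, a reduced test case along the following
lines might help.  It is only a sketch: mul_naive is a hypothetical name,
the function is not taken from libgcc2 (it omits the NaN/infinity recovery
that __muldc3 performs), and it merely creates several DFmode/DCmode
values for the register allocator to deal with.

```c
/* Hypothetical reduced test case, NOT the libgcc2 implementation.
   Textbook complex multiplication written with the GNU __real__ and
   __imag__ extensions, so the result is assembled part by part.  */
double _Complex
mul_naive (double _Complex a, double _Complex b)
{
  double ar = __real__ a, ai = __imag__ a;
  double br = __real__ b, bi = __imag__ b;
  double _Complex r;

  /* (ar + i*ai) * (br + i*bi)  */
  __real__ r = ar * br - ai * bi;
  __imag__ r = ar * bi + ai * br;
  return r;
}
```

Compiling this for the target with "-O2 -fstack-usage", once with and once
without the patch, and comparing the generated .su files should show
whether the frame size changes for such code.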

> 
>> It seems analogous to the partial register stall, which is a well-known
>> problem on processors that do register renaming.
>>
>> In my opinion, when the register to be clobbered is a composite of hard
>> ones, we should clobber the individual elements separately; otherwise we
>> should clear the entire register to zero prior to use, as the "init-regs"
>> pass does (like partial register stall workarounds on x86 CPUs).  Such
>> redundant zero constant assignments will be removed later in the
>> "cprop_hardreg" pass.
> 
> I don't think we should rely on the zero being optimised away later.
> 
> Emitting the zero also makes it harder for the register allocator
> to elide the move.  For example, if we have:
> 
>   (set (subreg:SI (reg:DI P) 0) (reg:SI R0))
>   (set (subreg:SI (reg:DI P) 4) (reg:SI R1))
> 
> then there is at least a chance that the RA could assign hard registers
> R0:R1 to P, which would turn the moves into nops.  If we emit:
> 
>   (set (reg:DI P) (const_int 0))
> 
> beforehand then that becomes impossible, since R0 and R1 would then
> conflict with P.

Ah, you are right; as you pointed out, that holds for targets where
"(reg:DI)" corresponds to one hard register.

> 
> TBH I'm surprised we still run init_regs for LRA.  I thought there was
> a plan to stop doing that, but perhaps I misremember.

Sorry, I am not sure about the status of LRA, because the xtensa port is
still using reload.

In conclusion, trying to tweak the common code may have been a bit
premature.
I'll consider whether I can deal with those issues in the target-specific
code.

> 
> Thanks,
> Richard
> 
>> This patch may give better output code quality for the reasons above,
>> especially on architectures that don't have DFmode hard registers
>> (On architectures with such hard registers, this patch changes virtually
>> nothing).
>>
>> For example (Espressif ESP8266, Xtensa without FP hard regs):
>>
>>     /* example */
>>     double _Complex conjugate(double _Complex z) {
>>       __imag__(z) *= -1;
>>       return z;
>>     }
>>
>>     ;; before
>>     conjugate:
>>         movi.n  a6, -1
>>         slli    a6, a6, 31
>>         mov.n   a8, a2
>>         mov.n   a9, a3
>>         mov.n   a7, a4
>>         xor     a6, a5, a6
>>         mov.n   a2, a8
>>         mov.n   a3, a9
>>         mov.n   a4, a7
>>         mov.n   a5, a6
>>         ret.n
>>
>>     ;; after
>>     conjugate:
>>         movi.n  a6, -1
>>         slli    a6, a6, 31
>>         xor     a6, a5, a6
>>         mov.n   a5, a6
>>         ret.n
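
For readers following the assembly above: the movi.n/slli pair builds the
constant 0x80000000, and the xor flips the top bit of a5.  Assuming IEEE 754
binary64 doubles, and assuming a5 holds the word of the imaginary part that
contains the sign bit (which the listing suggests), that is the same as
negating the imaginary part.  A minimal sketch of the equivalence in
portable C follows; flip_sign_bit is a hypothetical helper, not part of
the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate a double by toggling bit 63 of its
   representation.  Assumes IEEE 754 binary64.  */
double
flip_sign_bit (double d)
{
  uint64_t bits;

  memcpy (&bits, &d, sizeof bits);
  bits ^= UINT64_C (1) << 63;
  memcpy (&d, &bits, sizeof d);
  return d;
}

/* The example from the patch description.  */
double _Complex
conjugate (double _Complex z)
{
  __imag__ (z) *= -1;
  return z;
}
```

For finite, non-NaN inputs, __imag__ conjugate (z) and
flip_sign_bit (__imag__ z) compare equal, which is why a single xor on one
word is enough in the "after" listing.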
>>
>> gcc/ChangeLog:
>>
>>      * lower-subreg.cc (resolve_simple_move):
>>      Add zero clear of the entire register immediately after
>>      the clobber.
>>      * expr.cc (emit_move_complex_parts):
>>      Change to clobber the real and imaginary parts separately
>>      instead of the whole complex register if possible.
>> ---
>>  gcc/expr.cc         | 26 ++++++++++++++++++++------
>>  gcc/lower-subreg.cc |  7 ++++++-
>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index 80bb1b8a4c5..9732e8fd4e5 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -3775,15 +3775,29 @@ emit_move_complex_push (machine_mode mode, rtx x, rtx y)
>>  rtx_insn *
>>  emit_move_complex_parts (rtx x, rtx y)
>>  {
>> -  /* Show the output dies here.  This is necessary for SUBREGs
>> -     of pseudos since we cannot track their lifetimes correctly;
>> -     hard regs shouldn't appear here except as return values.  */
>> -  if (!reload_completed && !reload_in_progress
>> -      && REG_P (x) && !reg_overlap_mentioned_p (x, y))
>> -    emit_clobber (x);
>> +  rtx_insn *re_insn, *im_insn;
>>  
>>    write_complex_part (x, read_complex_part (y, false), false, true);
>> +  re_insn = get_last_insn ();
>>    write_complex_part (x, read_complex_part (y, true), true, false);
>> +  im_insn = get_last_insn ();
>> +
>> +  /* Show the output dies here.  This is necessary for SUBREGs
>> +     of pseudos since we cannot track their lifetimes correctly.  */
>> +  if (can_create_pseudo_p ()
>> +      && REG_P (x) && ! reg_overlap_mentioned_p (x, y))
>> +    {
>> +      /* Hard regs shouldn't appear here except as return values.  */
>> +      if (HARD_REGISTER_P (x) && REG_NREGS (x) % 2 == 0)
>> +    {
>> +      emit_insn_before (gen_clobber (SET_DEST (PATTERN (re_insn))),
>> +                        re_insn);
>> +      emit_insn_before (gen_clobber (SET_DEST (PATTERN (im_insn))),
>> +                        im_insn);
>> +    }
>> +      else
>> +    emit_insn_before (gen_clobber (x), re_insn);
>> +    }
>>  
>>    return get_last_insn ();
>>  }
>> diff --git a/gcc/lower-subreg.cc b/gcc/lower-subreg.cc
>> index 03e9326c663..4ff0a7d1556 100644
>> --- a/gcc/lower-subreg.cc
>> +++ b/gcc/lower-subreg.cc
>> @@ -1086,7 +1086,12 @@ resolve_simple_move (rtx set, rtx_insn *insn)
>>        unsigned int i;
>>  
>>        if (REG_P (dest) && !HARD_REGISTER_NUM_P (REGNO (dest)))
>> -    emit_clobber (dest);
>> +    {
>> +      emit_clobber (dest);
>> +      /* We clear the entire of dest with zero after the clobber,
>> +         similar to the "init-regs" pass.  */
>> +      emit_move_insn (dest, CONST0_RTX (GET_MODE (dest)));
>> +    }
>>  
>>        for (i = 0; i < words; ++i)
>>      {
