Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Richard Sandiford Mon, 10 Jun 2024 03:28:25 -0700

Ajit Agarwal <aagar...@linux.ibm.com> writes:
> On 10/06/24 3:20 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +         rtx set = single_set (insn);
>>>>>>>>>>>>>>> +         if (set == NULL_RTX)
>>>>>>>>>>>>>>> +           return false;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +         rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>>> +         rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +         // This check is added as register pairs are not 
>>>>>>>>>>>>>>> generated
>>>>>>>>>>>>>>> +         // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>>> +         //                  (reg2)
>>>>>>>>>>>>>>> +         //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>>>> +         if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>>> +           return false;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not sure why register allocator fails allocating register 
>>>>>>>>>>>>> pairs with
>>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged 
>>>>>>>>>>>>> register allocator why the NEG
>>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 
>>>>>>>>>>> bits are
>>>>>>>>>>> set correctly. 
>>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>>
>>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>>
>>>>>>>>>> I think this is just restating the symptom though.  I suppose the 
>>>>>>>>>> same
>>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>>> case we dont require sequential register pairs to be generated for 2 
>>>>>>>>> loads
>>>>>>>>> for. Hence it worked.
>>>>>>>>>
>>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>>
>>>>>>>>> for load fusion spill happens and we dont generate sequential 
>>>>>>>>> register pairs
>>>>>>>>> because pf spill candidate and lxvp gives incorrect results as 
>>>>>>>>> sequential register
>>>>>>>>> pairs are required for lxvp.
>>>>>>>>
>>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>>
>>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>>>
>>>>>>>
>>>>>>> After fusion pass:
>>>>>>>
>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>>> [240])
>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>>>>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>>>>>> {vsx_movv2df_64bit}
>>>>>>>      (nil))
>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>>> [240])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) 
>>>>>>> real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>>> [240]))))) {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>> In LRA reload.
>>>>>>>
>>>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM 
>>>>>>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) 
>>>>>>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] 
>>>>>>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 
>>>>>>> A64])
>>>>>>>         (nil)))
>>>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) 
>>>>>>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>>>                 (reg:V2DF 4283 [3119])
>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 
>>>>>>> ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>>
>>>>>>> In LRA reload sequential registers are not generated as r2572 is splled 
>>>>>>> and move to spill location
>>>>>>> in stack and subsequent uses loads from stack. Hence sequential 
>>>>>>> registers pairs are not generated.
>>>>>>>
>>>>>>> lxvp vsx0, 0(r1).
>>>>>>>
>>>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use 
>>>>>>> sequential register pairs.
>>>>>>>
>>>>>>> Without load fusion since 2 loads exists and 2 loads need not require 
>>>>>>> sequential registers
>>>>>>> hence it worked but with load fusion and using lxvp it requires 
>>>>>>> sequential register pairs.
>>>>>>
>>>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>>>> less allocation freedom?
>>>>>>
>>>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>>>> is correct in this context).  What does the allocated code look like,
>>>>>> and why is it wrong?
>>>>>>
>>>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is 
>>>>>> reloaded
>>>>>> for insn 2412 on its own, only the second half of the register needs to 
>>>>>> be
>>>>>> loaded from memory.
>>>>>>
>>>>>
>>>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>>>> could be because of less registers available in order to generate
>>>>> sequential registers for lxvp.
>>>>>
>>>>> Because of spill sequential registers are not generated and breaks the
>>>>> correctness. REG_EQUIV is generated by IRA.
>>>>>
>>>>> Allocated code because of spill doesn't generate sequential registers.
>>>>> In LRA reload because of spill marked for IRA adjust to another
>>>>> register causing not to generate sequential registers.
>>>>>
>>>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>>>> loaded from memory instead of generating sequential registers.
>>>>>
>>>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>>>
>>>>> We can add heuristics in fusion code not to fuse for longer live
>>>>> ranges that should solve the problem.
>>>>
>>>> This doesn't describe the real problem though.  It's natural for
>>>> registers to be spilled sometimes.  That would happen for OOmode
>>>> registers even if we didn't form them in the fusion pass.  And it's
>>>> normal GCC semantics that the two hard registers allocated to an OOmode
>>>> pseudo are consecutive.
>>>>
>>>> Like I asked above, please show the allocated code (including spills
>>>> and reloads) and explain why it's wrong.
>>>>
>>>
>>> After LRA reload:
>>>
>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>> [240])
>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>> {vsx_movv2df_64bit}
>>>      (nil))
>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>> [240])
>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
>>> [(real(kind=8) *)_4050]+16 ])
>>>                 (reg:V2DF 44 12 [3119])
>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>> [240]))))) {*vsx_nfmsv2df4}
>>>      (nil))
>>>
>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] 
>>> [905])
>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> 
>>> [(real(kind=8) *)_4050] ] [2561])
>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>      (nil))
>>>
>>> In the above allocated code it assign registers 51 and 47 and they are not 
>>> sequential.
>> 
>> The reload for 2412 looks valid.  What was the original pre-reload
>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>> 
>
> This is preload version of 2473:
>
> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> 
> [(real(kind=8) *)_4050] ]) 0)
>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 
> 0))))) {*vsx_nfmsv2df4}
>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> 
> [(real(kind=8) *)_4050] ])
>             (nil))))
>
> insn 2472 is replaced with 9299 after reload.


You'd have to check the dumps to be sure, but I think 9299 is instead
generated as an input reload of 2412, rather than being a replacement
of insn 2472.  That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
from memory for insn 2412.  It can use the destination of insn 2412 (r51)
as a temporary to do that.  It doesn't need to load the other half of
reg:OO 2572 for this instruction.  That in itself looks ok.

So it looks like the problem is specific to insn 2473.  Perhaps LRA
thinks that r47 already contains the low half of (reg:OO 2572),
left behind by some previous instruction not shown above?
If LRA is wrong about that -- if r47 doesn't already contain the
low half of (reg:OO 2572) -- then there's a bug somewhere.
But we need to track down and fix the bug rather than try to sidestep
it in the fusion pass.

Richard

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to