Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Ajit Agarwal Mon, 10 Jun 2024 02:15:58 -0700

Hello Richard:

On 10/06/24 2:12 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>>>>>>> +
>>>>>>>>>> +      rtx set = single_set (insn);
>>>>>>>>>> +      if (set == NULL_RTX)
>>>>>>>>>> +        return false;
>>>>>>>>>> +
>>>>>>>>>> +      rtx op0 = SET_SRC (set);
>>>>>>>>>> +      rtx_code code = GET_CODE (op0);
>>>>>>>>>> +
>>>>>>>>>> +      // This check is added as register pairs are not generated
>>>>>>>>>> +      // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>> +      //                  (reg2)
>>>>>>>>>> +      //                  (neg:V2DF (reg3)))
>>>>>>>>>> +      if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>> +        return false;
>>>>>>>>>
>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure why register allocator fails allocating register pairs 
>>>>>>>> with
>>>>>>>> NEG Unary operation with fma operand. I have not debugged register 
>>>>>>>> allocator why the NEG
>>>>>>>> Unary operation with fma operand. 
>>>>>>>
>>>>>>
>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits 
>>>>>> are
>>>>>> set correctly. 
>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>
>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>
>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>> kind of questions apply here too: what was the instruction before the
>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>> the rtl change incorrect (by the meaning above)?
>>>>>
>>>>
>>>> Original case where we dont do load fusion, spill happens, in that
>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>> for. Hence it worked.
>>>>
>>>> rtl change is correct and there is no error.
>>>>
>>>> for load fusion spill happens and we dont generate sequential register 
>>>> pairs
>>>> because pf spill candidate and lxvp gives incorrect results as sequential 
>>>> register
>>>> pairs are required for lxvp.
>>>
>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>> we end up not getting a sequential register pair?  What does the rtl
>>> look like (before and after things have gone wrong)?
>>>
>>> It seems like either the rtl is not describing the result of the fusion
>>> correctly or there is some problem in the .md description of lxvp.
>>>
>>
>> After fusion pass:
>>
>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>> [240])
>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> 
>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>> {vsx_movv2df_64bit}
>>      (nil))
>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>> [240])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> 
>> [(real(kind=8) *)_4050]+16 ])
>>                 (reg:V2DF 44 12 [3119])
>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>> [240]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> In LRA reload.
>>
>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM 
>> <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) 
>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] 
>> [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>         (nil)))
>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) 
>> real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>                 (reg:V2DF 4283 [3119])
>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 
>> 16)))))  {*vsx_nfmsv2df4}
>>      (nil))
>>
>>
>> In LRA reload sequential registers are not generated as r2572 is splled and 
>> move to spill location
>> in stack and subsequent uses loads from stack. Hence sequential registers 
>> pairs are not generated.
>>
>> lxvp vsx0, 0(r1).
>>
>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use 
>> sequential register pairs.
>>
>> Without load fusion since 2 loads exists and 2 loads need not require 
>> sequential registers
>> hence it worked but with load fusion and using lxvp it requires sequential 
>> register pairs.
> 
> Do you mean that this is a performance regression?  I.e. the fact that
> lxvp requires sequential registers causes extra spilling, due to having
> less allocation freedom?
> 
> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
> above looks wrong in principle (although I've no idea if the REG_EQUIV
> is correct in this context).  What does the allocated code look like,
> and why is it wrong?
> 
> If (reg:OO 2561) is spilled and then one half of it used, only that half
> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
> for insn 2412 on its own, only the second half of the register needs to be
> loaded from memory.
>


This is bwaves spec 2017 benchmark. Spill happening in register allocator
could be because of less registers available in order to generate
sequential registers for lxvp.

Because of spill sequential registers are not generated and breaks the
correctness. REG_EQUIV is generated by IRA.

Allocated code because of spill doesn't generate sequential registers.
In LRA reload because of spill marked for IRA adjust to another
register causing not to generate sequential registers.

reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
loaded from memory instead of generating sequential registers.

Other reasons for spilling because of long live range for reg:OO 2572.

We can add heuristics in fusion code not to fuse for longer live
ranges that should solve the problem.

> Richard
> 

Thanks & Regards
Ajit

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to