Hello Alex:

On 11/04/24 7:55 pm, Alex Coplan wrote:
> On 10/04/2024 23:48, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>> Hi Ajit,
>>>
>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>> Hi Ajit,
>>>>>
>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>> Hello Alex:
>>>>>>>>
>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>
>>>>>>>>>>>> All review comments are incorporated.
>>> <snip>
>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>> -                           access_group &group)
>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>> +                                   access_group &group)
>>>>>>>>>>>>  {
>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>                   access.cand_insns,
>>>>>>>>>>>>                   lfs.load_p,
>>>>>>>>>>>>                   access_size);
>>>>>>>>>>>> -    skip_next = access.cand_insns.empty ();
>>>>>>>>>>>> +    skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>>>
>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>
>>>>>>>>>> For rs6000 we want to always return true, as the load/store pairs
>>>>>>>>>> that are to be merged occur at offsets 8/16, 16/32 and 32/64 for
>>>>>>>>>> rs6000, and we want load/store pairs only at 8/16 and 32/64.  That's
>>>>>>>>>> why we want to always return true for rs6000, to skip pairs as above.
>>>>>>>>>
>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>>>>>> load/store pair instructions where the two arms of the access are storing
>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>
>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>>>> the "left-hand side" of a pair in the next iteration since we failed 
>>>>>>>>> to
>>>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>> explain?
>>>>>>>>>
>>>>>>>>
>>>>>>>> In merge_pairs we get the 2 load candidates: one load from offset 0 and
>>>>>>>> the other load from offset 16.  Then in the next iteration we get a load
>>>>>>>> from offset 16 and another load from offset 32.  In the next iteration
>>>>>>>> we get a load from offset 32 and another load from offset 48.
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> Currently we get the load candidates as follows.
>>>>>>>>
>>>>>>>> pairs:
>>>>>>>>
>>>>>>>> load from 0th offset.
>>>>>>>> load from 16th offset.
>>>>>>>>
>>>>>>>> next pairs:
>>>>>>>>
>>>>>>>> load from 16th offset.
>>>>>>>> load from 32nd offset.
>>>>>>>>
>>>>>>>> next pairs:
>>>>>>>>
>>>>>>>> load from 32nd offset
>>>>>>>> load from 48th offset.
>>>>>>>>
>>>>>>>> Instead in rs6000 we should get:
>>>>>>>>
>>>>>>>> pairs:
>>>>>>>>
>>>>>>>> load from 0th offset
>>>>>>>> load from 16th offset.
>>>>>>>>
>>>>>>>> next pairs:
>>>>>>>>
>>>>>>>> load from 32nd offset
>>>>>>>> load from 48th offset.
>>>>>>>
>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>> single operand size?
>>>>>>
>>>>>> We get a load pair at a certain program point with offsets (0,16), and at
>>>>>> another program point we get a load pair with offsets (32,48).
>>>>>>
>>>>>> The current implementation takes the load offsets as (0,16),
>>>>>> (16,32), (32,48).
>>>>>>
>>>>>> But for rs6000 we want the load pairs to be merged at different points,
>>>>>> as (0,16) and (32,48): for (0,16) we want to replace the loads with an
>>>>>> lxvp at offset 0, and the other pair (32,48) with an lxvp at offset 32.
>>>>>>
>>>>>> In the current case it will merge into an lxvp at offset 0 and an lxvp at
>>>>>> offset 16, then an lxvp at offset 32 and an lxvp at offset 48, which is
>>>>>> incorrect in our case, as in the (16,32) case the load at offset 16 will
>>>>>> not load from an even register and breaks for rs6000.
>>>>>
>>>>> Sorry, I think I'm still missing something here.  Why does the address offset
>>>>> affect the parity of the transfer register?  ISTM they needn't be related at
>>>>> all (and indeed we can't even know the parity of the transfer register before
>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>
>>>>
>>>> We have a load pair with offsets (0,16) wherein these loads are adjacent
>>>> and replaced with lxvp.
>>>>
>>>> The semantics of the lxvp instruction are that it loads the adjacent pair
>>>> into an even register and even_register + 1.
>>>>
>>>> We replace the above load pair with an lxvp instruction, and then we
>>>> don't need to merge (16,32) as (0,16) is already merged; instead
>>>> we merge (32,48).
>>>
>>> Ok, but the existing logic should already account for this.  I.e. if we
>>> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
>>> So I don't see that there's anything specific to lxvp that requires this logic
>>> to change, _unless_ you have a stricter alignment requirement on the offsets as
>>> I mentioned before.
>>>
>>
>> Thanks for the suggestion.  It worked for rs6000 as well with the current
>> changes.  Sorry for the confusion.
> 
> Alright, glad we got to the bottom of this!

Thanks.
> 
>>
>>>>
>>>> Yes, you are correct: the address offset doesn't affect the parity of
>>>> the register transfer, as we are doing the fusion pass before RA.
>>>>
>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>> check:
>>>>>>>
>>>>>>>   multiple_p (offset, access_size)
>>>>>>>
>>>>>>> and for rs6000 it could check:
>>>>>>>
>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>
>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>> condition of transform_for_base.
>>>>>>>
>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>
>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>> and aarch64.  I am sending a subsequent patch shortly that addresses
>>>> the above.
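
For reference, the rs6000 override I have in mind is roughly the sketch
below.  The hook name follows your suggestion above; the class name and
the parameter types are placeholders, not the final patch:

  struct rs6000_pair_fusion : public pair_fusion
  {
    // lxvp wants the base offset of the pair to be a multiple of the
    // whole pair size (twice the operand size), rather than of a
    // single operand size as on aarch64.
    bool pair_offset_alignment_ok_p (poly_int64 offset,
                                     unsigned access_size) override
    {
      return multiple_p (offset, access_size * 2);
    }
  };
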
>>>>
>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>
>>>>
>>>> We plan to handle this with a V16QI subreg at offset 0 and a V16QI
>>>> subreg at offset 16 to generate the register pair, and that's what we
>>>> generate and implement in the rs6000 target code.
>>>
>>> Ah, this is coming back to me now.  Sorry, I should have remembered this from
>>> the previous discussion with Richard S.
>>>
>>> Apologies for going on a slight tangent here, but if you're running
>>> before RA are you planning to create a new OImode pseudo register for
>>> the lxvp insn and then somehow update uses of the old transfer registers
>>> to replace them with subregs of that OImode pseudo?  
>>
>> Yes, I do the same as you have mentioned.  We generate register pairs
>> in a 256-bit mode with two subregs of 128-bit modes at offsets 0 and 16.
>>
>>> Or do you just plan
>>> on replacing the individual loads with moves (instead of deleting them)?
>>> I guess the latter would be simpler and might work better in the
>>> presence of hard regs.
>>>
>>
>> Would you mind explaining how to generate register pairs with lxvp by
>> replacing loads with moves?
> 
> Yeah, so suppose you have something like:
> 
> (set (reg:V16QI v1) (mem:V16QI addr))
> (set (reg:V16QI v2) (mem:V16QI addr+16))
> 
> then when you insert the lxvp you can then (logically) replace the
> original load instructions with moves from the appropriate subreg, as
> follows:
> 
> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
> 

Any pseudo created with gen_rtx_REG, e.g.
gen_rtx_REG (OOmode, REGNO (dest_exp)), gets rejected by LRA
with an unrecognizable insn error.

If I create the pseudo with gen_reg_rtx (OOmode), it errors out
because the pseudo register is not found in change->new_defs.

Also, the sequential register pairs are not generated by the
register allocator.

That's why I haven't used the above method, although I had also
thought it through.

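For reference, the shape of what I tried follows your example above.
This is only a sketch: v1, v2, mem_lo and dest_exp stand in for the
operands of the two original V16QI loads.

  /* Creating the pair pseudo this way fails: the pseudo is not found
     in change->new_defs.  */
  rtx pair_reg = gen_reg_rtx (OOmode);
  /* The alternative, gen_rtx_REG (OOmode, REGNO (dest_exp)), is
     rejected by LRA as an unrecognizable insn.  */

  rtx lxvp = gen_rtx_SET (pair_reg, adjust_address (mem_lo, OOmode, 0));
  rtx set_lo = gen_rtx_SET (v1, simplify_gen_subreg (V16QImode, pair_reg,
                                                     OOmode, 0));
  rtx set_hi = gen_rtx_SET (v2, simplify_gen_subreg (V16QImode, pair_reg,
                                                     OOmode, 16));
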
Please let me know what you think.

Thanks & Regards
Ajit 
> now I'm not sure off-hand if this exact combination of subregs and mode
> changes is valid, but hopefully you get the idea.  The benefit of this
> approach is that it keeps the transformation local and is more
> compile-time efficient (we don't have to look through all mentions of
> v1/v2 and replace them with subregs).  I'd hope that the RA can then
> clean up any redundant moves (especially in the case that v1/v2 are
> pseudos).
> 
> That would mean that you don't need all the grubbing around with DF
> looking for occurrences of the transfer regs.  Instead we'd just need
> some logic to insert the extra moves after the lxvp for rs6000, and I
> think this might fit more cleanly into the current pass.
> 
> Does that make sense?  In any case, this shouldn't really affect the
> preparatory aarch64 patch because we were planning to defer adding any
> hooks that are only needed for rs6000 from the initial aarch64/generic
> split.
> 
> Thanks,
> Alex
> 
>>
>> Thanks & Regards
>> Ajit
>>  
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> Thanks & Regards
>>>> Ajit 
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> lxvp should load from even registers, and then the loaded value will
>>>>>> be in an even register and even register + 1 (which is odd).
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>> check:
>>>>>>>
>>>>>>>   multiple_p (offset, access_size)
>>>>>>>
>>>>>>> and for rs6000 it could check:
>>>>>>>
>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>
>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>> condition of transform_for_base.
>>>>>>>
>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>    }
>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>      }
>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>  {
>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>  
>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>  {
>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>      {
>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>  {
>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>  {
>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>  
>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>
>>>>>>>>>>> This:
>>>>>>>>>>>
>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>
>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>
>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>
>>>>>>>>>>> can just be:
>>>>>>>>>>>
>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>
>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>
>>>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>>>  
>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>      {
>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>    continue;
>>>>>>>>>>>>  
>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>> -    && aarch64_ldp_writeback > 1
>>>>>>>>>>>> -    && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>> -    && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>    try_promote_writeback (insn);
>>>>>>>>>>>
>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>
>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>     return;
>>>>>>>>>>>
>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>
>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>
>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
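
The reworked hook I have follows roughly this shape; it is a sketch
only, and the exact signature (in particular how load_p is reported
back to the caller) is a guess rather than the final patch:

  bool
  aarch64_pair_fusion::pair_mem_promote_writeback_p (insn_info *insn,
                                                     bool &load_p)
  {
    rtx pat = PATTERN (insn->rtl ());
    if (!reload_completed
        || aarch64_ldp_writeback <= 1
        || GET_CODE (pat) != PARALLEL
        || XVECLEN (pat, 0) != 2)
      return false;

    const auto attr = get_attr_ldpstp (insn->rtl ());
    if (attr == LDPSTP_NONE)
      return false;

    load_p = (attr == LDPSTP_LDP);
    gcc_checking_assert (load_p || attr == LDPSTP_STP);
    return true;
  }
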
>>>>>>>>>>>>  
>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>    continue;
>>>>>>>>>>>>  
>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>> -  bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>> +  bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>> -  bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>> +  bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>      }
>>>>>>>>>>>>  
>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>  {
>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>
>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>>>  
>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>
>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>>>> generic code.
>>>>>>>>>>>
>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>
>>>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>>>> pass.run ();
>>>>>>>>>>>
>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>
>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>     return;
>>>>>>>>>>>
>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>
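
The run () member I added follows roughly this shape; a sketch only,
with the early return you suggested folded in (details may differ in
the posted patch):

  void
  pair_fusion::run ()
  {
    if (!track_loads_p () && !track_stores_p ())
      return;

    ldp_fusion_init ();

    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);

    ldp_fusion_destroy ();
  }
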
>>>>>>>>>>>>  
>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>  }
>>>>>>>>>>>> -- 
>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Ajit
