Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

Ajit Agarwal Wed, 10 Apr 2024 11:19:10 -0700

Hello Alex:

On 10/04/24 7:52 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 10/04/2024 15:31, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>> Hi Ajit,
>>>
>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>
>>>>>>>>>> All review comments are incorporated.
> <snip>
>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t 
>>>>>>>>>> &left_list,
>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> -                             access_group &group)
>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> +                                     access_group &group)
>>>>>>>>>>  {
>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>> encoded_lfs,
>>>>>>>>>>                     access.cand_insns,
>>>>>>>>>>                     lfs.load_p,
>>>>>>>>>>                     access_size);
>>>>>>>>>> -      skip_next = access.cand_insns.empty ();
>>>>>>>>>> +      skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>
>>>>>>>>> As above, why is this needed?
>>>>>>>>
>>>>>>>> For rs6000 we want to return always true. as load store pair
>>>>>>>> that are to be merged with 8/16 16/32 32/64 is occuring for rs6000.
>>>>>>>> And we want load store pair to 8/16 32/64. Thats why we want
>>>>>>>> to generate always true for rs6000 to skip pairs as above.
>>>>>>>
>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you 
>>>>>>> have
>>>>>>> load/store pair instructions where the two arms of the access are 
>>>>>>> storing
>>>>>>> operands of different sizes?  Or something else?
>>>>>>>
>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>> explain?
>>>>>>>
>>>>>>
>>>>>> In merge_pair we get the 2 load candiates one load from 0 offset and
>>>>>> other load is from 16th offset. Then in next iteration we get load
>>>>>> from 16th offset and other load from 32 offset. In next iteration
>>>>>> we get load from 32 offset and other load from 48 offset.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> Currently we get the load candiates as follows.
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from 0th offset.
>>>>>> load from 16th offset.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from 16th offset.
>>>>>> load from 32th offset.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from 32th offset
>>>>>> load from 48th offset.
>>>>>>
>>>>>> Instead in rs6000 we should get:
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from 0th offset
>>>>>> load from 16th offset.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from 32th offset
>>>>>> load from 48th offset.
>>>>>
>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>> single operand size> 
>>>>
>>>> We get load pair at a certain point with (0,16) and other program
>>>> point we get load pair (32, 48).
>>>>
>>>> In current implementation it takes offsets loads as (0, 16),
>>>> (16, 32), (32, 48).
>>>>
>>>> But In rs6000 we want  the load pair to be merged at different points
>>>> as (0,16) and (32, 48). for (0,16) we want to replace load lxvp with
>>>> 0 offset and other load (32, 48) with lxvp with 32 offset.
>>>>
>>>> In current case it will merge with lxvp with 0 offset and lxvp with
>>>> 16 offset, then lxvp with 32 offset and lxvp with 48 offset which
>>>> is incorrect in our case as the (16-32) case 16 offset will not
>>>> load from even register and break for rs6000.
>>>
>>> Sorry, I think I'm still missing something here.  Why does the address 
>>> offset
>>> affect the parity of the tranfser register?  ISTM they needn't be related at
>>> all (and indeed we can't even know the parity of the transfer register 
>>> before
>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>
>>
>> We have load pair with (0,16) wherein these loads are adjacent and
>> replaced with lxvp.
>>
>> Semantic of lxvp instruction is that it loads adjacent load pair in
>> even register and even_register + 1.
>>
>> We replace the above load pair with lxvp instruction and then we
>> dont need to merge (16,32) as (0, 16) is already merged and instead
>> we merge (32,48).
> 
> Ok, but the existing logic should already account for this.  I.e. if we
> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
> So I don't see that there's anything specific to lxvp that requires this logic
> to change, _unless_ you have a stricter alignment requirement on the offsets 
> as
> I mentioned before.
>


Thanks for the suggestion. It worked for rs6000 also with current changes.
Sorry for the confusion.

>>
>> Yes you are correct, the addresss offset doesn't affect the parity of
>> the register transfer as we are doing fusion pass before RA.
>> If that is the case then I think it would be better to introduce a
>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>> check:
>>>>>
>>>>>   multiple_p (offset, access_size)
>>>>>
>>>>> and for rs6000 it could check:
>>>>>
>>>>>   multiple_p (offset, access_size * 2)
>>>>>
>>>>> and we would then incorporate a call to that predicate in the else if
>>>>> condition of tranform_for_base.
>>>>>
>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>
>> I have addressed the above hooks and it worked fine with both rs6000
>> and aarch64. I am sending subsequent patch in some time that address
>> above.
>>
>>> How do you plan on handling this even-odd requirement for rs6000?
>>>
>>
>> We plan to handle with V16QI subreg: 0 and V16QI subreg : 16 to
>> generate register pair and thats what we generate and implement
>> in rs6000 target
>> code.
> 
> Ah, this is coming back to me now.  Sorry, I should have remembered this from
> the previous discussion with Richard S.
> 
> Apologies for going on a slight tangent here, but if you're running
> before RA are you planning to create a new OImode pseudo register for
> the lxvp insn and then somehow update uses of the old transfer registers
> to replace them with subregs of that OImode pseudo?  

Yes I do the same as you have mentioned. We generate register pairs
with 256 bit mode with two subregs of 128 bit modes with 0 and
16 offset.

Or do you just plan
> or replacing the individual loads with moves (instead of deleting them)?
> I guess the latter would be simpler and might work better in the
> presence of hard regs.
>

Would you mind explaining how to generate register pairs with lxvp by
replacing loads with moves.

Thanks & Regards
Ajit
 
> Thanks,
> Alex
> 
>>
>> Thanks & Regards
>> Ajit 
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> lxvp should load from even registers and then loaded value will
>>>> be in even register and even register +1 (which is odd).
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>> If that is the case then I think it would be better to introduce a
>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>> check:
>>>>>
>>>>>   multiple_p (offset, access_size)
>>>>>
>>>>> and for rs6000 it could check:
>>>>>
>>>>>   multiple_p (offset, access_size * 2)
>>>>>
>>>>> and we would then incorporate a call to that predicate in the else if
>>>>> condition of tranform_for_base.
>>>>>
>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>      }
>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>      }
>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>> encoded_lfs,
>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any 
>>>>>>>>>> uses
>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>  {
>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for 
>>>>>>>>>> this BB.
>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>  
>>>>>>>>>>  template<typename Map>
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>  {
>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>      {
>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>  }
>>>>>>>>>>  
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>  {
>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and 
>>>>>>>>>> promote
>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to 
>>>>>>>>>> fuse
>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>  {
>>>>>>>>>> -  const bool track_loads
>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != 
>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>> -  const bool track_stores
>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != 
>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>  
>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>
>>>>>>>>> This:
>>>>>>>>>
>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>
>>>>>>>>> can be deleted and then:
>>>>>>>>>
>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>
>>>>>>>>> can just be:
>>>>>>>>>
>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>
>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>
>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doens't clash
>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>>>  
>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>      {
>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>      continue;
>>>>>>>>>>  
>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>> -      if (reload_completed
>>>>>>>>>> -      && aarch64_ldp_writeback > 1
>>>>>>>>>> -      && GET_CODE (pat) == PARALLEL
>>>>>>>>>> -      && XVECLEN (pat, 0) == 2)
>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>      try_promote_writeback (insn);
>>>>>>>>>
>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>
>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>     return;
>>>>>>>>>
>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>
>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook 
>>>>>>>>> that you
>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>
>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>>>  
>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>      continue;
>>>>>>>>>>  
>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>> -    bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>> +    bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>> -    bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>> +    bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>      }
>>>>>>>>>>  
>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>  }
>>>>>>>>>>  
>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>  {
>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>
>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create 
>>>>>>>>> a
>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>>>  
>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>
>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>> generic code.
>>>>>>>>>
>>>>>>>>> So this function would just become:
>>>>>>>>>
>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>> pass.run ();
>>>>>>>>>
>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>
>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>     return;
>>>>>>>>>
>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>
>>>>>>>>>>  
>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>  }
>>>>>>>>>> -- 
>>>>>>>>>> 2.39.3
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

Reply via email to