Hello Alex:

On 12/04/24 11:02 pm, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 12/04/24 8:15 pm, Alex Coplan wrote:
>> On 12/04/2024 20:02, Ajit Agarwal wrote:
>>> Hello Alex:
>>>
>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
>>>>> Hello Alex:
>>>>>
>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>>>>> Hi Ajit,
>>>>>>
>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>>>>> Hello Alex:
>>>>>>>
>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>>>>> Hi Ajit,
>>>>>>>>
>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>>>>> Hello Alex:
>>>>>>>>>
>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>>>>> Hello Alex:
>>>>>>>>>>>
>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> All review comments are incorporated.
>>>>>> <snip>
>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t 
>>>>>>>>>>>>>>> &left_list,
>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>> -                                access_group &group)
>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>> +                                        access_group &group)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>>>>>>> encoded_lfs,
>>>>>>>>>>>>>>>                        access.cand_insns,
>>>>>>>>>>>>>>>                        lfs.load_p,
>>>>>>>>>>>>>>>                        access_size);
>>>>>>>>>>>>>>> -         skip_next = access.cand_insns.empty ();
>>>>>>>>>>>>>>> +         skip_next = bb_state->cand_insns_empty_p 
>>>>>>>>>>>>>>> (access.cand_insns);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For rs6000 we want to always return true, because load/store
>>>>>>>>>>>>> pairs to be merged at offsets 8/16, 16/32 and 32/64 all occur
>>>>>>>>>>>>> for rs6000, but we only want the load/store pairs at 8/16 and
>>>>>>>>>>>>> 32/64.  That's why we want to always return true for rs6000,
>>>>>>>>>>>>> to skip pairs as above.
>>>>>>>>>>>>
>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 
>>>>>>>>>>>> you have
>>>>>>>>>>>> load/store pair instructions where the two arms of the access are 
>>>>>>>>>>>> storing
>>>>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>>>>
>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>>>>> exhausted all the candidate insns for the current access.  In the 
>>>>>>>>>>>> case
>>>>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that 
>>>>>>>>>>>> when
>>>>>>>>>>>> access becomes prev_access, we can attempt to use those candidates 
>>>>>>>>>>>> as
>>>>>>>>>>>> the "left-hand side" of a pair in the next iteration since we 
>>>>>>>>>>>> failed to
>>>>>>>>>>>> use them as the "right-hand side" of a pair in the current 
>>>>>>>>>>>> iteration.
>>>>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>>>>> explain?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In merge_pairs we get the 2 load candidates: one load from the
>>>>>>>>>>> 0 offset and the other load from the 16th offset.  Then in the
>>>>>>>>>>> next iteration we get a load from the 16th offset and the other
>>>>>>>>>>> load from the 32nd offset.  In the next iteration we get a load
>>>>>>>>>>> from the 32nd offset and the other load from the 48th offset.
>>>>>>>>>>>
>>>>>>>>>>> For example:
>>>>>>>>>>>
>>>>>>>>>>> Currently we get the load candidates as follows.
>>>>>>>>>>>
>>>>>>>>>>> pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 0th offset.
>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>
>>>>>>>>>>> next pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>> load from 32nd offset.
>>>>>>>>>>>
>>>>>>>>>>> next pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 32nd offset
>>>>>>>>>>> load from 48th offset.
>>>>>>>>>>>
>>>>>>>>>>> Instead in rs6000 we should get:
>>>>>>>>>>>
>>>>>>>>>>> pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 0th offset
>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>
>>>>>>>>>>> next pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 32nd offset
>>>>>>>>>>> load from 48th offset.
>>>>>>>>>>
>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider 
>>>>>>>>>> merging
>>>>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they 
>>>>>>>>>> have
>>>>>>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>>>>> single operand size?
>>>>>>>>>
>>>>>>>>> We get load pair at a certain point with (0,16) and other program
>>>>>>>>> point we get load pair (32, 48).
>>>>>>>>>
>>>>>>>>> In the current implementation it takes the load offsets as (0, 16),
>>>>>>>>> (16, 32), (32, 48).
>>>>>>>>>
>>>>>>>>> But in rs6000 we want the load pairs to be merged at different
>>>>>>>>> points, as (0,16) and (32,48): for (0,16) we want to replace the
>>>>>>>>> loads with an lxvp with 0 offset, and the other loads (32,48)
>>>>>>>>> with an lxvp with 32 offset.
>>>>>>>>>
>>>>>>>>> In the current case it will merge to an lxvp with 0 offset and an
>>>>>>>>> lxvp with 16 offset, then an lxvp with 32 offset and an lxvp with
>>>>>>>>> 48 offset, which is incorrect in our case, as in the (16,32) case
>>>>>>>>> the 16 offset will not load from an even register and will break
>>>>>>>>> for rs6000.
>>>>>>>>
>>>>>>>> Sorry, I think I'm still missing something here.  Why does the address 
>>>>>>>> offset
>>>>>>>> affect the parity of the transfer register?  ISTM they needn't be 
>>>>>>>> related at
>>>>>>>> all (and indeed we can't even know the parity of the transfer register 
>>>>>>>> before
>>>>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>>>>
>>>>>>>
>>>>>>> We have load pair with (0,16) wherein these loads are adjacent and
>>>>>>> replaced with lxvp.
>>>>>>>
>>>>>>> Semantic of lxvp instruction is that it loads adjacent load pair in
>>>>>>> even register and even_register + 1.
>>>>>>>
>>>>>>> We replace the above load pair with lxvp instruction and then we
>>>>>>> don't need to merge (16,32) as (0, 16) is already merged and instead
>>>>>>> we merge (32,48).
>>>>>>
>>>>>> Ok, but the existing logic should already account for this.  I.e. if we
>>>>>> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd 
>>>>>> only
>>>>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever 
>>>>>> reason).
>>>>>> So I don't see that there's anything specific to lxvp that requires this 
>>>>>> logic
>>>>>> to change, _unless_ you have a stricter alignment requirement on the 
>>>>>> offsets as
>>>>>> I mentioned before.
>>>>>>
>>>>>
>>>>> Thanks for the suggestion. It worked for rs6000 also with current changes.
>>>>> Sorry for the confusion.
>>>>
>>>> Alright, glad we got to the bottom of this!
>>>
>>> Thanks.
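
To double-check this conclusion, here is a minimal standalone sketch (not
the GCC sources; names are mine, and the merge is assumed to always
succeed) of the skip_next behaviour in transform_for_base.  With offsets
{0, 16, 32, 48} and a 16-byte access size it produces exactly the
(0,16), (32,48) grouping we want for rs6000:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Standalone model of the skip_next logic: after a successful merge of
// (prev, cur), cur is not reused as the left-hand side of another pair.
static std::vector<std::pair<int, int>>
merge_adjacent (const std::vector<int> &offsets, int access_size)
{
  std::vector<std::pair<int, int>> pairs;
  bool skip_next = false;
  for (std::size_t i = 0; i < offsets.size (); i++)
    {
      if (skip_next)
	{
	  // offsets[i] was consumed as the RHS of the previous pair.
	  skip_next = false;
	  continue;
	}
      if (i + 1 < offsets.size ()
	  && offsets[i + 1] - offsets[i] == access_size)
	{
	  pairs.push_back ({offsets[i], offsets[i + 1]});
	  skip_next = true;  // model: the merge always succeeds
	}
    }
  return pairs;
}
```

So no rs6000-specific cand_insns_empty_p hook is needed for this.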
>>>>
>>>>>
>>>>>>>
>>>>>>> Yes, you are correct: the address offset doesn't affect the parity of
>>>>>>> the register transfer, as we are doing the fusion pass before RA.
>>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I 
>>>>>>>>>> guess
>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it 
>>>>>>>>>> should
>>>>>>>>>> check:
>>>>>>>>>>
>>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>>
>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>
>>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>>
>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>>> condition of transform_for_base.
>>>>>>>>>>
>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR 
>>>>>>>>>> offset
>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>>
>>>>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>>>>> and aarch64.  I will send a subsequent patch shortly that addresses
>>>>>>> the above.
>>>>>>>
>>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>>>>
>>>>>>>
>>>>>>> We plan to handle it with V16QI subreg:0 and V16QI subreg:16 to
>>>>>>> generate the register pair, and that's what we generate and
>>>>>>> implement in the rs6000 target code.
>>>>>>
>>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered this 
>>>>>> from
>>>>>> the previous discussion with Richard S.
>>>>>>
>>>>>> Apologies for going on a slight tangent here, but if you're running
>>>>>> before RA are you planning to create a new OImode pseudo register for
>>>>>> the lxvp insn and then somehow update uses of the old transfer registers
>>>>>> to replace them with subregs of that OImode pseudo?  
>>>>>
>>>>> Yes, I do the same as you have mentioned.  We generate register pairs
>>>>> in a 256-bit mode with two subregs of 128-bit modes at offsets 0 and
>>>>> 16.
>>>>>
>>>>>> Or do you just plan
>>>>>> on replacing the individual loads with moves (instead of deleting them)?
>>>>>> I guess the latter would be simpler and might work better in the
>>>>>> presence of hard regs.
>>>>>>
>>>>>
>>>>> Would you mind explaining how to generate register pairs with lxvp by
>>>>> replacing loads with moves?
>>>>
>>>> Yeah, so suppose you have something like:
>>>>
>>>> (set (reg:V16QI v1) (mem:V16QI addr))
>>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
>>>>
>>>> then when you insert the lxvp you can then (logically) replace the
>>>> original load instructions with moves from the appropriate subreg, as
>>>> follows:
>>>>
>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
>>>>
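
As a sanity check on the semantics of this rewrite, here is a small
standalone model (plain C++, not GCC RTL; the types merely stand in for
the modes): one 32-byte load followed by taking the 16-byte subwords at
byte offsets 0 and 16 yields the same values as the two original
16-byte loads.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using v16qi = std::array<std::uint8_t, 16>;   // models a V16QI register
using pair256 = std::array<std::uint8_t, 32>; // models the OImode pair pseudo

static v16qi load_v16qi (const std::uint8_t *addr)
{
  v16qi v;
  std::memcpy (v.data (), addr, 16);
  return v;
}

// Models the lxvp: a single 32-byte load into the pair pseudo.
static pair256 load_pair (const std::uint8_t *addr)
{
  pair256 p;
  std::memcpy (p.data (), addr, 32);
  return p;
}

// Models (subreg:V16QI (reg:OI pair-pseudo) byte_off).
static v16qi subreg_v16qi (const pair256 &pair, int byte_off)
{
  v16qi v;
  std::memcpy (v.data (), pair.data () + byte_off, 16);
  return v;
}
```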
>>>
>>> Any pseudo created with gen_rtx_REG, like
>>> gen_rtx_REG (OOmode, REGNO (dest_exp)), will be rejected by LRA
>>> with an unrecognized insn error.
>>
>> I'm not surprised that goes wrong: you can't just create a new REG
>> rtx in a different mode and reuse the regno of an existing pseudo.
>>

Thanks for the suggestion.
>>>
>>> If I create the pseudo with gen_reg_rtx (OOmode), it errors out
>>> because the new pseudo register is not found in
>>> change->new_defs.
>>
>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
>>
> 
> Would you mind explaining how I can add an RTL-SSA def for the
> new pseudo?

I have added an RTL-SSA def for the new pseudo.  With that I could
get the register pairs correctly.
> 
>>>
>>> Also, the sequential register pairs are not generated by the
>>> register allocator.
>>
>> So how do you get the sequential pairs with your current approach?  My
>> understanding was that what I suggested above doesn't really change what
>> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
>> understand the approach taken in your initial patchset.
>>
> 
> I generate (set (reg:OO pair-pseudo) (mem:OO addr)) ; lxvp
> and then at the use point of pair-pseudo generate the following:
> 
> (subreg:V16QI (reg:OO pair-pseudo) 0)
> (subreg:V16QI (reg:OO pair-pseudo) 16)
> 

Generating the RTL as you have suggested, I get the register pairs
correctly, but we get extra moves that would impact performance.

Please let me know what you think.

Thanks & Regards
Ajit
> Thanks & Regards
> Ajit
>> Thanks,
>> Alex
>>
>>>
>>> That's why I hadn't used the above method, though I also
>>> thought it through.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks & Regards
>>> Ajit 
>>>> now I'm not sure off-hand if this exact combination of subregs and mode
>>>> changes is valid, but hopefully you get the idea.  The benefit of this
>>>> approach is that it keeps the transformation local and is more
>>>> compile-time efficient (we don't have to look through all mentions of
>>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
>>>> clean up any redundant moves (especially in the case that v1/v2 are
>>>> pseudos).
>>>>
>>>> That would mean that you don't need all the grubbing around with DF
>>>> looking for occurrences of the transfer regs.  Instead we'd just need
>>>> some logic to insert the extra moves after the lxvp for rs6000, and I
>>>> think this might fit more cleanly into the current pass.
>>>>
>>>> Does that make sense?  In any case, this shouldn't really affect the
>>>> preparatory aarch64 patch because we were planning to defer adding any
>>>> hooks that are only needed for rs6000 from the initial aarch64/generic
>>>> split.
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>  
>>>>>> Thanks,
>>>>>> Alex
>>>>>>
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Ajit 
>>>>>>>> Thanks,
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>>
>>>>>>>>> lxvp should load from an even register, and then the loaded value
>>>>>>>>> will be in the even register and even register + 1 (which is odd).
>>>>>>>>>
>>>>>>>>> Thanks & Regards
>>>>>>>>> Ajit
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>> Ajit
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>>>>>>> encoded_lfs,
>>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent 
>>>>>>>>>>>>>>> any uses
>>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn 
>>>>>>>>>>>>>>> for this BB.
>>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try 
>>>>>>>>>>>>>>> and promote
>>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try 
>>>>>>>>>>>>>>> to fuse
>>>>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != 
>>>>>>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != 
>>>>>>>>>>>>>>> AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> can just be:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think there's a particular need to change the variable 
>>>>>>>>>>>>>> name
>>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't 
>>>>>>>>>>>>>> clash
>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>         continue;
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>>>>> -         && aarch64_ldp_writeback > 1
>>>>>>>>>>>>>>> -         && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>>>>> -         && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>>>>         try_promote_writeback (insn);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some 
>>>>>>>>>>>>>> further work
>>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p 
>>>>>>>>>>>>>> hook that you
>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn 
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>>>>         continue;
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>>>>> -       bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>> +       bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>>>>> -       bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>> +       bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to 
>>>>>>>>>>>>>> create a
>>>>>>>>>>>>>> pointer to the parent class, just call any function you like 
>>>>>>>>>>>>>> directly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a 
>>>>>>>>>>>>>> member
>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part 
>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>> generic code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>>>>>>> pass.run ();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code 
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>>> Ajit
