Hello Alex:

On 10/04/24 7:52 pm, Alex Coplan wrote:
> Hi Ajit,
>
> On 10/04/2024 15:31, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>> Hi Ajit,
>>>
>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>
>>>>>>>>>> All review comments are incorporated.
> <snip>
>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> -                                access_group &group)
>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> +                                         access_group &group)
>>>>>>>>>>  {
>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>                                access.cand_insns,
>>>>>>>>>>                                lfs.load_p,
>>>>>>>>>>                                access_size);
>>>>>>>>>> -             skip_next = access.cand_insns.empty ();
>>>>>>>>>> +             skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>
>>>>>>>>> As above, why is this needed?
>>>>>>>>
>>>>>>>> For rs6000 we want to always return true, as load/store pairs
>>>>>>>> to be merged occur at 8/16, 16/32 and 32/64 for rs6000,
>>>>>>>> and we want only the 8/16 and 32/64 load/store pairs.  That's why
>>>>>>>> we want to always return true for rs6000, to skip pairs as above.
>>>>>>>
>>>>>>> Hmm, sorry, I'm not sure I follow.
Are you saying that for rs6000 you have
>>>>>>> load/store pair instructions where the two arms of the access are
>>>>>>> storing operands of different sizes?  Or something else?
>>>>>>>
>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>> explain?
>>>>>>>
>>>>>>
>>>>>> In merge_pairs we get the two load candidates, one load from offset 0
>>>>>> and the other load from offset 16.  Then in the next iteration we get a
>>>>>> load from offset 16 and the other load from offset 32.  In the next
>>>>>> iteration we get a load from offset 32 and the other load from offset 48.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> Currently we get the load candidates as follows.
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from offset 0.
>>>>>> load from offset 16.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from offset 16.
>>>>>> load from offset 32.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from offset 32.
>>>>>> load from offset 48.
>>>>>>
>>>>>> Instead, for rs6000 we should get:
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from offset 0.
>>>>>> load from offset 16.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from offset 32.
>>>>>> load from offset 48.
>>>>>
>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
So the pair >>>>> offsets have to be a multiple of the entire pair size rather than a >>>>> single operand size> >>>> >>>> We get load pair at a certain point with (0,16) and other program >>>> point we get load pair (32, 48). >>>> >>>> In current implementation it takes offsets loads as (0, 16), >>>> (16, 32), (32, 48). >>>> >>>> But In rs6000 we want the load pair to be merged at different points >>>> as (0,16) and (32, 48). for (0,16) we want to replace load lxvp with >>>> 0 offset and other load (32, 48) with lxvp with 32 offset. >>>> >>>> In current case it will merge with lxvp with 0 offset and lxvp with >>>> 16 offset, then lxvp with 32 offset and lxvp with 48 offset which >>>> is incorrect in our case as the (16-32) case 16 offset will not >>>> load from even register and break for rs6000. >>> >>> Sorry, I think I'm still missing something here. Why does the address >>> offset >>> affect the parity of the tranfser register? ISTM they needn't be related at >>> all (and indeed we can't even know the parity of the transfer register >>> before >>> RA, but perhaps you're only intending on running the pass after RA?) >>> >> >> We have load pair with (0,16) wherein these loads are adjacent and >> replaced with lxvp. >> >> Semantic of lxvp instruction is that it loads adjacent load pair in >> even register and even_register + 1. >> >> We replace the above load pair with lxvp instruction and then we >> dont need to merge (16,32) as (0, 16) is already merged and instead >> we merge (32,48). > > Ok, but the existing logic should already account for this. I.e. if we > successfully merge (0,16), then we don't attempt to merge (16,32). We'd only > attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason). > So I don't see that there's anything specific to lxvp that requires this logic > to change, _unless_ you have a stricter alignment requirement on the offsets > as > I mentioned before. >
Thanks for the suggestion.  It worked for rs6000 also with the current
changes.  Sorry for the confusion.
>>
>> Yes, you are correct: the address offset doesn't affect the parity of
>> the register transfer, as we are doing the fusion pass before RA.
>>
>>>>> If that is the case then I think it would be better to introduce a
>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>> check:
>>>>>
>>>>>   multiple_p (offset, access_size)
>>>>>
>>>>> and for rs6000 it could check:
>>>>>
>>>>>   multiple_p (offset, access_size * 2)
>>>>>
>>>>> and we would then incorporate a call to that predicate in the else if
>>>>> condition of transform_for_base.
>>>>>
>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>
>>
>> I have addressed the above hooks and they worked fine with both rs6000
>> and aarch64.  I am sending a subsequent patch in some time that addresses
>> the above.
>>
>>> How do you plan on handling this even-odd requirement for rs6000?
>>>
>>
>> We plan to handle it with V16QI subreg 0 and V16QI subreg 16 to
>> generate the register pair, and that's what we generate and implement
>> in the rs6000 target code.
>
> Ah, this is coming back to me now.  Sorry, I should have remembered this from
> the previous discussion with Richard S.
>
> Apologies for going on a slight tangent here, but if you're running
> before RA are you planning to create a new OImode pseudo register for
> the lxvp insn and then somehow update uses of the old transfer registers
> to replace them with subregs of that OImode pseudo?

Yes, I do the same as you have mentioned.  We generate register pairs in
a 256-bit mode with two subregs of 128-bit modes at offsets 0 and 16.
Or do you just plan
> on replacing the individual loads with moves (instead of deleting them)?
> I guess the latter would be simpler and might work better in the
> presence of hard regs.
>

Would you mind explaining how to generate register pairs with lxvp by
replacing loads with moves?

Thanks & Regards
Ajit

> Thanks,
> Alex
>
>>
>> Thanks & Regards
>> Ajit
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> lxvp should load from an even register, and then the loaded value will
>>>> be in the even register and even register + 1 (which is odd).
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>> If that is the case then I think it would be better to introduce a
>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>> check:
>>>>>
>>>>>   multiple_p (offset, access_size)
>>>>>
>>>>> and for rs6000 it could check:
>>>>>
>>>>>   multiple_p (offset, access_size * 2)
>>>>>
>>>>> and we would then incorporate a call to that predicate in the else if
>>>>> condition of transform_for_base.
>>>>>
>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>      }
>>>>>>>>>>    prev_access = &access;
>>>>>>>>>>      }
>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>  {
>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>
>>>>>>>>>>  template<typename Map>
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>  {
>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>      {
>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>  {
>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>  {
>>>>>>>>>> -  const bool track_loads
>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>> -  const bool track_stores
>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>
>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>
>>>>>>>>> This:
>>>>>>>>>
>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>
>>>>>>>>> can be deleted and then:
>>>>>>>>>
>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>
>>>>>>>>> can just be:
>>>>>>>>>
>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>
>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>
>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>>>
>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>      {
>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>        continue;
>>>>>>>>>>
>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>> -      if (reload_completed
>>>>>>>>>> -          && aarch64_ldp_writeback > 1
>>>>>>>>>> -          && GET_CODE (pat) == PARALLEL
>>>>>>>>>> -          && XVECLEN (pat, 0) == 2)
>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>        try_promote_writeback (insn);
>>>>>>>>>
>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>> to make it target-independent.
I suppose this check:
>>>>>>>>>
>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>     return;
>>>>>>>>>
>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>
>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that
>>>>>>>>> you added, potentially changing it to return a boolean for load_p.
>>>>>>>>>
>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>>>
>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>          continue;
>>>>>>>>>>
>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>> -        bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>> +        bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>> -        bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>> +        bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>  {
>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>
>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>>>
>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>
>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>> generic code.
>>>>>>>>>
>>>>>>>>> So this function would just become:
>>>>>>>>>
>>>>>>>>>   aarch64_pair_fusion pass;
>>>>>>>>>   pass.run ();
>>>>>>>>>
>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>
>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>     return;
>>>>>>>>>
>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Addressed.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>  }
>>>>>>>>>> --
>>>>>>>>>> 2.39.3
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit