On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi Bin,
>
> on 2020/8/10 8:38 PM, Bin.Cheng wrote:
> > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for the review!!
> >>
> >> on 2020/8/8 4:01 PM, Bin.Cheng wrote:
> >>> Hi Kewen,
> >>> Sorry for the late reply.
> >>> The patch's most important change is the cost computation below:
> >>>
> >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
> >>>>   cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>>>
> >>>> +  /* Consider additional step updates during unrolling.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> >>> This is a bit strange; to me, the add instructions are additional
> >>> computation caused by unrolling + addressing mode, rather than a
> >>> native part of the candidate itself.  Specifically, an additional
> >>> cost is needed if candidates (without reg_offset_p) are chosen for
> >>> the address type groups/uses.
> >>
> >> Good point.  Ideally it should be one additional cost for each cand
> >> set: when we select one cand for one group, we need to check whether
> >> this pair needs the extra (estimated_unroll - 1) step costs, and we
> >> probably need to care about this during remove/replace etc.  IIUC the
> >> current IVOPTs cost framework doesn't support this, and it could
> >> increase the selection complexity and time.  So I hesitated to do it
> >> and instead put it into the cand step cost initially.
> >>
> >> I was thinking that candidates with reg_offset_p should only be used
> >> for the reg_offset_p groups in most cases (very limited), while the
> >> others are simply scaled up like before.
> >> But indeed this can also cover some similar cases; e.g. when one cand
> >> is only used for the compare type group that closes the loop, it
> >> doesn't need more step costs for unrolling.
> >>
> >> Would you prefer me to improve the current cost framework?
> > No, I don't think it's relevant to the candidate selection algorithm.
> > I was thinking about adjusting the cost somehow in
> > determine_group_iv_cost_address.  Given we don't expose the selected
> > addressing mode in this function, you may need to do it in
> > get_address_cost, either way.
> >
>
> Thanks for your suggestion!
>
> Sorry, I may be missing something, but I still think the additional
> cost is per candidate.  The justification is that we fail to model the
> iv candidate step well in the context of unrolling; the step cost is
> part of the candidate cost, which is per candidate.
>
> Initializing it in determine_iv_cost isn't perfect, as you pointed out.
> Ideally we should check whether any use of the candidate requires an iv
> update after each replicated iteration, take extra step costs into
> account if at least one does, and meanwhile scale up all the
> computation costs to reflect the nature of unrolling costs.
I see, it's similar to the auto-increment case where the cost should be
recorded only once.  So this is okay given 1) precisely predicting RTL
unrolling is likely impossible here; 2) the patch has very limited
impact.
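For reference, the per-candidate cost adjustment under discussion can be sketched as below.  This is an illustrative model only, not the actual GCC internals: the function and parameter names are invented, and the real determine_iv_cost works on `struct iv_cand` and target-specific `add_cost` values.

```c
#include <assert.h>

/* Illustrative sketch only -- not the GCC API.  A non reg_offset
   candidate needs its step update replicated in every unrolled copy of
   the loop body, so the patch charges (estimated_unroll - 1) extra step
   costs up front, as part of the candidate cost.  */
static unsigned
iv_cand_cost (unsigned setup_cost, unsigned step_cost,
              unsigned estimated_unroll, int reg_offset_p)
{
  unsigned cost = setup_cost + step_cost;
  /* Non reg_offset cands pay one extra step per replicated iteration.  */
  if (estimated_unroll > 1 && !reg_offset_p)
    cost += (estimated_unroll - 1) * step_cost;
  return cost;
}
```

The point of contention in the thread is visible here: the extra `(estimated_unroll - 1) * step_cost` term is charged unconditionally to every non reg_offset candidate, even one ultimately used only by a compare-type group, which is why the compare group cost is excluded from the corresponding scaling.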
Thanks,
bin
>
> Besides, the reg_offset desirable pair already takes zero cost for the
> cand/group cost, and IIRC negative costs aren't preferred in IVOPTs.
> Are you suggesting increasing the cost for the non reg_offset pairs?
> If so, and per pair, the extra cost could be computed several times
> unexpectedly.
>
> >>
> >>>> +
> >>>>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>>>      The reason is to make debugging simpler; so this is not relevant for
> >>>>      artificial ivs created by other optimization passes.  */
> >>>>
> >>>
> >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>>>       return;
> >>>>     }
> >>>>
> >>>> +  /* Since we priced more on non reg_offset IV cand step cost, we should
> >>>> +     scale up the appropriate IV group costs.  Simply consider USE_COMPARE
> >>>> +     at the loop exit, FIXME if multiple exits supported or no loop exit
> >>>> +     comparisons matter.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p
> >>>> +      && group->vuses[0]->type != USE_COMPARE)
> >>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> >>> I don't quite follow here: "pricing more on non-reg_offset IV cand"
> >>> doesn't make much sense to me.  Also, why aren't generic type uses
> >>> skipped?  We want to model the cost required for address computation;
> >>> however, for generic type uses there is no way to save the
> >>> computation in an "address expression".  Once unrolled, the
> >>> computation is always there?
> >>>
> >>
> >> The main intention is to scale up the group/cand cost for unrolling,
> >> since we have scaled up the step costs.  The assumption is that the
> >> original
> > If we adjust the cost appropriately in function
> > *group_iv_cost_address, this would become unnecessary, right?  And
> > naturally so.
> >> costing (without this patch) can be viewed as either for all unrolled
> >> iterations or just one single iteration.
> >> Since IVOPTs doesn't support fractional costing, I interpreted it as
> >> a single iteration.  To emulate the unrolling scenario on top of the
> >> previous step cost scaling, we need to scale up the cost of all the
> >> computations.
> >>
> >> In most cases, the compare type use is for loop closing; there is
> >> still only one computation even when unrolling happens, so I excluded
> >> it here.  As the "FIXME" says, if we find some cases are off, we can
> >> further restrict it to those USE_COMPARE uses which are exactly for
> >> loop closing.
> >>
> >>> And what's the impact on targets supporting the
> >>> [base + index + offset] addressing mode?
> >>
> >> Good question.  I didn't notice it since Power doesn't support it.
> >> I noticed the comment on function addr_offset_valid_p only mentions
> >> [base + offset]; I guess it excludes [base + index + offset]?
> >> But I guess the address-based IV can work for this mode?
> > No, addr_offset_valid_p is only used to split address use groups.  See
> > get_address_cost and struct mem_address.
> > I forgot to ask: what about targets that only support the
> > [base + offset] addressing mode, like RISC-V?  I would expect they are
> > not affected at all.
> >
>
> addr_offset_valid_p is also used in this patch, as Richard S. and
> Segher suggested, to check that the offset after unrolling (like
> offset + (uf - 1) * step) is still valid for the target.  If the
> address-based IV cannot work for [base + index + offset], the patch
> won't affect [base + index + offset].
>
> It can help all targets which support [base + offset], so I'm afraid it
> can affect RISC-V too, but I would expect the effect to be positive.
> Or do you happen to see some potential issues, or have some concerns?
>
> And as Richard S. suggested before, there is one parameter to control
> this, so a target can simply disable it if undesired.
>
> >>
> >>>
> >>> Given the patch is not likely to do harm, because rtl loop unrolling
> >>> is (or was?) disabled by default, I am OK once the above comments
> >>> are addressed.
> >>>
> >>
> >> Yes, it needs explicit unrolling options, except for targets that
> >> want to enable it for some cases with specific loop_unroll_adjust
> >> checks.
> >>
> >>> I wonder if it's possible to get 10% of (all the loops which should
> >>> be unrolled) unrolled (conservatively) on gimple and enable that by
> >>> default at O3, rather than teaching ivopts to model a future pass
> >>> which is not likely to be used outside of benchmarks?
> >>>
> >>
> >> Yeah, it would be nice if the unrolling happened before IVOPTs and no
> >> future unrolling could throw it off.  PR[1] seems to have some
> >> discussion on gimple unrolling.
> > Thanks for directing me to the discussion.  I am on Wilco's side on
> > this problem.  IMHO, it might be useful to get small loops unrolled
> > (at O3 by default) by simply saving the induction variable stepping
> > and the exit condition check, which can be modeled on gimple.
> > Especially for RISC-V: it doesn't support index addressing modes,
> > which means there will be as many induction variables as there are
> > distinct arrays.  Also, interleaving after unrolling is not that
> > important; there's a high chance that small loops eligible for
> > interleaving are handled by the vectorizer already.
> >
>
> Good idea.  CC Jeff, since I think Jeff (Jiufu) has been working to see
> whether we can bring in a gimple unrolling pass.  Regardless of whether
> we have it, if RTL unrolling is there, I guess this patch set is still
> beneficial?  If so, I would take it separately.
>
> BR,
> Kewen
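As an editorial addendum for readers: the validity check discussed in the thread, that the offset reached after unrolling (offset + (uf - 1) * step) must still be a legal [base + offset] displacement, might look roughly like the sketch below.  The function name and the signed 16-bit range are illustrative assumptions only; GCC's actual addr_offset_valid_p queries the target's legitimate-address hooks rather than a fixed range.

```c
#include <assert.h>

/* Illustrative sketch, not GCC's addr_offset_valid_p: a candidate can
   keep reg_offset_p only if the largest displacement touched after
   unrolling by factor UF is still a valid [base + offset] offset.  The
   signed 16-bit range here mimics a typical displacement field (e.g. a
   D-form offset); real targets query their own ranges.  */
static int
offset_valid_after_unroll (long offset, long step, unsigned uf)
{
  long last = offset + (long) (uf - 1) * step;
  return last >= -32768 && last <= 32767;
}
```

This is why unrolling interacts with the reg_offset classification at all: an offset that is valid for the first body copy can walk out of the target's displacement range in the later replicated copies.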