Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
on 2021/5/17 4:55 PM, Richard Biener wrote:
> On Thu, May 13, 2021 at 9:04 AM Kewen.Lin wrote:
>>
>> Hi!
>>
> But in the end the vector code shouldn't end up worse than the
> scalar code with respect to IVs - the cases where it would should
> be already costed. So I wonder if you have specific examples
> where things go worse enough for the heuristic to trigger?
>

One typical case that I worked on to reuse this density check is the
function mat_times_vec of src file block_solver.fppized.f of SPEC2017
503.bwaves_r; the density with the existing heuristic is 83 (unlikely
to exceed the threshold). The interesting loop is the innermost one,
while the option set is "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".
We have verified that this loop isn't profitable to vectorize at
O2 (without loop-interchange).

>>>
>>> Yeah, but that's because the loop only runs 5 iterations, not because
>>> of some "density" (which suggests AGU overloading or some such)?
>>> Because if you modify it so it iterates more, then, keeping the
>>> "density" measurement constant, you suddenly become profitable?
>>>
>>
>> Yes, I agree this isn't a perfect example showing how the density check
>> matters, though it led me to find this check. I tried to run the SPEC2017
>> benchmarks w/ and w/o this density heuristic to catch some "expected"
>> case, but unluckily failed to. It may be worth trying some more option
>> sets or even testing with the previous SPECs later.
>>
>> I hacked the innermost loop iteration count from 5 to 20, but the
>> baseline run didn't stop (after more than 7 hrs I killed it); it was
>> suspected to become endless because of some garbage (out-of-bounds) data.
>>
>> But the current cost modeling for this loop on Power is still bad: the
>> minimum profitable iteration counts (both static and runtime) are
>> evaluated as 2, while reality shows that even 5 iterations isn't
>> profitable.
>>
>>> The loop does have quite many memory streams, so optimizing
>>> the (few) arithmetic ops by vectorizing them might not be worth
>>> the trouble, esp. since most of the loads are "strided" (composed
>>> from scalars) when no interchange is performed. So it's probably
>>> more a "density" of # memory streams vs. # arithmetic ops, and
>>> esp. with any non-consecutive vector loads this balance being
>>> worse in the vector case?
>>>
>>
>> Yeah, these many scalar "strided" loads make things worse. The fed
>> vector CTORs have to wait until all of their required loads are ready,
>> and these vector CTORs are required by further multiplications.
>>
>> I posted one patch[1] on this, which tries to model it with
>> some counts: nload (total load number), nload_ctor (number of strided
>> loads fed into CTORs) and nctor_strided (number of CTORs fed by
>> strided loads).
>>
>> The penalization is restricted by considering some factors:
>>   1) vector density ratio: if there are many vector instructions,
>>      the stalls from loads easily impact the subsequent computation.
>>   2) total load number: if nload is small, it's unlikely to
>>      bother the load/store units much.
>>   3) percentage of strided loads fed into CTORs: if a high portion
>>      of the strided loads feed into CTORs, they are very likely to
>>      block the CTORs and their subsequent chains.
>>
>> btw, per your previous comments on add_stmt_cost, the load/strided/ctor
>> statistics should be gathered there instead, like:
>>
>>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>     {
>>       if (kind == scalar_load || kind == vector_load
>>           || kind == unaligned_load || kind == vector_gather_load)
>>         data->nload += count;
>>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>>         {
>>           if (kind == scalar_load || kind == unaligned_load)
>>             data->nload_ctor += count;
>>           else if (kind == vec_construct)
>>             data->nctor_strided += count;
>>         }
>>     }
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>>
>>> The x86 add_stmt_cost has
>>>
>>>   /* If we do elementwise loads into a vector then we are bound by
>>>      latency and execution resources for the many scalar loads
>>>      (AGU and load ports). Try to account for this by scaling the
>>>      construction cost by the number of elements involved. */
>>>   if ((kind == vec_construct || kind == vec_to_scalar)
>>>       && stmt_info
>>>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>>>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>>>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>>>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
>>>     {
>>>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>>>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>>>     }
>>>
>>> so it penalizes VMAT_ELEMENTWISE for variable step for both loads and
>>> stores.
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
On Thu, May 13, 2021 at 9:04 AM Kewen.Lin wrote:
>
> Hi!
>
> >>> But in the end the vector code shouldn't end up worse than the
> >>> scalar code with respect to IVs - the cases where it would should
> >>> be already costed. So I wonder if you have specific examples
> >>> where things go worse enough for the heuristic to trigger?
> >>>
> >>
> >> One typical case that I worked on to reuse this density check is the
> >> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
> >> 503.bwaves_r; the density with the existing heuristic is 83 (unlikely
> >> to exceed the threshold). The interesting loop is the innermost one,
> >> while the option set is "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".
> >> We have verified that this loop isn't profitable to vectorize at
> >> O2 (without loop-interchange).
> >
> > Yeah, but that's because the loop only runs 5 iterations, not because
> > of some "density" (which suggests AGU overloading or some such)?
> > Because if you modify it so it iterates more, then, keeping the
> > "density" measurement constant, you suddenly become profitable?
> >
>
> Yes, I agree this isn't a perfect example showing how the density check
> matters, though it led me to find this check. I tried to run the SPEC2017
> benchmarks w/ and w/o this density heuristic to catch some "expected"
> case, but unluckily failed to. It may be worth trying some more option
> sets or even testing with the previous SPECs later.
>
> I hacked the innermost loop iteration count from 5 to 20, but the
> baseline run didn't stop (after more than 7 hrs I killed it); it was
> suspected to become endless because of some garbage (out-of-bounds) data.
>
> But the current cost modeling for this loop on Power is still bad: the
> minimum profitable iteration counts (both static and runtime) are
> evaluated as 2, while reality shows that even 5 iterations isn't
> profitable.
>
> > The loop does have quite many memory streams, so optimizing
> > the (few) arithmetic ops by vectorizing them might not be worth
> > the trouble, esp. since most of the loads are "strided" (composed
> > from scalars) when no interchange is performed. So it's probably
> > more a "density" of # memory streams vs. # arithmetic ops, and
> > esp. with any non-consecutive vector loads this balance being
> > worse in the vector case?
> >
>
> Yeah, these many scalar "strided" loads make things worse. The fed
> vector CTORs have to wait until all of their required loads are ready,
> and these vector CTORs are required by further multiplications.
>
> I posted one patch[1] on this, which tries to model it with
> some counts: nload (total load number), nload_ctor (number of strided
> loads fed into CTORs) and nctor_strided (number of CTORs fed by
> strided loads).
>
> The penalization is restricted by considering some factors:
>   1) vector density ratio: if there are many vector instructions,
>      the stalls from loads easily impact the subsequent computation.
>   2) total load number: if nload is small, it's unlikely to
>      bother the load/store units much.
>   3) percentage of strided loads fed into CTORs: if a high portion
>      of the strided loads feed into CTORs, they are very likely to
>      block the CTORs and their subsequent chains.
>
> btw, per your previous comments on add_stmt_cost, the load/strided/ctor
> statistics should be gathered there instead, like:
>
>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>     {
>       if (kind == scalar_load || kind == vector_load
>           || kind == unaligned_load || kind == vector_gather_load)
>         data->nload += count;
>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>         {
>           if (kind == scalar_load || kind == unaligned_load)
>             data->nload_ctor += count;
>           else if (kind == vec_construct)
>             data->nctor_strided += count;
>         }
>     }
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>
> > The x86 add_stmt_cost has
> >
> >   /* If we do elementwise loads into a vector then we are bound by
> >      latency and execution resources for the many scalar loads
> >      (AGU and load ports). Try to account for this by scaling the
> >      construction cost by the number of elements involved. */
> >   if ((kind == vec_construct || kind == vec_to_scalar)
> >       && stmt_info
> >       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> >           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
> >       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> >       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
> >     {
> >       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> >     }
> >
> > so it penalizes VMAT_ELEMENTWISE for variable step for both loads and
> > stores.
> > The above materialized over PRs 84037, 85491 and 87561, so not
> > specifically for the bwaves case.
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
Hi!

>>> But in the end the vector code shouldn't end up worse than the
>>> scalar code with respect to IVs - the cases where it would should
>>> be already costed. So I wonder if you have specific examples
>>> where things go worse enough for the heuristic to trigger?
>>>
>>
>> One typical case that I worked on to reuse this density check is the
>> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
>> 503.bwaves_r; the density with the existing heuristic is 83 (unlikely
>> to exceed the threshold). The interesting loop is the innermost one,
>> while the option set is "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".
>> We have verified that this loop isn't profitable to vectorize at
>> O2 (without loop-interchange).
>
> Yeah, but that's because the loop only runs 5 iterations, not because
> of some "density" (which suggests AGU overloading or some such)?
> Because if you modify it so it iterates more, then, keeping the
> "density" measurement constant, you suddenly become profitable?
>

Yes, I agree this isn't a perfect example showing how the density check
matters, though it led me to find this check. I tried to run the SPEC2017
benchmarks w/ and w/o this density heuristic to catch some "expected"
case, but unluckily failed to. It may be worth trying some more option
sets or even testing with the previous SPECs later.

I hacked the innermost loop iteration count from 5 to 20, but the
baseline run didn't stop (after more than 7 hrs I killed it); it was
suspected to become endless because of some garbage (out-of-bounds) data.

But the current cost modeling for this loop on Power is still bad: the
minimum profitable iteration counts (both static and runtime) are
evaluated as 2, while reality shows that even 5 iterations isn't
profitable.

> The loop does have quite many memory streams, so optimizing
> the (few) arithmetic ops by vectorizing them might not be worth
> the trouble, esp. since most of the loads are "strided" (composed
> from scalars) when no interchange is performed. So it's probably
> more a "density" of # memory streams vs. # arithmetic ops, and
> esp. with any non-consecutive vector loads this balance being
> worse in the vector case?
>

Yeah, these many scalar "strided" loads make things worse. The fed
vector CTORs have to wait until all of their required loads are ready,
and these vector CTORs are required by further multiplications.

I posted one patch[1] on this, which tries to model it with
some counts: nload (total load number), nload_ctor (number of strided
loads fed into CTORs) and nctor_strided (number of CTORs fed by
strided loads).

The penalization is restricted by considering some factors:
  1) vector density ratio: if there are many vector instructions,
     the stalls from loads easily impact the subsequent computation.
  2) total load number: if nload is small, it's unlikely to
     bother the load/store units much.
  3) percentage of strided loads fed into CTORs: if a high portion
     of the strided loads feed into CTORs, they are very likely to
     block the CTORs and their subsequent chains.

btw, per your previous comments on add_stmt_cost, the load/strided/ctor
statistics should be gathered there instead, like:

  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
    {
      if (kind == scalar_load || kind == vector_load
          || kind == unaligned_load || kind == vector_gather_load)
        data->nload += count;
      if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
        {
          if (kind == scalar_load || kind == unaligned_load)
            data->nload_ctor += count;
          else if (kind == vec_construct)
            data->nctor_strided += count;
        }
    }

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html

> The x86 add_stmt_cost has
>
>   /* If we do elementwise loads into a vector then we are bound by
>      latency and execution resources for the many scalar loads
>      (AGU and load ports). Try to account for this by scaling the
>      construction cost by the number of elements involved. */
>   if ((kind == vec_construct || kind == vec_to_scalar)
>       && stmt_info
>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
>     {
>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>     }
>
> so it penalizes VMAT_ELEMENTWISE for variable step for both loads and
> stores.
> The above materialized over PRs 84037, 85491 and 87561, so not
> specifically for the bwaves case. IIRC on x86 bwaves at -O2 is slower
> with vectorization as well.
>

Thanks for the pointer! rs6000 probably can follow this way instead.
IIUC, this cost adjustment is for each individual
vec_construct/vec_to_scalar; is it better to use the way
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
On Tue, May 11, 2021 at 12:50 PM Kewen.Lin wrote:
>
> Hi Richi,
>
> > OTOH we already pass scalar_stmt to individual add_stmt_cost,
> > so not sure whether the context really matters. That said,
> > the density test looks "interesting" ... the intent was that finish_cost
> > might look at gathered data from add_stmt, not that it looks at
> > the GIMPLE IL ... so why are you not counting vector_stmt vs.
> > scalar_stmt entries in vect_body and using that for this metric?
> >
>
> Good to know the intention behind finish_cost, thanks!
>
> I'm afraid that the check on vector_stmt and scalar_stmt entries
> from add_stmt_cost doesn't work for the density test here. The
> density test focuses on the vector version itself; there are some
> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
> they won't be passed down when costing either version. But the
> existing density check would like to know the cost of the
> non-vectorized part. The current implementation does:
>
>   vec_cost = data->cost[vect_body]
>
>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>     not_vec_cost++
>
>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>
> it takes those irrelevant stmts into account, and then, having both
> the cost of the non-vectorized part (not_vec_cost) and the vectorized
> part (cost[vect_body]), it can calculate the vectorized code density
> ratio.
> >>>
> >>> Yes, but then what "relevant" stmts are actually needed and what
> >>> not is missed by your heuristics. It's really some GIGO one
> >>> I fear - each vectorized data reference will add a pointer IV
> >>> (eventually commoned by IVOPTs later) and pointer value updates
> >>> that are not accounted for in costing (the IVs and updates in the
> >>> scalar code are marked as not relevant). Are those the stmts
> >>> this heuristic wants to look at?
> >>
> >> Yes, the IVs and updates (even the comparison for exit) are what
> >> the heuristic tries to count. In most cases, the non-vectorized
> >> part in the loop consists of IV updates. And it's true that the
> >> collected not_vec_cost could be inaccurate, but it seems hard
> >> to predict the cost exactly here?
> >>
> >> Assuming this not_vec_cost is over-priced, it could result
> >> in a lower density ratio than what it should be. Also assuming
> >> the density threshold is relatively conservative, in this case
> >> if the ratio still exceeds the density threshold, we can say the
> >> loop is really dense. It could miss catching some "dense" loops,
> >> but I hope it won't take "non-dense" loops as "dense" unexpectedly.
> >
> > So we could in principle include IVs and updates in the costing but
> > then the vectorizer isn't absolutely careful about doing scalar cleanups
> > and instead expects IVOPTs to create canonical IVs. Note for
> > the scalar part those stmts are not costed either; we'd have to
> > change that as well. What this would mean is that for a scalar
> > loop accessing a[i] and b[i] we'd have one original IV + update,
> > while the vectorizer generates two pointer IVs + updates.
> >
>
> I broke down my understanding a bit below to ensure it's correct.
>
> - We can pass down those "irrelevant" stmts into add_stmt_cost
>   for both scalar and vector versions; then targets can check
>   stmts accordingly instead of scanning the IL by themselves.
>   For the scalar version, these are mainly the original IV + update
>   + some address ref calculation, while for the vector version,
>   these are mainly pointer IVs + updates.
>
> - What's the cost assigned to these "irrelevant" stmts?
>   The comments seem to imply we want to cost them. If so,
>   I am worried that it can break some current costing
>   heuristics which don't consider these costs. Besides,
>   these "irrelevant" stmts can be optimized later; if we
>   consider them somewhere like calculating the profitable
>   minimum iteration count, it could result in worse code.
>   Can we pass them down but cost them freely?
>
> > But in the end the vector code shouldn't end up worse than the
> > scalar code with respect to IVs - the cases where it would should
> > be already costed. So I wonder if you have specific examples
> > where things go worse enough for the heuristic to trigger?
> >
>
> One typical case that I worked on to reuse this density check is the
> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
> 503.bwaves_r; the density with the existing heuristic is 83 (unlikely
> to exceed the threshold). The interesting loop is the innermost one,
> while the option set is "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".
> We have verified that this loop isn't profitable to vectorize at
> O2 (without loop-interchange).

Yeah, but that's because the loop only runs 5 iterations, not because
of some "density" (which suggests AGU overloading or some such)?
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
Hi Richi,

> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> so not sure whether the context really matters. That said,
> the density test looks "interesting" ... the intent was that finish_cost
> might look at gathered data from add_stmt, not that it looks at
> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> scalar_stmt entries in vect_body and using that for this metric?
>

Good to know the intention behind finish_cost, thanks!

I'm afraid that the check on vector_stmt and scalar_stmt entries
from add_stmt_cost doesn't work for the density test here. The
density test focuses on the vector version itself; there are some
stmts whose relevance is marked as vect_unused_in_scope, and IIUC
they won't be passed down when costing either version. But the
existing density check would like to know the cost of the
non-vectorized part. The current implementation does:

  vec_cost = data->cost[vect_body]

  if (!STMT_VINFO_RELEVANT_P (stmt_info)
      && !STMT_VINFO_IN_PATTERN_P (stmt_info))
    not_vec_cost++

  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

it takes those irrelevant stmts into account, and then, having both
the cost of the non-vectorized part (not_vec_cost) and the vectorized
part (cost[vect_body]), it can calculate the vectorized code density
ratio.

>>> Yes, but then what "relevant" stmts are actually needed and what
>>> not is missed by your heuristics. It's really some GIGO one
>>> I fear - each vectorized data reference will add a pointer IV
>>> (eventually commoned by IVOPTs later) and pointer value updates
>>> that are not accounted for in costing (the IVs and updates in the
>>> scalar code are marked as not relevant). Are those the stmts
>>> this heuristic wants to look at?
>>
>> Yes, the IVs and updates (even the comparison for exit) are what
>> the heuristic tries to count. In most cases, the non-vectorized
>> part in the loop consists of IV updates. And it's true that the
>> collected not_vec_cost could be inaccurate, but it seems hard
>> to predict the cost exactly here?
>>
>> Assuming this not_vec_cost is over-priced, it could result
>> in a lower density ratio than what it should be. Also assuming
>> the density threshold is relatively conservative, in this case
>> if the ratio still exceeds the density threshold, we can say the
>> loop is really dense. It could miss catching some "dense" loops,
>> but I hope it won't take "non-dense" loops as "dense" unexpectedly.
>
> So we could in principle include IVs and updates in the costing but
> then the vectorizer isn't absolutely careful about doing scalar cleanups
> and instead expects IVOPTs to create canonical IVs. Note for
> the scalar part those stmts are not costed either; we'd have to
> change that as well. What this would mean is that for a scalar
> loop accessing a[i] and b[i] we'd have one original IV + update,
> while the vectorizer generates two pointer IVs + updates.
>

I broke down my understanding a bit below to ensure it's correct.

- We can pass down those "irrelevant" stmts into add_stmt_cost
  for both scalar and vector versions; then targets can check
  stmts accordingly instead of scanning the IL by themselves.
  For the scalar version, these are mainly the original IV + update
  + some address ref calculation, while for the vector version,
  these are mainly pointer IVs + updates.

- What's the cost assigned to these "irrelevant" stmts?
  The comments seem to imply we want to cost them. If so,
  I am worried that it can break some current costing
  heuristics which don't consider these costs. Besides,
  these "irrelevant" stmts can be optimized later; if we
  consider them somewhere like calculating the profitable
  minimum iteration count, it could result in worse code.
  Can we pass them down but cost them freely?

> But in the end the vector code shouldn't end up worse than the
> scalar code with respect to IVs - the cases where it would should
> be already costed. So I wonder if you have specific examples
> where things go worse enough for the heuristic to trigger?
>

One typical case that I worked on to reuse this density check is the
function mat_times_vec of src file block_solver.fppized.f of SPEC2017
503.bwaves_r; the density with the existing heuristic is 83 (unlikely
to exceed the threshold). The interesting loop is the innermost one,
while the option set is "-O2 -mcpu=power8 -ffast-math -ftree-vectorize".
We have verified that this loop isn't profitable to vectorize at
O2 (without loop-interchange).

Another function, shell, which also comes from the 503.bwaves_r src
file shell_lam.fppized.f, does hit this threshold; the loop is the one
starting at line 228.

BR,
Kewen
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
On Tue, May 11, 2021 at 9:10 AM Kewen.Lin wrote:
>
> Hi Richi,
>
> on 2021/5/10 9:55 PM, Richard Biener wrote:
> > On Sat, May 8, 2021 at 10:05 AM Kewen.Lin wrote:
> >>
> >> Hi Richi,
> >>
> >> Thanks for the comments!
> >>
> >> on 2021/5/7 5:43 PM, Richard Biener wrote:
> >>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
> >>> wrote:
>
> Hi,
>
> When I was investigating the density_test heuristics, I noticed that
> the current rs6000_density_test could be used for single scalar
> iteration cost calculation, through the call trace:
>   vect_compute_single_scalar_iteration_cost
>     -> rs6000_finish_cost
>       -> rs6000_density_test
>
> It looks unexpected per its descriptive function comments, and Bill
> helped to confirm this needs to be fixed (thanks!).
>
> So this patch checks the passed data: if it's the same as the one in
> loop_vinfo, it indicates it's working on the vector version cost
> calculation; otherwise, just return early.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>
> Nothing remarkable was observed with a SPEC2017 Power9 full run.
>
> Is it ok for trunk?
> >>>
> >>> +  /* Only care about cost of vector version, so exclude scalar version here.  */
> >>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
> >>> +    return;
> >>>
> >>> Hmm, looks like a quite "random" test to me. What about adding a
> >>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
> >>>
> >>
> >> I originally wanted to change the hook interface, but noticed that
> >> the finish_cost in function vect_estimate_min_profitable_iters is
> >> the only invocation with LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> >> it looks enough to differentiate the scalar version costing from the
> >> vector version costing for a loop. Do you mean this observation/
> >> assumption is easy to be broken sometime later?
> >
> > Yes, this field is likely to become stale.
> >
> >>
> >> The attached patch adds one new parameter to indicate the
> >> costing kind explicitly, as you suggested.
> >>
> >> Does it look better?
> >>
> >> gcc/ChangeLog:
> >>
> >>     * doc/tm.texi: Regenerated.
> >>     * target.def (init_cost): Add new parameter costing_for_scalar.
> >>     * targhooks.c (default_init_cost): Adjust for new parameter.
> >>     * targhooks.h (default_init_cost): Likewise.
> >>     * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
> >>     (vect_compute_single_scalar_iteration_cost): Likewise.
> >>     (vect_analyze_loop_2): Likewise.
> >>     * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
> >>     (vect_bb_vectorization_profitable_p): Likewise.
> >>     * tree-vectorizer.h (init_cost): Likewise.
> >>     * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
> >>     * config/i386/i386.c (ix86_init_cost): Likewise.
> >>     * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
> >>
> >>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> >>> so not sure whether the context really matters. That said,
> >>> the density test looks "interesting" ... the intent was that finish_cost
> >>> might look at gathered data from add_stmt, not that it looks at
> >>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> >>> scalar_stmt entries in vect_body and using that for this metric?
> >>>
> >>
> >> Good to know the intention behind finish_cost, thanks!
> >>
> >> I'm afraid that the check on vector_stmt and scalar_stmt entries
> >> from add_stmt_cost doesn't work for the density test here. The
> >> density test focuses on the vector version itself; there are some
> >> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
> >> they won't be passed down when costing either version. But the
> >> existing density check would like to know the cost of the
> >> non-vectorized part. The current implementation does:
> >>
> >>   vec_cost = data->cost[vect_body]
> >>
> >>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
> >>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
> >>     not_vec_cost++
> >>
> >>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
> >>
> >> it takes those irrelevant stmts into account, and then, having both
> >> the cost of the non-vectorized part (not_vec_cost) and the vectorized
> >> part (cost[vect_body]), it can calculate the vectorized code density
> >> ratio.

Yes, but then what "relevant" stmts are actually needed and what
not is missed by your heuristics. It's really some GIGO one
I fear - each vectorized data reference will add a pointer IV
(eventually commoned by IVOPTs later) and pointer value updates
that are not accounted for in costing (the IVs and updates in the
scalar code are marked as not relevant). Are those the stmts
this heuristic wants to look at?
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
Hi Richard,

on 2021/5/10 10:08 PM, Richard Sandiford wrote:
> "Kewen.Lin via Gcc-patches" writes:
>> on 2021/5/7 5:43 PM, Richard Biener wrote:
>>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
>>> wrote:

Hi,

When I was investigating the density_test heuristics, I noticed that
the current rs6000_density_test could be used for single scalar
iteration cost calculation, through the call trace:
  vect_compute_single_scalar_iteration_cost
    -> rs6000_finish_cost
      -> rs6000_density_test

It looks unexpected per its descriptive function comments, and Bill
helped to confirm this needs to be fixed (thanks!).

So this patch checks the passed data: if it's the same as the one in
loop_vinfo, it indicates it's working on the vector version cost
calculation; otherwise, just return early.

Bootstrapped/regtested on powerpc64le-linux-gnu P9.

Nothing remarkable was observed with a SPEC2017 Power9 full run.

Is it ok for trunk?

>>> +  /* Only care about cost of vector version, so exclude scalar version here.  */
>>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>>> +    return;
>>>
>>> Hmm, looks like a quite "random" test to me. What about adding a
>>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>>>
>>
>> I originally wanted to change the hook interface, but noticed that
>> the finish_cost in function vect_estimate_min_profitable_iters is
>> the only invocation with LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>> it looks enough to differentiate the scalar version costing from the
>> vector version costing for a loop. Do you mean this observation/
>> assumption is easy to be broken sometime later?
>>
>> The attached patch adds one new parameter to indicate the
>> costing kind explicitly, as you suggested.
>>
>> Does it look better?
>>
>> gcc/ChangeLog:
>>
>>     * doc/tm.texi: Regenerated.
>>     * target.def (init_cost): Add new parameter costing_for_scalar.
>>     * targhooks.c (default_init_cost): Adjust for new parameter.
>>     * targhooks.h (default_init_cost): Likewise.
>>     * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>>     (vect_compute_single_scalar_iteration_cost): Likewise.
>>     (vect_analyze_loop_2): Likewise.
>>     * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>>     (vect_bb_vectorization_profitable_p): Likewise.
>>     * tree-vectorizer.h (init_cost): Likewise.
>>     * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>>     * config/i386/i386.c (ix86_init_cost): Likewise.
>>     * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
>
> Just wanted to say thanks for doing this. I hit the same problem
> when doing the Neoverse V1 tuning near the end of stage 4. Due to
> the extreme lateness of the changes, I couldn't reasonably ask for
> target-independent help at that time, but this patch will make
> things simpler for AArch64. :-)
>

Glad to see that the rs6000 port isn't the only port requesting this. :-)
Thanks for the information!

BR,
Kewen
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
Hi Richi,

on 2021/5/10 9:55 PM, Richard Biener wrote:
> On Sat, May 8, 2021 at 10:05 AM Kewen.Lin wrote:
>>
>> Hi Richi,
>>
>> Thanks for the comments!
>>
>> on 2021/5/7 5:43 PM, Richard Biener wrote:
>>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I was investigating the density_test heuristics, I noticed that
>>>> the current rs6000_density_test could be used for the single scalar
>>>> iteration cost calculation, through the call trace:
>>>>   vect_compute_single_scalar_iteration_cost
>>>>     -> rs6000_finish_cost
>>>>       -> rs6000_density_test
>>>>
>>>> This looks unexpected given its descriptive function comments, and
>>>> Bill helped to confirm this needs to be fixed (thanks!).
>>>>
>>>> So this patch checks the passed data: if it's the same as the one in
>>>> loop_vinfo, it indicates we're working on the vector version cost
>>>> calculation; otherwise we just return early.
>>>>
>>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>>
>>>> Nothing remarkable was observed with a SPEC2017 Power9 full run.
>>>>
>>>> Is it ok for trunk?
>>>
>>> +  /* Only care about cost of vector version, so exclude scalar
>>> version here.  */
>>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>>> +    return;
>>>
>>> Hmm, looks like a quite "random" test to me.  What about adding a
>>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>>>
>>
>> I originally wanted to change the hook interface, but noticed that
>> the finish_cost in function vect_estimate_min_profitable_iters is
>> the only invocation with LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
>> so it looked enough to differentiate the scalar version costing from
>> the vector version costing for a loop.  Do you mean this observation/
>> assumption is easy to break sometime later?
>
> Yes, this field is likely to become stale.
>
>>
>> The attached patch adds one new parameter to indicate the costing
>> kind explicitly, as you suggested.
>>
>> Does it look better?
>>
>> gcc/ChangeLog:
>>
>>         * doc/tm.texi: Regenerated.
>>         * target.def (init_cost): Add new parameter costing_for_scalar.
>>         * targhooks.c (default_init_cost): Adjust for new parameter.
>>         * targhooks.h (default_init_cost): Likewise.
>>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>>         (vect_compute_single_scalar_iteration_cost): Likewise.
>>         (vect_analyze_loop_2): Likewise.
>>         * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>>         (vect_bb_vectorization_profitable_p): Likewise.
>>         * tree-vectorizer.h (init_cost): Likewise.
>>         * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>>         * config/i386/i386.c (ix86_init_cost): Likewise.
>>         * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
>>
>>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
>>> so not sure whether the context really matters.  That said,
>>> the density test looks "interesting" ... the intent was that finish_cost
>>> might look at gathered data from add_stmt, not that it looks at
>>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
>>> scalar_stmt entries in vect_body and using that for this metric?
>>>
>>
>> Good to know the intention behind finish_cost, thanks!
>>
>> I'm afraid that checking the vector_stmt and scalar_stmt entries
>> from add_stmt_cost doesn't work for the density test here.  The
>> density test focuses on the vector version itself; there are some
>> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
>> they won't be passed down when costing either version.  But the
>> existing density check would like to know the cost of the
>> non-vectorized part.  The current implementation does:
>>
>>   vec_cost = data->cost[vect_body]
>>
>>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>>     not_vec_cost++
>>
>>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>>
>> It takes those irrelevant stmts into account, and then, having both
>> the cost of the non-vectorized part (not_vec_cost) and of the
>> vectorized part (cost[vect_body]), it can calculate the vectorized
>> code density ratio.
>
> Yes, but then what "relevant" stmts are actually needed and what
> not is missed by your heuristics.  It's really some GIGO one
> I fear - each vectorized data reference will add a pointer IV
> (eventually commoned by IVOPTs later) and pointer value updates
> that are not accounted for in costing (the IVs and updates in the
> scalar code are marked as not relevant).  Are those the stmts
> this heuristic wants to look at?

Yes, the IVs and updates (even the comparison for exit) are what the heuristic tries to count.  In most cases, the non-vectorized part of the loop is IV updates.  And it's so true that the collected not_vec_cost could be inaccurate, but it seems hard to predict the cost exactly here?  Assuming this not_vect
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
"Kewen.Lin via Gcc-patches" writes:
> on 2021/5/7 5:43 PM, Richard Biener wrote:
>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches wrote:
>>>
>>> Hi,
>>>
>>> When I was investigating the density_test heuristics, I noticed that
>>> the current rs6000_density_test could be used for the single scalar
>>> iteration cost calculation, through the call trace:
>>>   vect_compute_single_scalar_iteration_cost
>>>     -> rs6000_finish_cost
>>>       -> rs6000_density_test
>>>
>>> This looks unexpected given its descriptive function comments, and
>>> Bill helped to confirm this needs to be fixed (thanks!).
>>>
>>> So this patch checks the passed data: if it's the same as the one in
>>> loop_vinfo, it indicates we're working on the vector version cost
>>> calculation; otherwise we just return early.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>
>>> Nothing remarkable was observed with a SPEC2017 Power9 full run.
>>>
>>> Is it ok for trunk?
>>
>> +  /* Only care about cost of vector version, so exclude scalar
>> version here.  */
>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>> +    return;
>>
>> Hmm, looks like a quite "random" test to me.  What about adding a
>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>>
>
> I originally wanted to change the hook interface, but noticed that
> the finish_cost in function vect_estimate_min_profitable_iters is
> the only invocation with LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
> so it looked enough to differentiate the scalar version costing from
> the vector version costing for a loop.  Do you mean this observation/
> assumption is easy to break sometime later?
>
> The attached patch adds one new parameter to indicate the costing
> kind explicitly, as you suggested.
>
> Does it look better?
>
> gcc/ChangeLog:
>
>         * doc/tm.texi: Regenerated.
>         * target.def (init_cost): Add new parameter costing_for_scalar.
>         * targhooks.c (default_init_cost): Adjust for new parameter.
>         * targhooks.h (default_init_cost): Likewise.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>         (vect_compute_single_scalar_iteration_cost): Likewise.
>         (vect_analyze_loop_2): Likewise.
>         * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>         (vect_bb_vectorization_profitable_p): Likewise.
>         * tree-vectorizer.h (init_cost): Likewise.
>         * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>         * config/i386/i386.c (ix86_init_cost): Likewise.
>         * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.

Just wanted to say thanks for doing this.  I hit the same problem when
doing the Neoverse V1 tuning near the end of stage 4.  Due to the
extreme lateness of the changes, I couldn't reasonably ask for
target-independent help at that time, but this patch will make things
simpler for AArch64. :-)

Richard
Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
On Sat, May 8, 2021 at 10:05 AM Kewen.Lin wrote:
>
> Hi Richi,
>
> Thanks for the comments!
>
> on 2021/5/7 5:43 PM, Richard Biener wrote:
>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches wrote:
>>>
>>> Hi,
>>>
>>> When I was investigating the density_test heuristics, I noticed that
>>> the current rs6000_density_test could be used for the single scalar
>>> iteration cost calculation, through the call trace:
>>>   vect_compute_single_scalar_iteration_cost
>>>     -> rs6000_finish_cost
>>>       -> rs6000_density_test
>>>
>>> This looks unexpected given its descriptive function comments, and
>>> Bill helped to confirm this needs to be fixed (thanks!).
>>>
>>> So this patch checks the passed data: if it's the same as the one in
>>> loop_vinfo, it indicates we're working on the vector version cost
>>> calculation; otherwise we just return early.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>
>>> Nothing remarkable was observed with a SPEC2017 Power9 full run.
>>>
>>> Is it ok for trunk?
>>
>> +  /* Only care about cost of vector version, so exclude scalar
>> version here.  */
>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>> +    return;
>>
>> Hmm, looks like a quite "random" test to me.  What about adding a
>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>>
>
> I originally wanted to change the hook interface, but noticed that
> the finish_cost in function vect_estimate_min_profitable_iters is
> the only invocation with LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
> so it looked enough to differentiate the scalar version costing from
> the vector version costing for a loop.  Do you mean this observation/
> assumption is easy to break sometime later?

Yes, this field is likely to become stale.

>
> The attached patch adds one new parameter to indicate the costing
> kind explicitly, as you suggested.
>
> Does it look better?
>
> gcc/ChangeLog:
>
>         * doc/tm.texi: Regenerated.
>         * target.def (init_cost): Add new parameter costing_for_scalar.
>         * targhooks.c (default_init_cost): Adjust for new parameter.
>         * targhooks.h (default_init_cost): Likewise.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>         (vect_compute_single_scalar_iteration_cost): Likewise.
>         (vect_analyze_loop_2): Likewise.
>         * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>         (vect_bb_vectorization_profitable_p): Likewise.
>         * tree-vectorizer.h (init_cost): Likewise.
>         * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>         * config/i386/i386.c (ix86_init_cost): Likewise.
>         * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
>
>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
>> so not sure whether the context really matters.  That said,
>> the density test looks "interesting" ... the intent was that finish_cost
>> might look at gathered data from add_stmt, not that it looks at
>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
>> scalar_stmt entries in vect_body and using that for this metric?
>>
>
> Good to know the intention behind finish_cost, thanks!
>
> I'm afraid that checking the vector_stmt and scalar_stmt entries
> from add_stmt_cost doesn't work for the density test here.  The
> density test focuses on the vector version itself; there are some
> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
> they won't be passed down when costing either version.  But the
> existing density check would like to know the cost of the
> non-vectorized part.  The current implementation does:
>
>   vec_cost = data->cost[vect_body]
>
>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>     not_vec_cost++
>
>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>
> It takes those irrelevant stmts into account, and then, having both
> the cost of the non-vectorized part (not_vec_cost) and of the
> vectorized part (cost[vect_body]), it can calculate the vectorized
> code density ratio.

Yes, but then what "relevant" stmts are actually needed and what
not is missed by your heuristics.  It's really some GIGO one
I fear - each vectorized data reference will add a pointer IV
(eventually commoned by IVOPTs later) and pointer value updates
that are not accounted for in costing (the IVs and updates in the
scalar code are marked as not relevant).  Are those the stmts
this heuristic wants to look at?

The patch looks OK btw.

Thanks,
Richard.

>
> BR,
> Kewen
[PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
Hi Richi,

Thanks for the comments!

on 2021/5/7 5:43 PM, Richard Biener wrote:
> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches wrote:
>>
>> Hi,
>>
>> When I was investigating the density_test heuristics, I noticed that
>> the current rs6000_density_test could be used for the single scalar
>> iteration cost calculation, through the call trace:
>>   vect_compute_single_scalar_iteration_cost
>>     -> rs6000_finish_cost
>>       -> rs6000_density_test
>>
>> This looks unexpected given its descriptive function comments, and
>> Bill helped to confirm this needs to be fixed (thanks!).
>>
>> So this patch checks the passed data: if it's the same as the one in
>> loop_vinfo, it indicates we're working on the vector version cost
>> calculation; otherwise we just return early.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Nothing remarkable was observed with a SPEC2017 Power9 full run.
>>
>> Is it ok for trunk?
>
> +  /* Only care about cost of vector version, so exclude scalar
> version here.  */
> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
> +    return;
>
> Hmm, looks like a quite "random" test to me.  What about adding a
> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>

I originally wanted to change the hook interface, but noticed that
the finish_cost in function vect_estimate_min_profitable_iters is
the only invocation with LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
so it looked enough to differentiate the scalar version costing from
the vector version costing for a loop.  Do you mean this observation/
assumption is easy to break sometime later?

The attached patch adds one new parameter to indicate the costing
kind explicitly, as you suggested.

Does it look better?

gcc/ChangeLog:

        * doc/tm.texi: Regenerated.
        * target.def (init_cost): Add new parameter costing_for_scalar.
        * targhooks.c (default_init_cost): Adjust for new parameter.
        * targhooks.h (default_init_cost): Likewise.
        * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
        (vect_compute_single_scalar_iteration_cost): Likewise.
        (vect_analyze_loop_2): Likewise.
        * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
        (vect_bb_vectorization_profitable_p): Likewise.
        * tree-vectorizer.h (init_cost): Likewise.
        * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
        * config/i386/i386.c (ix86_init_cost): Likewise.
        * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.

> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> so not sure whether the context really matters.  That said,
> the density test looks "interesting" ... the intent was that finish_cost
> might look at gathered data from add_stmt, not that it looks at
> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> scalar_stmt entries in vect_body and using that for this metric?
>

Good to know the intention behind finish_cost, thanks!

I'm afraid that checking the vector_stmt and scalar_stmt entries
from add_stmt_cost doesn't work for the density test here.  The
density test focuses on the vector version itself; there are some
stmts whose relevance is marked as vect_unused_in_scope, and IIUC
they won't be passed down when costing either version.  But the
existing density check would like to know the cost of the
non-vectorized part.  The current implementation does:

  vec_cost = data->cost[vect_body]

  if (!STMT_VINFO_RELEVANT_P (stmt_info)
      && !STMT_VINFO_IN_PATTERN_P (stmt_info))
    not_vec_cost++

  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

It takes those irrelevant stmts into account, and then, having both
the cost of the non-vectorized part (not_vec_cost) and of the
vectorized part (cost[vect_body]), it can calculate the vectorized
code density ratio.
BR,
Kewen

 gcc/config/aarch64/aarch64.c |  2 +-
 gcc/config/i386/i386.c       |  2 +-
 gcc/config/rs6000/rs6000.c   |  2 +-
 gcc/doc/tm.texi              |  4 ++--
 gcc/target.def               |  6 --
 gcc/targhooks.c              |  3 ++-
 gcc/targhooks.h              |  2 +-
 gcc/tree-vect-loop.c         |  6 +++---
 gcc/tree-vect-slp.c          |  8 +---
 gcc/tree-vectorizer.h        |  4 ++--
 10 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 12625a4bee3..de3c86d85fb 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14390,7 +14390,7 @@ struct aarch64_vector_costs
 /* Implement TARGET_VECTORIZE_INIT_COST.  */
 void *
-aarch64_init_cost (class loop *)
+aarch64_init_cost (class loop *, bool)
 {
   return new aarch64_vector_costs;
 }
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7c41302c75b..5078d94970a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3,7 +3,7 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
 /* Implemen