Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford > <richard.sandif...@arm.com> wrote: >> >> Richard Biener <richard.guent...@gmail.com> writes: >> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches >> > <gcc-patches@gcc.gnu.org> wrote: >> >> >> >> One of the things we want to do on AArch64 is compare vector loops >> >> side-by-side and pick the best one. For some targets, we want this >> >> to be based on issue rates as well as the usual latency-based costs >> >> (at least for loops with relatively high iteration counts). >> >> >> >> The current approach to doing this is: when costing vectorisation >> >> candidate A, try to guess what the other main candidate B will look >> >> like and adjust A's latency-based cost up or down based on the likely >> >> difference between A and B's issue rates. This effectively means >> >> that we try to cost parts of B at the same time as A, without actually >> >> being able to see B. >> > >> > I think the idea of the current costing is that compares are always >> > to the original scalar loop (with comparing the resulting magic >> > cost numbers) so that by means of transitivity comparing two >> > vector variants work as well. So I'm not sure where 'variant B' >> > comes into play here? >> >> E.g. A could be SVE and B could be Advanced SIMD, or A could be >> SVE with fully-packed vectors and B could be SVE with partially >> unpacked vectors. >> >> One motivating case is Neoverse V1, which is a 256-bit SVE target. >> There, scalar code, Advanced SIMD code and SVE code have different issue >> characteristics. SVE vector ops generally have the same latencies as >> the corresponding Advanced SIMD ops. Advanced SIMD ops generally >> have double the throughput of SVE ones, so that the overall bit-for-bit >> throughputs are roughly equal. 
However, there are differences due to >> predication, load/store handling, and so on, and those differences >> should be decisive in some cases. >> >> For SLP, latency-based costs are fine. But for loop costs, they hide >> too many details. What works best in practice, both for vector vs. >> vector and vector vs. scalar, is: >> >> (1) compare issue rates between two loop bodies (vector vs. vector >> or vector vs. scalar) >> (2) if issue rates are equal for a given number of scalar iterations, >> compare the latencies of the loop bodies, as we do now >> (3) if both the above are equal, compare the latencies of the prologue >> and epilogue, as we do now >> >> It's very difficult to combine latency and issue rate information >> into a single per-statement integer, in such a way that the integers >> can just be summed and compared. > > Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost > was that finish_cost would determine the issue rate cost part, for > example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B > then finish_cost would compute the number of cycles required to > issue all vector stmts. That yields sth like IPC the latency cost could > be divided by. Doing that for both ISA A and ISA B would produce > weighted latency values that can be compared? Alternatively > you can of course simulate issue with the actual instruction latencies > in mind and produce an overall iteration latency number. > > Comparing just latency or issue rate in isolation is likely not good > enough?
In practice, comparing issue rates in isolation does seem to be what we want as the first level of comparison. I don't think total latency / issue rate is really a meaningful combination. It makes sense to the extent that “lower latency == good” and “higher issue rate == good”, but I don't think it's the case that halving the total latency is as valuable as doubling the issue rate. In practice the latter is a significant win while the former might or might not be. total latency / issue rate also runs the risk of double-counting: reducing the number of ops decreases the total latency *and* increases the issue rate, which could have a quadratic effect. (This is why we don't decrease SVE costs when we think that the SVE code will issue more quickly than the Advanced SIMD code.) For a long-running loop, the only latencies that really matter are for loop-carried dependencies. The issue rate information takes those into account, in that it tracks reduction and (with later patches) induction limiters. Thanks, Richard > >> So returning to the above, when costing an SVE loop A, we currently >> collect on-the-side issue information about both A and the likely >> Advanced SIMD implementation B. If B would have a higher throughput >> than A then we multiply A's latency-based costs by: >> >> ceil(B throughput/A throughput) >> >> The problem is, we don't actually know whether B exists or what it >> would look like (given that Advanced SIMD and SVE have different >> features). >> >> In principle, we should do the same in reverse, but since SVE needs >> fewer ops than Advanced SIMD, the latencies are usually smaller already. >> >> We also do something similar for the scalar code. When costing both >> Advanced SIMD and SVE, we try to estimate the issue rate of the >> original scalar code. If the scalar code could issue more quickly >> than the equivalent vector code, we increase the latency-based cost >> of the vector code to account for that.
>> >> The idea of the patch (and others) is that we can do the (1), (2), >> (3) comparison above directly rather than indirectly. We can also >> use the issue information calculated while costing the scalar code, >> rather than having to estimate the scalar issue information while >> costing the vector code. >> >> >> This is needlessly indirect and complex. It was a compromise due >> >> to the code being added (too) late in the GCC 11 cycle, so that >> >> target-independent changes weren't possible. >> >> >> >> The target-independent code already compares two candidate loop_vec_infos >> >> side-by-side, so that information about A and B above are available >> >> directly. This patch creates a way for targets to hook into this >> >> comparison. >> > >> > You mean it has both loop_vec_infos. But as said I'm not sure it's a good >> > idea to compare those side-by-side - will that not lead to _more_ >> > special-casing >> > since you need to have a working compare to the scalar variant as well >> > to determine the vectorization threshold? >> >> A later patch allows the code to do the same comparison for vector >> vs. scalar. It makes the target code significantly simpler compared >> to now, since add_stmt_cost only needs to consider the latency and >> issue rate of the code it's actually supposed to be costing, rather than >> trying also to estimate the issue rate of the scalar code and the issue >> rate of the Advanced SIMD code. >> >> Thanks, >> Richard >> >> > >> >> >> >> The AArch64 code can therefore hook into better_main_loop_than_p to >> >> compare issue rates. If the issue rate comparison isn't decisive, >> >> the code can fall back to the normal latency-based comparison instead. >> >> >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? 
>> >> >> >> Richard >> >> >> >> >> >> gcc/ >> >> * tree-vectorizer.h (vector_costs::better_main_loop_than_p) >> >> (vector_costs::better_epilogue_loop_than_p) >> >> (vector_costs::compare_inside_loop_cost) >> >> (vector_costs::compare_outside_loop_cost): Likewise. >> >> * tree-vectorizer.c (vector_costs::better_main_loop_than_p) >> >> (vector_costs::better_epilogue_loop_than_p) >> >> (vector_costs::compare_inside_loop_cost) >> >> (vector_costs::compare_outside_loop_cost): New functions, >> >> containing code moved from... >> >> * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here. >> >> --- >> >> gcc/tree-vect-loop.c | 142 ++--------------------------- >> >> gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++ >> >> gcc/tree-vectorizer.h | 17 ++++ >> >> 3 files changed, 226 insertions(+), 137 deletions(-) >> >> >> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c >> >> index dd4a363fee5..c9ee2e15e35 100644 >> >> --- a/gcc/tree-vect-loop.c >> >> +++ b/gcc/tree-vect-loop.c >> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info >> >> new_loop_vinfo, >> >> return new_simdlen_p; >> >> } >> >> >> >> - loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo); >> >> - if (main_loop) >> >> - { >> >> - poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop); >> >> - unsigned HOST_WIDE_INT main_vf; >> >> - unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost; >> >> - /* If we can determine how many iterations are left for the >> >> epilogue >> >> - loop, that is if both the main loop's vectorization factor and >> >> number >> >> - of iterations are constant, then we use them to calculate the >> >> cost of >> >> - the epilogue loop together with a 'likely value' for the >> >> epilogues >> >> - vectorization factor. Otherwise we use the main loop's >> >> vectorization >> >> - factor and the maximum poly value for the epilogue's. 
If the >> >> target >> >> - has not provided with a sensible upper bound poly vectorization >> >> - factors are likely to be favored over constant ones. */ >> >> - if (main_poly_vf.is_constant (&main_vf) >> >> - && LOOP_VINFO_NITERS_KNOWN_P (main_loop)) >> >> - { >> >> - unsigned HOST_WIDE_INT niters >> >> - = LOOP_VINFO_INT_NITERS (main_loop) % main_vf; >> >> - HOST_WIDE_INT old_likely_vf >> >> - = estimated_poly_value (old_vf, POLY_VALUE_LIKELY); >> >> - HOST_WIDE_INT new_likely_vf >> >> - = estimated_poly_value (new_vf, POLY_VALUE_LIKELY); >> >> - >> >> - /* If the epilogue is using partial vectors we account for the >> >> - partial iteration here too. */ >> >> - old_factor = niters / old_likely_vf; >> >> - if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo) >> >> - && niters % old_likely_vf != 0) >> >> - old_factor++; >> >> - >> >> - new_factor = niters / new_likely_vf; >> >> - if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo) >> >> - && niters % new_likely_vf != 0) >> >> - new_factor++; >> >> - } >> >> - else >> >> - { >> >> - unsigned HOST_WIDE_INT main_vf_max >> >> - = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); >> >> - >> >> - old_factor = main_vf_max / estimated_poly_value (old_vf, >> >> - >> >> POLY_VALUE_MAX); >> >> - new_factor = main_vf_max / estimated_poly_value (new_vf, >> >> - >> >> POLY_VALUE_MAX); >> >> - >> >> - /* If the loop is not using partial vectors then it will >> >> iterate one >> >> - time less than one that does. It is safe to subtract one >> >> here, >> >> - because the main loop's vf is always at least 2x bigger than >> >> that >> >> - of an epilogue. */ >> >> - if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)) >> >> - old_factor -= 1; >> >> - if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)) >> >> - new_factor -= 1; >> >> - } >> >> - >> >> - /* Compute the costs by multiplying the inside costs with the >> >> factor and >> >> - add the outside costs for a more complete picture. 
The factor >> >> is the >> >> - amount of times we are expecting to iterate this epilogue. */ >> >> - old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor; >> >> - new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor; >> >> - old_cost += old_loop_vinfo->vector_costs->outside_cost (); >> >> - new_cost += new_loop_vinfo->vector_costs->outside_cost (); >> >> - return new_cost < old_cost; >> >> - } >> >> - >> >> - /* Limit the VFs to what is likely to be the maximum number of >> >> iterations, >> >> - to handle cases in which at least one loop_vinfo is fully-masked. >> >> */ >> >> - HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int >> >> (loop); >> >> - if (estimated_max_niter != -1) >> >> - { >> >> - if (known_le (estimated_max_niter, new_vf)) >> >> - new_vf = estimated_max_niter; >> >> - if (known_le (estimated_max_niter, old_vf)) >> >> - old_vf = estimated_max_niter; >> >> - } >> >> - >> >> - /* Check whether the (fractional) cost per scalar iteration is lower >> >> - or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf. >> >> */ >> >> - poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * >> >> old_vf; >> >> - poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * >> >> new_vf; >> >> - >> >> - HOST_WIDE_INT est_rel_new_min >> >> - = estimated_poly_value (rel_new, POLY_VALUE_MIN); >> >> - HOST_WIDE_INT est_rel_new_max >> >> - = estimated_poly_value (rel_new, POLY_VALUE_MAX); >> >> - >> >> - HOST_WIDE_INT est_rel_old_min >> >> - = estimated_poly_value (rel_old, POLY_VALUE_MIN); >> >> - HOST_WIDE_INT est_rel_old_max >> >> - = estimated_poly_value (rel_old, POLY_VALUE_MAX); >> >> - >> >> - /* Check first if we can make out an unambigous total order from the >> >> minimum >> >> - and maximum estimates. 
*/ >> >> - if (est_rel_new_min < est_rel_old_min >> >> - && est_rel_new_max < est_rel_old_max) >> >> - return true; >> >> - else if (est_rel_old_min < est_rel_new_min >> >> - && est_rel_old_max < est_rel_new_max) >> >> - return false; >> >> - /* When old_loop_vinfo uses a variable vectorization factor, >> >> - we know that it has a lower cost for at least one runtime VF. >> >> - However, we don't know how likely that VF is. >> >> - >> >> - One option would be to compare the costs for the estimated VFs. >> >> - The problem is that that can put too much pressure on the cost >> >> - model. E.g. if the estimated VF is also the lowest possible VF, >> >> - and if old_loop_vinfo is 1 unit worse than new_loop_vinfo >> >> - for the estimated VF, we'd then choose new_loop_vinfo even >> >> - though (a) new_loop_vinfo might not actually be better than >> >> - old_loop_vinfo for that VF and (b) it would be significantly >> >> - worse at larger VFs. >> >> - >> >> - Here we go for a hacky compromise: pick new_loop_vinfo if it is >> >> - no more expensive than old_loop_vinfo even after doubling the >> >> - estimated old_loop_vinfo VF. For all but trivial loops, this >> >> - ensures that we only pick new_loop_vinfo if it is significantly >> >> - better than old_loop_vinfo at the estimated VF. */ >> >> - >> >> - if (est_rel_old_min != est_rel_new_min >> >> - || est_rel_old_max != est_rel_new_max) >> >> - { >> >> - HOST_WIDE_INT est_rel_new_likely >> >> - = estimated_poly_value (rel_new, POLY_VALUE_LIKELY); >> >> - HOST_WIDE_INT est_rel_old_likely >> >> - = estimated_poly_value (rel_old, POLY_VALUE_LIKELY); >> >> - >> >> - return est_rel_new_likely * 2 <= est_rel_old_likely; >> >> - } >> >> - >> >> - /* If there's nothing to choose between the loop bodies, see whether >> >> - there's a difference in the prologue and epilogue costs. 
*/ >> >> - auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost (); >> >> - auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost (); >> >> - if (new_outside_cost != old_outside_cost) >> >> - return new_outside_cost < old_outside_cost; >> >> + const auto *old_costs = old_loop_vinfo->vector_costs; >> >> + const auto *new_costs = new_loop_vinfo->vector_costs; >> >> + if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO >> >> (old_loop_vinfo)) >> >> + return new_costs->better_epilogue_loop_than_p (old_costs, main_loop); >> >> >> >> - return false; >> >> + return new_costs->better_main_loop_than_p (old_costs); >> >> } >> >> >> >> /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO. Return >> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c >> >> index 9ef76ce654b..dcbb2a3f13a 100644 >> >> --- a/gcc/tree-vectorizer.c >> >> +++ b/gcc/tree-vectorizer.c >> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info >> >> stmt_info, >> >> } >> >> return cost; >> >> } >> >> + >> >> +/* See the comment above the declaration for details. */ >> >> + >> >> +bool >> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const >> >> +{ >> >> + int diff = compare_inside_loop_cost (other); >> >> + if (diff != 0) >> >> + return diff < 0; >> >> + >> >> + /* If there's nothing to choose between the loop bodies, see whether >> >> + there's a difference in the prologue and epilogue costs. */ >> >> + diff = compare_outside_loop_cost (other); >> >> + if (diff != 0) >> >> + return diff < 0; >> >> + >> >> + return false; >> >> +} >> >> + >> >> + >> >> +/* See the comment above the declaration for details. 
*/ >> >> + >> >> +bool >> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other, >> >> + loop_vec_info main_loop) const >> >> +{ >> >> + loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo); >> >> + loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo); >> >> + >> >> + poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo); >> >> + poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo); >> >> + >> >> + poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop); >> >> + unsigned HOST_WIDE_INT main_vf; >> >> + unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, >> >> this_cost; >> >> + /* If we can determine how many iterations are left for the epilogue >> >> + loop, that is if both the main loop's vectorization factor and >> >> number >> >> + of iterations are constant, then we use them to calculate the cost >> >> of >> >> + the epilogue loop together with a 'likely value' for the epilogues >> >> + vectorization factor. Otherwise we use the main loop's >> >> vectorization >> >> + factor and the maximum poly value for the epilogue's. If the target >> >> + has not provided with a sensible upper bound poly vectorization >> >> + factors are likely to be favored over constant ones. */ >> >> + if (main_poly_vf.is_constant (&main_vf) >> >> + && LOOP_VINFO_NITERS_KNOWN_P (main_loop)) >> >> + { >> >> + unsigned HOST_WIDE_INT niters >> >> + = LOOP_VINFO_INT_NITERS (main_loop) % main_vf; >> >> + HOST_WIDE_INT other_likely_vf >> >> + = estimated_poly_value (other_vf, POLY_VALUE_LIKELY); >> >> + HOST_WIDE_INT this_likely_vf >> >> + = estimated_poly_value (this_vf, POLY_VALUE_LIKELY); >> >> + >> >> + /* If the epilogue is using partial vectors we account for the >> >> + partial iteration here too. 
*/ >> >> + other_factor = niters / other_likely_vf; >> >> + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo) >> >> + && niters % other_likely_vf != 0) >> >> + other_factor++; >> >> + >> >> + this_factor = niters / this_likely_vf; >> >> + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo) >> >> + && niters % this_likely_vf != 0) >> >> + this_factor++; >> >> + } >> >> + else >> >> + { >> >> + unsigned HOST_WIDE_INT main_vf_max >> >> + = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); >> >> + >> >> + other_factor = main_vf_max / estimated_poly_value (other_vf, >> >> + POLY_VALUE_MAX); >> >> + this_factor = main_vf_max / estimated_poly_value (this_vf, >> >> + POLY_VALUE_MAX); >> >> + >> >> + /* If the loop is not using partial vectors then it will iterate >> >> one >> >> + time less than one that does. It is safe to subtract one here, >> >> + because the main loop's vf is always at least 2x bigger than that >> >> + of an epilogue. */ >> >> + if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)) >> >> + other_factor -= 1; >> >> + if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)) >> >> + this_factor -= 1; >> >> + } >> >> + >> >> + /* Compute the costs by multiplying the inside costs with the factor >> >> and >> >> + add the outside costs for a more complete picture. The factor is >> >> the >> >> + amount of times we are expecting to iterate this epilogue. */ >> >> + other_cost = other->body_cost () * other_factor; >> >> + this_cost = this->body_cost () * this_factor; >> >> + other_cost += other->outside_cost (); >> >> + this_cost += this->outside_cost (); >> >> + return this_cost < other_cost; >> >> +} >> >> + >> >> +/* A <=>-style subroutine of better_main_loop_than_p. Check whether we >> >> can >> >> + determine the return value of better_main_loop_than_p by comparing the >> >> + inside (loop body) costs of THIS and OTHER. Return: >> >> + >> >> + * -1 if better_main_loop_than_p should return true. 
>> >> + * 1 if better_main_loop_than_p should return false. >> >> + * 0 if we can't decide. */ >> >> + >> >> +int >> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const >> >> +{ >> >> + loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo); >> >> + loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo); >> >> + >> >> + struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo); >> >> + gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop); >> >> + >> >> + poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo); >> >> + poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo); >> >> + >> >> + /* Limit the VFs to what is likely to be the maximum number of >> >> iterations, >> >> + to handle cases in which at least one loop_vinfo is fully-masked. >> >> */ >> >> + HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int >> >> (loop); >> >> + if (estimated_max_niter != -1) >> >> + { >> >> + if (known_le (estimated_max_niter, this_vf)) >> >> + this_vf = estimated_max_niter; >> >> + if (known_le (estimated_max_niter, other_vf)) >> >> + other_vf = estimated_max_niter; >> >> + } >> >> + >> >> + /* Check whether the (fractional) cost per scalar iteration is lower or >> >> + higher: this_inside_cost / this_vf vs. other_inside_cost / >> >> other_vf. 
*/ >> >> + poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * >> >> other_vf; >> >> + poly_int64 rel_other >> >> + = other_loop_vinfo->vector_costs->body_cost () * this_vf; >> >> + >> >> + HOST_WIDE_INT est_rel_this_min >> >> + = estimated_poly_value (rel_this, POLY_VALUE_MIN); >> >> + HOST_WIDE_INT est_rel_this_max >> >> + = estimated_poly_value (rel_this, POLY_VALUE_MAX); >> >> + >> >> + HOST_WIDE_INT est_rel_other_min >> >> + = estimated_poly_value (rel_other, POLY_VALUE_MIN); >> >> + HOST_WIDE_INT est_rel_other_max >> >> + = estimated_poly_value (rel_other, POLY_VALUE_MAX); >> >> + >> >> + /* Check first if we can make out an unambigous total order from the >> >> minimum >> >> + and maximum estimates. */ >> >> + if (est_rel_this_min < est_rel_other_min >> >> + && est_rel_this_max < est_rel_other_max) >> >> + return -1; >> >> + >> >> + if (est_rel_other_min < est_rel_this_min >> >> + && est_rel_other_max < est_rel_this_max) >> >> + return 1; >> >> + >> >> + /* When other_loop_vinfo uses a variable vectorization factor, >> >> + we know that it has a lower cost for at least one runtime VF. >> >> + However, we don't know how likely that VF is. >> >> + >> >> + One option would be to compare the costs for the estimated VFs. >> >> + The problem is that that can put too much pressure on the cost >> >> + model. E.g. if the estimated VF is also the lowest possible VF, >> >> + and if other_loop_vinfo is 1 unit worse than this_loop_vinfo >> >> + for the estimated VF, we'd then choose this_loop_vinfo even >> >> + though (a) this_loop_vinfo might not actually be better than >> >> + other_loop_vinfo for that VF and (b) it would be significantly >> >> + worse at larger VFs. >> >> + >> >> + Here we go for a hacky compromise: pick this_loop_vinfo if it is >> >> + no more expensive than other_loop_vinfo even after doubling the >> >> + estimated other_loop_vinfo VF. 
For all but trivial loops, this >> >> + ensures that we only pick this_loop_vinfo if it is significantly >> >> + better than other_loop_vinfo at the estimated VF. */ >> >> + if (est_rel_other_min != est_rel_this_min >> >> + || est_rel_other_max != est_rel_this_max) >> >> + { >> >> + HOST_WIDE_INT est_rel_this_likely >> >> + = estimated_poly_value (rel_this, POLY_VALUE_LIKELY); >> >> + HOST_WIDE_INT est_rel_other_likely >> >> + = estimated_poly_value (rel_other, POLY_VALUE_LIKELY); >> >> + >> >> + return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1; >> >> + } >> >> + >> >> + return 0; >> >> +} >> >> + >> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is >> >> + nothing to choose between the inside (loop body) costs of THIS and >> >> OTHER. >> >> + Check whether we can determine the return value of >> >> better_main_loop_than_p >> >> + by comparing the outside (prologue and epilogue) costs of THIS and >> >> OTHER. >> >> + Return: >> >> + >> >> + * -1 if better_main_loop_than_p should return true. >> >> + * 1 if better_main_loop_than_p should return false. >> >> + * 0 if we can't decide. */ >> >> + >> >> +int >> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const >> >> +{ >> >> + auto this_outside_cost = this->outside_cost (); >> >> + auto other_outside_cost = other->outside_cost (); >> >> + if (this_outside_cost != other_outside_cost) >> >> + return this_outside_cost < other_outside_cost ? -1 : 1; >> >> + >> >> + return 0; >> >> +} >> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h >> >> index 87d3f211a2e..0e3aad590e8 100644 >> >> --- a/gcc/tree-vectorizer.h >> >> +++ b/gcc/tree-vectorizer.h >> >> @@ -1419,6 +1419,21 @@ public: >> >> read back using the functions below. */ >> >> virtual void finish_cost (); >> >> >> >> + /* The costs in THIS and OTHER both describe ways of vectorizing >> >> + a main loop. 
Return true if the costs described by THIS are >> >> + cheaper than the costs described by OTHER. Return false if any >> >> + of the following are true: >> >> + >> >> + - THIS and OTHER are of equal cost >> >> + - OTHER is better than THIS >> >> + - we can't be sure about the relative costs of THIS and OTHER. */ >> >> + virtual bool better_main_loop_than_p (const vector_costs *other) const; >> >> + >> >> + /* Likewise, but the costs in THIS and OTHER both describe ways of >> >> + vectorizing an epilogue loop of MAIN_LOOP. */ >> >> + virtual bool better_epilogue_loop_than_p (const vector_costs *other, >> >> + loop_vec_info main_loop) >> >> const; >> >> + >> >> unsigned int prologue_cost () const; >> >> unsigned int body_cost () const; >> >> unsigned int epilogue_cost () const; >> >> @@ -1429,6 +1444,8 @@ protected: >> >> unsigned int); >> >> unsigned int adjust_cost_for_freq (stmt_vec_info, >> >> vect_cost_model_location, >> >> unsigned int); >> >> + int compare_inside_loop_cost (const vector_costs *) const; >> >> + int compare_outside_loop_cost (const vector_costs *) const; >> >> >> >> /* The region of code that we're considering vectorizing. */ >> >> vec_info *m_vinfo; >> >> -- >> >> 2.25.1 >> >>