On Mon, Jun 19, 2023 at 8:35 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The following works around the lack of the x86 backend making the
> vectorizer compare the costs of the different possible vector
> sizes the backend advertises through the vector_modes hook.  When
> enabling masked epilogues or main loops this means we will
> select the preferred vector mode, which is usually the largest, even
> for loops that do not iterate close to the number of lanes the
> vector has.  When not using masking the vectorizer would reject any
> mode resulting in a VF bigger than the number of iterations,
> but with masking the excess lanes are simply masked out.
>
> So this overloads the finish_cost function and matches the
> problematic case, forcing a high cost to make us try a
> smaller vector size.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.  This should
> avoid regressing 525.x264_r with partial vector epilogues and
> instead improves it by 25% with -march=znver4 (need to re-check
> that; it was true with an earlier attempt).
>
> This falls short of enabling cost comparison in the x86 backend,
> which I also considered doing for --param vect-partial-vector-usage=1,
> but that would also cause a much larger churn and compile-time
> impact (though it should be bearable, as seen with aarch64).
>
> I've filed PR110310 for an oddity I noticed around vectorizing
> epilogues; I failed to adjust things for the case in that PR.
>
> I'm using INT_MAX to fend off the vectorizer; I wondered whether
> we should be able to signal that with a bool return value of
> finish_cost?  Though INT_MAX seems to work fine.
>
> Does this look reasonable?

Reasonable for me, even for VECT_COMPARE_COSTS.

> Thanks,
> Richard.
>
>         * config/i386/i386.cc (ix86_vector_costs::finish_cost):
>         Overload.  For masked main loops make sure the vectorization
>         factor isn't more than double the number of iterations.
>
>         * gcc.target/i386/vect-partial-vectors-1.c: New testcase.
>         * gcc.target/i386/vect-partial-vectors-2.c: Likewise.
> ---
>  gcc/config/i386/i386.cc                       | 26 +++++++++++++++++++
>  .../gcc.target/i386/vect-partial-vectors-1.c  | 13 ++++++++++
>  .../gcc.target/i386/vect-partial-vectors-2.c  | 12 +++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b20cb86b822..32851a514a9 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23666,6 +23666,7 @@ class ix86_vector_costs : public vector_costs
>                            stmt_vec_info stmt_info, slp_tree node,
>                            tree vectype, int misalign,
>                            vect_cost_model_location where) override;
> +  void finish_cost (const vector_costs *) override;
>  };
>
>  /* Implement targetm.vectorize.create_costs.  */
> @@ -23918,6 +23919,31 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    return retval;
>  }
>
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> +{
> +  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> +  if (loop_vinfo && !m_costing_for_scalar)
> +    {
> +      /* We are currently not asking the vectorizer to compare costs
> +         between different vector mode sizes.  When using predication
> +         that will end up always choosing the preferred mode size even
> +         if there's a smaller mode covering all lanes.  Test for this
> +         situation and artificially reject the larger mode attempt.
> +         ??? We currently lack masked ops for sub-SSE sized modes,
> +         so we could restrict this rejection to AVX and AVX512 modes
> +         but error on the safe side for now.  */
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> +          && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +          && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +          && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> +              > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> +        m_costs[vect_body] = INT_MAX;
> +    }
> +
> +  vector_costs::finish_cost (scalar_costs);
> +}
> +
>  /* Validate target specific memory model bits in VAL.  */
>
>  static unsigned HOST_WIDE_INT
> diff --git a/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
> new file mode 100644
> index 00000000000..3834720e8e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mprefer-vector-width=512 --param vect-partial-vector-usage=1" } */
> +
> +void foo (int * __restrict a, int *b)
> +{
> +  for (int i = 0; i < 4; ++i)
> +    a[i] = b[i] + 42;
> +}
> +
> +/* We do not want to optimize this using masked AVX or AVX512
> +   but unmasked SSE.  */
> +/* { dg-final { scan-assembler-not "\[yz\]mm" } } */
> +/* { dg-final { scan-assembler "xmm" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
> new file mode 100644
> index 00000000000..4ab2cbc4203
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mprefer-vector-width=512 --param vect-partial-vector-usage=1" } */
> +
> +void foo (int * __restrict a, int *b)
> +{
> +  for (int i = 0; i < 7; ++i)
> +    a[i] = b[i] + 42;
> +}
> +
> +/* We want to optimize this using masked AVX, not AVX512 or SSE.  */
> +/* { dg-final { scan-assembler-not "zmm" } } */
> +/* { dg-final { scan-assembler "ymm\[^\r\n\]*\{%k" } } */
> --
> 2.35.3
--
BR,
Hongtao