On Mon, 19 Jan 2026, Hongtao Liu wrote:
> On Fri, Jan 16, 2026 at 10:23 PM Richard Biener <[email protected]> wrote:
> >
> > On Fri, 16 Jan 2026, Liu, Hongtao wrote:
> >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Friday, January 16, 2026 6:23 PM
> > > > To: [email protected]
> > > > Cc: Liu, Hongtao <[email protected]>
> > > > Subject: [PATCH] target/123603 - add --param ix86-vect-compare-costs
> > > >
> > > > The following allows switching the x86 target to use the vectorizer's cost
> > > > comparison mechanism to select between different vector mode variants of
> > > > a vectorization. The default is still not to do this, but this allows an
> > > > opt-in.
> > > >
> > >
> > > The patch LGTM.
> > >
> > > > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> > > >
> > > > For next stage1 I'll probably propose flipping the switch (or not adding
> > > > the switch at all). I'll follow up with a report on how CPU 2017 behaves
> > > > with this on vs.
> > >
> > > If possible, we should run the next SPEC CPU benchmarks (with more
> > > vectorization) to decide whether to switch it on.
> > > I did similar tests on SPEC CPU 2017 two years ago - no clear benefits
> > > and longer compile times, probably due to the crude cost model.
> > >
> > > > off before considering whether to ask if we want this switch for GCC 16
> > > > or not (e.g. if it only has overly negative effects).
> > >
> > > It would be quite interesting if we could find that some benchmarks do
> > > show benefits.
> >
> > On SPEC CPU 2017 for -Ofast -march=znver4 this shows 2463 out of
> > 39706 vectorized loops changing mode. In 503 out of 12378 cases
> > we decided to not use masked epilogs. Compile-time increases by ~1%
> > overall.
> > With a quick 1-run there do not seem to be any effects outside of noise
> > for INT with this particular optimization and target option combination
> > on the hardware I ran on. For FP, 549.fotonik3d_r improves by 6%
> > (confirmed with a 2-run).
> Interesting.
>
> >
> > This was triggered by PR123190 and PR123603, which have cases where
> > comparing costs would have resulted in the faster vector size being
> > used. Both were reported for -O2 -march=x86-64-v3 -flto and with PGO.
> > The 548.exchange2_r regression recorded in PR123603 with these flags
> > is resolved with the --param (performance improves by 13%). I don't
> > have SPEC 2006 on that machine so did not verify the PR123190 433.milc
> > regression, but that has been improved with the two earlier patches.
> > The --param has no effect on the testcase in the PR.
> >
> > I do expect that some of our tricks in the x86 cost model to make
> > larger vector sizes unprofitable will become obsolete or even
> > counter-productive with cost comparison turned on.
> >
> > I think the above shows having the knob is useful, if only to
> > gather more data.
>
> I will test this separately on Intel P-cores and E-cores
> (theoretically, the cost comparison should be
> architecture-independent, but more testing might expose issues with
> the current cost model or certain limitations of cost comparison). If
> there are no negative results, considering that the current compile
> time overhead is relatively small, we can indeed enable this in the
> next stage1.

I have done an additional run (also on Zen4 hardware) with
-march=x86-64-v3 -O3 -flto where the only runtime effects
are a ~2% improvement for 521.wrf_r, a 1% slowdown of 500.perlbench_r
(but also with 1% noise, so unclear) and a 1% slowdown of 505.mcf_r
(also slightly noisy).

I'd also like us to switch for stage1 (and not make this a target
tunable), not least so that we can eventually get rid of the various
"hacks" we have accumulated that pessimize some code-gen to force
lower vector sizes by making larger ones not profitable at all.

I have pushed the --param patch now and would appreciate more data,
especially on negative fallout.
Richard.
> >
> > In case there's no negative feedback I plan to merge this early
> > next week.
> >
> > Thanks,
> > Richard.
> >
> > > >
> > > > PR target/123603
> > > > * config/i386/i386.opt (-param=ix86-vect-compare-costs=): Add.
> > > > * config/i386/i386.cc (ix86_autovectorize_vector_modes): Honor it.
> > > > * doc/invoke.texi (ix86-vect-compare-costs): Document.
> > > >
> > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr123603.c: New testcase.
> > > > ---
> > > >  gcc/config/i386/i386.cc                         |  2 +-
> > > >  gcc/config/i386/i386.opt                        |  4 ++++
> > > >  gcc/doc/invoke.texi                             |  3 +++
> > > >  .../vect/costmodel/x86_64/costmodel-pr123603.c  | 15 +++++++++++++++
> > > >  4 files changed, 23 insertions(+), 1 deletion(-)
> > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123603.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index 6bf4af8bbe3..a3d0f7cb649 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -25700,7 +25700,7 @@ ix86_autovectorize_vector_modes (vector_modes *modes, bool all)
> > > >    if (TARGET_SSE2)
> > > >      modes->safe_push (V4QImode);
> > > >
> > > > -  return 0;
> > > > +  return ix86_vect_compare_costs ? VECT_COMPARE_COSTS : 0;
> > > >  }
> > > >
> > > >  /* Implemenation of targetm.vectorize.get_mask_mode.  */
> > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > index 99bb674812b..ef9efabcff6 100644
> > > > --- a/gcc/config/i386/i386.opt
> > > > +++ b/gcc/config/i386/i386.opt
> > > > @@ -1249,6 +1249,10 @@ Enable conservative small loop unrolling.
> > > >  Target Joined UInteger Var(ix86_vect_unroll_limit) Init(4) Param
> > > >  Limit how much the autovectorizer may unroll a loop.
> > > >
> > > > +-param=ix86-vect-compare-costs=
> > > > +Target Joined UInteger Var(ix86_vect_compare_costs) Init(0) IntegerRange(0, 1) Param Optimization
> > > > +Whether x86 vectorizer cost modeling compares costs of different vector sizes.
> > > > +
> > > >  mlam=
> > > >  Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type) Init(lam_none)
> > > >  -mlam=[none|u48|u57] Instrument meta data position in user data pointers.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index b703b531d75..5092e4ba9ad 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -18213,6 +18213,9 @@ the discovery is aborted.
> > > > @item ix86-vect-unroll-limit
> > > > Limit how much the autovectorizer may unroll a loop.
> > > >
> > > > +@item ix86-vect-compare-costs
> > > > +Whether x86 vectorizer cost modeling compares costs of different vector sizes.
> > > > +
> > > > @end table
> > > >
> > > > @end table
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123603.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123603.c
> > > > new file mode 100644
> > > > index 00000000000..c074176a7e4
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123603.c
> > > > @@ -0,0 +1,15 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-additional-options "--param ix86-vect-compare-costs=1" } */
> > > > +
> > > > +void foo (int *block)
> > > > +{
> > > > +  for (int i = 0; i < 3; ++i)
> > > > +    {
> > > > +      int a = block[i*9];
> > > > +      int b = block[i*9+1];
> > > > +      block[i*9] = a + 10;
> > > > +      block[i*9+1] = b + 10;
> > > > +    }
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump "optimized: loop vectorized using 8 byte vectors" "vect" } } */
> > > > --
> > > > 2.51.0
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)