RE: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-11-02 Thread Cui, Lili via Gcc-patches
> > > +@item x86-vect-unroll-min-ldst-threshold
> > > +The vectorizer will check with target information to determine
> > > +whether to unroll it. This parameter is used to limit the minimum of
> > > +loads and stores in the main loop.
> > >
> > > It's odd to "limit" the minimum number of something.  I think this
> > > warrants clarification that for some (unknown to me ;)) reason we
> > > think that when we have many loads and (or?) stores it is beneficial
> > > to unroll to get even more loads and stores in a single iteration.
> > > Btw, does the parameter limit the number of loads and stores _after_
> > > unrolling or before?
> > >
> > When the number of loads/stores exceeds the threshold, the loads/stores
> > are more likely to conflict with the loop itself in the L1 cache (assuming
> > that the addresses of the loads are scattered).
> > Unroll + software scheduling will bring 2 or 4 address-contiguous
> > loads/stores closer together, which will reduce the cache miss rate.
> 
> Ah, nice.  Can we express the default as a function of L1 data cache size, L1
> cache line size and more importantly, the size of the vector memory access?
> 
> Btw, I was looking into making a more meaningful cost modeling for loop
> distribution.  Similar reasoning might apply there - try to _reduce_ the
> number of memory streams so L1 cache utilization allows re-use of a cache
> line in the next [next N] iteration[s]?  OTOH given L1D is quite large I'd
> expect the loops affected to be either quite huge or bottlenecked by
> load/store bandwidth (there are 1024 L1D cache lines in zen2 for
> example) - what's the effective L1D load you are keying off?
> Btw, how does L1D allocation on stores play a role here?
> 
Hi Richard,
To answer your question, I rechecked 549 and found that its improvement
comes from load reduction. The hot code is a 3-level loop nest, and 8 scalar
loads in the inner loop are loop invariants (due to high register pressure,
these loop invariants are all spilled to the stack).
After unrolling the inner loop, those scalar loads are not doubled, so
unrolling reduces the number of load instructions and L1/L2/L3 accesses. The
inner loop accesses 8 different three-dimensional arrays with sizes like
"a[128][480][128]". Although these arrays are very large, this does not
support the theory I gave before; sorry for that. I need to hold this patch
to see if we can do something about this scenario.
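
The load-reduction effect described above can be illustrated with a small
counting model. This is only a sketch under stated assumptions, not code from
the patch: `LoadModel` and `total_loads` are hypothetical names, and the model
assumes that each trip through the (possibly unrolled) loop body reloads the
spilled invariants exactly once, while the array loads scale with the unroll
factor.

```cpp
#include <cassert>

// Hypothetical model (not taken from the patch): per trip through the loop
// body, the spilled invariants are reloaded once, and every original
// iteration contained in the body issues its own array loads.
struct LoadModel
{
  unsigned invariant_loads;  // e.g. the 8 spilled loop-invariant scalars
  unsigned array_loads;      // loads per original iteration (assumed value)
};

// Total load instructions executed to cover `iterations` original
// iterations when the body is unrolled `uf` times.
inline unsigned long
total_loads (LoadModel m, unsigned long iterations, unsigned uf)
{
  // Keep the model simple: no epilogue loop is modeled.
  assert (uf != 0 && iterations % uf == 0);
  return (iterations / uf) * (m.invariant_loads + uf * m.array_loads);
}
```

With 8 invariant loads and an assumed 8 array loads per iteration, 128
iterations cost 2048 loads without unrolling, 1536 with an unroll factor of 2,
and 1280 with an unroll factor of 4. The saving comes entirely from the
invariant reloads being amortized over more work per trip.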

Thanks,
Lili.




Re: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-28 Thread Richard Biener via Gcc-patches
On Wed, Oct 26, 2022 at 1:38 PM Cui, Lili  wrote:
>
> Hi Richard,
>
> > +@item x86-vect-unroll-min-ldst-threshold
> > +The vectorizer will check with target information to determine whether to
> > +unroll it. This parameter is used to limit the minimum of loads and
> > +stores in the main loop.
> >
> > It's odd to "limit" the minimum number of something.  I think this warrants
> > clarification that for some (unknown to me ;)) reason we think that when we
> > have many loads and (or?) stores it is beneficial to unroll to get even more
> > loads and stores in a single iteration.  Btw, does the parameter limit the
> > number of loads and stores _after_ unrolling or before?
> >
> When the number of loads/stores exceeds the threshold, the loads/stores are
> more likely to conflict with the loop itself in the L1 cache (assuming that
> the addresses of the loads are scattered).
> Unroll + software scheduling will bring 2 or 4 address-contiguous
> loads/stores closer together, which will reduce the cache miss rate.

Ah, nice.  Can we express the default as a function of L1 data cache
size, L1 cache line size and more importantly, the size of the vector
memory access?

Btw, I was looking into making a more meaningful cost modeling for loop
distribution.  Similar reasoning might apply there - try to _reduce_ the
number of memory streams so L1 cache utilization allows re-use of a
cache line in the next [next N] iteration[s]?  OTOH given L1D is quite
large I'd expect the loops affected to be either quite huge or bottlenecked
by load/store bandwidth (there are 1024 L1D cache lines in zen2 for
example) - what's the effective L1D load you are keying off?
Btw, how does L1D allocation on stores play a role here?

> > +@item x86-vect-unroll-max-loop-size
> > +The vectorizer will check with target information to determine whether to
> > +unroll it. This threshold is used to limit the max size of loop body after unrolling.
> > +The default value is 200.
> >
> > It should probably say not "size" but "number of instructions".  Note that
> > 200 is quite large given we are talking about vector instructions here
> > which have larger encodings than scalar instructions.  Optimistically
> > assuming 4 byte encoding (quite optimistic given we're looking at loops
> > with many loads/stores) that would be an 800 byte loop body which would be
> > 25 cache lines.
> > ISTR that at least the loop discovery is limited to a lot smaller cases
> > (but we are likely not targeting that).  The limit probably still works to
> > fit the loop body in the u-op caches though.
> >
> Agree with you, it should be "x86-vect-unroll-max-loop-insns". Thanks for the
> reminder about larger encodings. I checked the Skylake uop cache: it can hold
> 1.5k uops, and 200 * 2 (1~3 uops/instruction) = 400 uops, so I think 200
> still works.
>
> > That said, the heuristic made me think "what the heck".  Can we explain in
> > u-arch terms why the unrolling is beneficial instead of just deferring to
> > SPEC CPU 2017 fotonik?
> >
> Regarding the benefits, I explained them in the first answer. I checked the 5
> hottest functions in 549; they all benefit from it, as it improves the cache
> hit ratio.
>
> Thanks,
> Lili.

RE: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-26 Thread Cui, Lili via Gcc-patches
Hi Richard,

> +@item x86-vect-unroll-min-ldst-threshold
> +The vectorizer will check with target information to determine whether to
> +unroll it. This parameter is used to limit the minimum of loads and
> +stores in the main loop.
>
> It's odd to "limit" the minimum number of something.  I think this warrants
> clarification that for some (unknown to me ;)) reason we think that when we
> have many loads and (or?) stores it is beneficial to unroll to get even more
> loads and stores in a single iteration.  Btw, does the parameter limit the
> number of loads and stores _after_ unrolling or before?
>
When the number of loads/stores exceeds the threshold, the loads/stores are
more likely to conflict with the loop itself in the L1 cache (assuming that
the addresses of the loads are scattered).
Unroll + software scheduling will bring 2 or 4 address-contiguous loads/stores
closer together, which will reduce the cache miss rate.

> +@item x86-vect-unroll-max-loop-size
> +The vectorizer will check with target information to determine whether to
> +unroll it. This threshold is used to limit the max size of loop body after unrolling.
> +The default value is 200.
>
> It should probably say not "size" but "number of instructions".  Note that 200
> is quite large given we are talking about vector instructions here which have
> larger encodings than scalar instructions.  Optimistically assuming
> 4 byte encoding (quite optimistic given we're looking at loops with many
> loads/stores) that would be an 800 byte loop body which would be 25 cache
> lines.
> ISTR that at least the loop discovery is limited to a lot smaller cases (but
> we are likely not targeting that).  The limit probably still works to fit the
> loop body in the u-op caches though.
>
Agree with you, it should be "x86-vect-unroll-max-loop-insns". Thanks for the
reminder about larger encodings. I checked the Skylake uop cache: it can hold
1.5k uops, and 200 * 2 (1~3 uops/instruction) = 400 uops, so I think 200 still
works.
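
The uop budget check above can be written down as a small helper. This is a
sketch only: `estimated_uops` and `fits_uop_cache` are hypothetical names, the
factor of 2 uops per instruction is the midpoint of the 1~3 range quoted above,
and reading "1.5k" as 1536 entries is an assumption.

```cpp
#include <cassert>

// Estimated uop count for a loop body of `insns` instructions, assuming a
// fixed average of `uops_per_insn` uops per instruction.
inline unsigned
estimated_uops (unsigned insns, unsigned uops_per_insn)
{
  return insns * uops_per_insn;
}

// True if the estimated uop count fits in a uop cache holding
// `uop_cache_capacity` uops.  The capacity is an input, not a value known
// to this sketch.
inline bool
fits_uop_cache (unsigned insns, unsigned uops_per_insn,
		unsigned uop_cache_capacity)
{
  assert (uop_cache_capacity > 0);
  return estimated_uops (insns, uops_per_insn) <= uop_cache_capacity;
}
```

For the default of 200 instructions this gives 400 uops at 2 uops per
instruction and 600 uops at the pessimistic 3 uops per instruction; both are
below the assumed capacity of 1536.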

> That said, the heuristic made me think "what the heck".  Can we explain in
> u-arch terms why the unrolling is beneficial instead of just deferring to
> SPEC CPU 2017 fotonik?
>
Regarding the benefits, I explained them in the first answer. I checked the 5
hottest functions in 549; they all benefit from it, as it improves the cache
hit ratio.

Thanks,
Lili.


Re: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-25 Thread Richard Biener via Gcc-patches
On Tue, Oct 25, 2022 at 7:46 AM Hongtao Liu  wrote:
>
> Any comments?

+@item x86-vect-unroll-min-ldst-threshold
+The vectorizer will check with target information to determine whether to unroll
+it. This parameter is used to limit the minimum of loads and stores in the main
+loop.

It's odd to "limit" the minimum number of something.  I think this warrants
clarification that for some (unknown to me ;)) reason we think that when we
have many loads and (or?) stores it is beneficial to unroll to get even more
loads and stores in a single iteration.  Btw, does the parameter limit the
number of loads and stores _after_ unrolling or before?

+@item x86-vect-unroll-max-loop-size
+The vectorizer will check with target information to determine whether to unroll
+it. This threshold is used to limit the max size of loop body after unrolling.
+The default value is 200.

It should probably say not "size" but "number of instructions".  Note that 200
is quite large given we are talking about vector instructions here which have
larger encodings than scalar instructions.  Optimistically assuming
4 byte encoding (quite optimistic given we're looking at loops with many
loads/stores) that would be an 800 byte loop body which would be 25 cache lines.
ISTR that at least the loop discovery is limited to a lot smaller cases (but
we are likely not targeting that).  The limit probably still works to fit the
loop body in the u-op caches though.
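
The code-size estimate above is plain arithmetic and can be sketched as
follows. The helper names `body_bytes` and `units_spanned` are hypothetical,
and the sketch assumes the loop body starts on a unit boundary. Note that 800
bytes corresponds to 25 units of 32 bytes; with 64-byte cache lines the same
body would span 13 lines. Which unit was intended is not stated in the thread,
so the helper takes the unit size as a parameter.

```cpp
#include <cassert>

// Byte size of a loop body of `insns` instructions, assuming every
// instruction is encoded in `bytes_per_insn` bytes.
inline unsigned
body_bytes (unsigned insns, unsigned bytes_per_insn)
{
  return insns * bytes_per_insn;
}

// Number of `unit_bytes`-sized units (cache lines, fetch windows, ...) that
// a body of `bytes` bytes spans when it starts on a unit boundary.
inline unsigned
units_spanned (unsigned bytes, unsigned unit_bytes)
{
  assert (unit_bytes > 0);
  return (bytes + unit_bytes - 1) / unit_bytes;  // round up
}
```

For 200 instructions at 4 bytes each, `body_bytes` gives 800;
`units_spanned (800, 32)` is 25 and `units_spanned (800, 64)` is 13.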

That said, the heuristic made me think "what the heck".  Can we explain
in u-arch terms why the unrolling is beneficial instead of just deferring to
SPEC CPU 2017 fotonik?


Re: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-24 Thread Hongtao Liu via Gcc-patches
Any comments?


[PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-23 Thread Cui,Lili via Gcc-patches
Hi Hongtao,

This patch introduces the functions finish_cost and
determine_suggested_unroll_factor for the x86 backend, so that it can
suggest an unroll factor for a given loop being vectorized.
The implementation follows the aarch64 and rs6000 backends and is based on
analysis of SPEC2017 performance evaluation results.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

OK for trunk?



With this patch, SPEC2017 performance evaluation results on
ICX/CLX/ADL/Znver3 are listed below:

For single copy:
  - ICX: 549.fotonik3d_r +6.2%, the others are neutral
  - CLX: 549.fotonik3d_r +1.9%, the others are neutral
  - ADL: 549.fotonik3d_r +4.5%, the others are neutral
  - Znver3: 549.fotonik3d_r +4.8%, the others are neutral

For multi-copy:
  - ADL: 549.fotonik3d_r +2.7%, the others are neutral

gcc/ChangeLog:

* config/i386/i386.cc (class ix86_vector_costs): Add new members
m_nstmts, m_nloads, m_nstores and determine_suggested_unroll_factor.
(ix86_vector_costs::add_stmt_cost): Update for m_nstmts, m_nloads
and m_nstores.
(ix86_vector_costs::determine_suggested_unroll_factor): New function.
(ix86_vector_costs::finish_cost): Ditto.
* config/i386/i386.opt (x86-vect-unroll-limit): New parameter.
(x86-vect-unroll-min-ldst-threshold): Likewise.
(x86-vect-unroll-max-loop-size): Likewise.
* doc/invoke.texi: Document new parameters.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
* gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
* gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
* gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
* gcc.target/i386/vect-reduc-1.c: Ditto.
---
 gcc/config/i386/i386.cc   | 106 ++
 gcc/config/i386/i386.opt  |  15 +++
 gcc/doc/invoke.texi   |  17 +++
 .../gcc.target/i386/cond_op_maxmin_b-1.c  |   2 +-
 .../gcc.target/i386/cond_op_maxmin_ub-1.c |   2 +-
 .../i386/vect-alignment-peeling-1.c   |   2 +-
 .../i386/vect-alignment-peeling-2.c   |   2 +-
 gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
 8 files changed, 143 insertions(+), 5 deletions(-)
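
For readers skimming the diff below, the size-based part of
determine_suggested_unroll_factor can be modeled in isolation. This is a
sketch of only the step visible in this mail: the function name
`size_capped_unroll_factor` is hypothetical, and the later steps of the
heuristic (which involve the unroll limit and the load/store threshold) are
not modeled here.

```cpp
#include <cassert>

// Stand-alone model of the size cap: if the vectorized body has fewer
// statements than the limit, allow as many copies as fit under the limit;
// otherwise do not unroll.
inline unsigned
size_capped_unroll_factor (unsigned nstmts, unsigned max_loop_size)
{
  // The patch returns 1 earlier when there are no vectorized statements,
  // so a zero count never reaches the division.
  assert (nstmts > 0);
  return nstmts < max_loop_size ? max_loop_size / nstmts : 1;
}
```

With the default limit of 200, a body of 40 statements gets a factor of 5,
while bodies of 150, 200 or 300 statements all get a factor of 1.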

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index aeea26ef4be..a939354e55e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
  stmt_vec_info stmt_info, slp_tree node,
  tree vectype, int misalign,
  vect_cost_model_location where) override;
+
+  unsigned int determine_suggested_unroll_factor (loop_vec_info);
+
+  void finish_cost (const vector_costs *) override;
+
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned m_nstmts = 0;
+  /* Total number of loads (loop only).  */
+  unsigned m_nloads = 0;
+  /* Total number of stores (loop only).  */
+  unsigned m_nstores = 0;
 };
 
 /* Implement targetm.vectorize.create_costs.  */
@@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
retval = (retval * 17) / 10;
 }
 
+  if (!m_costing_for_scalar
+      && is_a<loop_vec_info> (m_vinfo)
+      && where == vect_body)
+    {
+      m_nstmts += count;
+      if (kind == scalar_load || kind == vector_load
+          || kind == unaligned_load || kind == vector_gather_load)
+        m_nloads += count;
+      else if (kind == scalar_store || kind == vector_store
+               || kind == unaligned_store || kind == vector_scatter_store)
+        m_nstores += count;
+    }
+
   m_costs[where] += retval;
 
   return retval;
@@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   return nunroll;
 }
 
+unsigned int
+ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Don't unroll if it's specified explicitly not to be unrolled.  */
+  if (loop->unroll == 1
+      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
+      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
+    return 1;
+
+  /* Don't unroll if there is no vectorized stmt.  */
+  if (m_nstmts == 0)
+    return 1;
+
+  /* Don't unroll if vector size is zmm, since zmm throughput is lower than
+     other sizes.  */
+  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+    return 1;
+
+  /* Calc the total number of loads and stores in the loop body.  */
+  unsigned int nstmts_ldst = m_nloads + m_nstores;
+
+  /* Don't unroll if the loop body size is bigger than the threshold; the
+     threshold is a heuristic value inspired by param_max_unrolled_insns.  */
+  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
+                    ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
+                    : 1;
+  uf = MIN ((unsigned