Hi Richard, > +@item x86-vect-unroll-min-ldst-threshold > +The vectorizer will check with target information to determine whether > +unroll it. This parameter is used to limit the mininum of loads and > +stores in the main loop. > > It's odd to "limit" the minimum number of something. I think this warrants > clarification that for some (unknow to me ;)) reason we think that when we > have many loads and (or?) stores it is beneficial to unroll to get even more > loads and stores in a single iteration. Btw, does the parameter limit the > number of loads and stores _after_ unrolling or before? > When the number of loads/stores exceeds the threshold, the loads/stores are more likely to conflict with loop itself in the L1 cache(Assuming that address of loads are scattered). Unroll + software scheduling will make 2 or 4 address contiguous loads/stores closer together, it will reduce cache miss rate.
> +@item x86-vect-unroll-max-loop-size > +The vectorizer will check with target information to determine whether > +unroll it. This threshold is used to limit the max size of loop body after > unrolling. > +The default value is 200. > > it should probably say not "size" but "number of instructions". Note that 200 > is quite large given we are talking about vector instructions here which have > larger encodings than scalar instructions. Optimistically assuming > 4 byte encoding (quite optimistic give we're looking at loops with many > loads/stores) that would be an 800 byte loop body which would be 25 cache > lines. > ISTR that at least the loop discovery is limited to a lot smaller cases (but > we > are likely not targeting that). The limit probably still works to fit the > loop > body in the u-op caches though. > Agree with you, it should be "x86-vect-unroll-max-loop-insns". Thanks for the reminder about larger encodings, I checked the skylake uop cache, it can hold 1.5k uOPs, 200 * 2 (1~3 uops/instruction) = 400 uops. I think 200 still work. > That said, the heuristic made me think "what the heck". Can we explain in u- > arch terms why the unrolling is beneficial instead of just defering to SPEC > CPU 2017 fotonik? > Regarding the benefits, I explained in the first answer, I checked 5 hottest functions in the 549, they all benefit from it, it improves the cache hit ratio. Thanks, Lili. > > On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > Hi Hongtao, > > > > > > This patch introduces function finish_cost and > > > determine_suggested_unroll_factor for x86 backend, to make it be > > > able to suggest the unroll factor for a given loop being vectorized. > > > Referring to aarch64, RS6000 backends and basing on the analysis on > > > SPEC2017 performance evaluation results. > > > > > > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu. > > > > > > OK for trunk? > > > > > > > > > > > > With this patch, SPEC2017 performance evaluation results on > > > ICX/CLX/ADL/Znver3 are listed below: > > > > > > For single copy: > > > - ICX: 549.fotonik3d_r +6.2%, the others are neutral > > > - CLX: 549.fotonik3d_r +1.9%, the others are neutral > > > - ADL: 549.fotonik3d_r +4.5%, the others are neutral > > > - Znver3: 549.fotonik3d_r +4.8%, the others are neutral > > > > > > For multi-copy: > > > - ADL: 549.fotonik3d_r +2.7%, the others are neutral > > > > > > gcc/ChangeLog: > > > > > > * config/i386/i386.cc (class ix86_vector_costs): Add new members > > > m_nstmts, m_nloads m_nstores and > determine_suggested_unroll_factor. > > > (ix86_vector_costs::add_stmt_cost): Update for m_nstores, > m_nloads > > > and m_nstores. > > > (ix86_vector_costs::determine_suggested_unroll_factor): New > function. > > > (ix86_vector_costs::finish_cost): Diito. > > > * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter. > > > (x86-vect-unroll-min-ldst-threshold): Likewise. > > > (x86-vect-unroll-max-loop-size): Likewise. > > > * doc/invoke.texi: Document new parameter. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops. > > > * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto. > > > * gcc.target/i386/vect-alignment-peeling-1.c: Ditto. > > > * gcc.target/i386/vect-alignment-peeling-2.c: Ditto. > > > * gcc.target/i386/vect-reduc-1.c: Ditto. > > > --- > > > gcc/config/i386/i386.cc | 106 ++++++++++++++++++ > > > gcc/config/i386/i386.opt | 15 +++ > > > gcc/doc/invoke.texi | 17 +++ > > > .../gcc.target/i386/cond_op_maxmin_b-1.c | 2 +- > > > .../gcc.target/i386/cond_op_maxmin_ub-1.c | 2 +- > > > .../i386/vect-alignment-peeling-1.c | 2 +- > > > .../i386/vect-alignment-peeling-2.c | 2 +- > > > gcc/testsuite/gcc.target/i386/vect-reduc-1.c | 2 +- > > > 8 files changed, 143 insertions(+), 5 deletions(-) > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index > > > aeea26ef4be..a939354e55e 100644 > > > --- a/gcc/config/i386/i386.cc > > > +++ b/gcc/config/i386/i386.cc > > > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs > > > stmt_vec_info stmt_info, slp_tree node, > > > tree vectype, int misalign, > > > vect_cost_model_location where) > > > override; > > > + > > > + unsigned int determine_suggested_unroll_factor (loop_vec_info); > > > + > > > + void finish_cost (const vector_costs *) override; > > > + > > > + /* Total number of vectorized stmts (loop only). */ unsigned > > > + m_nstmts = 0; > > > + /* Total number of loads (loop only). */ unsigned m_nloads = 0; > > > + /* Total number of stores (loop only). */ unsigned m_nstores = > > > + 0; > > > }; > > > > > > /* Implement targetm.vectorize.create_costs. */ > > > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > > > retval = (retval * 17) / 10; > > > } > > > > > > + if (!m_costing_for_scalar > > > + && is_a<loop_vec_info> (m_vinfo) > > > + && where == vect_body) > > > + { > > > + m_nstmts += count; > > > + if (kind == scalar_load || kind == vector_load > > > + || kind == unaligned_load || kind == vector_gather_load) > > > + m_nloads += count; > > > + else if (kind == scalar_store || kind == vector_store > > > + || kind == unaligned_store || kind == vector_scatter_store) > > > + m_nstores += count; > > > + } > > > + > > > m_costs[where] += retval; > > > > > > return retval; > > > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, > class loop *loop) > > > return nunroll; > > > } > > > > > > +unsigned int > > > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info > loop_vinfo) > > > +{ > > > + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > > > + > > > + /* Don't unroll if it's specified explicitly not to be unrolled. */ > > > + if (loop->unroll == 1 > > > + || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops) > > > + || (OPTION_SET_P (flag_unroll_all_loops) && > > > !flag_unroll_all_loops)) > > > + return 1; > > > + > > > + /* Don't unroll if there is no vectorized stmt. */ > > > + if (m_nstmts == 0) > > > + return 1; > > > + > > > + /* Don't unroll if vector size is zmm, since zmm throughput is lower > than other > > > + sizes. */ > > > + if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64) > > > + return 1; > > > + > > > + /* Calc the total number of loads and stores in the loop body. */ > > > + unsigned int nstmts_ldst = m_nloads + m_nstores; > > > + > > > + /* Don't unroll if loop body size big than threshold, the threshold > > > + is a heuristic value inspired by param_max_unrolled_insns. */ > > > + unsigned int uf = m_nstmts < (unsigned > int)x86_vect_unroll_max_loop_size > > > + ? ((unsigned int)x86_vect_unroll_max_loop_size / > > > m_nstmts) > > > + : 1; > > > + uf = MIN ((unsigned int)x86_vect_unroll_limit, uf); > > > + uf = 1 << ceil_log2 (uf); > > > + > > > + /* Early return if don't need to unroll. */ > > > + if (uf == 1) > > > + return 1; > > > + > > > + /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll > the loop > > > + if the number of loads and stores exceeds the threshold, unroll + > software > > > + schedule will reduce cache miss rate. */ > > > + if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold) > > > + return uf; > > > + > > > + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop); > > > + unsigned int vf = vect_vf_for_cost (loop_vinfo); > > > + unsigned int unrolled_vf = vf * uf; > > > + if (est_niter == -1 || est_niter < unrolled_vf) > > > + /* When the estimated iteration of this loop is unknown, it's > > > possible > > > + that we are able to vectorize this loop with the original VF but > > > fail > > > + to vectorize it with the unrolled VF any more if the actual > > > iteration > > > + count is in between. */ > > > + return 1; > > > + else > > > + { > > > + unsigned int epil_niter_unr = est_niter % unrolled_vf; > > > + unsigned int epil_niter = est_niter % vf; > > > + /* Even if we have partial vector support, it can be still > > > inefficent > > > + to calculate the length when the iteration count is unknown, so > > > + only expect it's good to unroll when the epilogue iteration count > > > + is not bigger than VF (only one time length calculation). */ > > > + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) > > > + && epil_niter_unr <= vf) > > > + return uf; > > > + /* Without partial vector support, conservatively unroll this when > > > + the epilogue iteration count is less than the original one > > > + (epilogue execution time wouldn't be longer than before). */ > > > + else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) > > > + && epil_niter_unr <= epil_niter) > > > + return uf; > > > + } > > > + > > > + return 1; > > > +} > > > + > > > +void > > > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > > > + > > > +{ > > > + if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo)) > > > + { > > > + m_suggested_unroll_factor = determine_suggested_unroll_factor > (loop_vinfo); > > > + } > > > + vector_costs::finish_cost (scalar_costs); > > > +} > > > > > > /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P. > */ > > > > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt > > > index 53d534f6392..8e49b406aa5 100644 > > > --- a/gcc/config/i386/i386.opt > > > +++ b/gcc/config/i386/i386.opt > > > @@ -1224,3 +1224,18 @@ mavxvnniint8 > > > Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save > > > Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and > > > AVXVNNIINT8 built-in functions and code generation. > > > + > > > +-param=x86-vect-unroll-limit= > > > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) > IntegerRange(1, 8) Param > > > +Used to limit unroll factor which indicates how much the autovectorizer > may > > > +unroll a loop. The default value is 4. > > > + > > > +-param=x86-vect-unroll-min-ldst-threshold= > > > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) > Init(25) Param > > > +Used to limit the mininum of loads and stores in the main loop. The > default > > > +value is 25. > > > + > > > +-param=x86-vect-unroll-max-loop-size= > > > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) > Param > > > +This threshold is used to limit the maxnum size of loop body after > unrolling. > > > +The default value is 200. > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > > index 09548c4528c..c86d686f2cd 100644 > > > --- a/gcc/doc/invoke.texi > > > +++ b/gcc/doc/invoke.texi > > > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are > available on i386 and x86_64 targets: > > > @item x86-stlf-window-ninsns > > > Instructions number above which STFL stall penalty can be compensated. > > > > > > +@item x86-vect-unroll-limit > > > +The vectorizer will check with target information to determine whether > it > > > +would be beneficial to unroll the main vectorized loop and by how much. > This > > > +parameter sets the upper bound of how much the vectorizer will unroll > the main > > > +loop. The default value is four. > > > + > > > +@item x86-vect-unroll-min-ldst-threshold > > > +The vectorizer will check with target information to determine whether > unroll > > > +it. This parameter is used to limit the mininum of loads and stores in > > > the > main > > > +loop. > > > + > > > +@item x86-vect-unroll-max-loop-size > > > +The vectorizer will check with target information to determine whether > unroll > > > +it. This threshold is used to limit the max size of loop body after > > > unrolling. > > > +The default value is 200. > > > + > > > + > > > @end table > > > > > > @end table > > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > > > index 78c6600f83b..3bf1fb1b12d 100644 > > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree- > optimized" } */ > > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll- > loops -fdump-tree-optimized" } */ > > > /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */ > > > /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */ > > > /* { dg-final { scan-assembler-times "vpmaxsb" 1 } } */ > > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > > > index 117179f2109..ba41fd64386 100644 > > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree- > optimized" } */ > > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll- > loops -fdump-tree-optimized" } */ > > > /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */ > > > /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */ > > > /* { dg-final { scan-assembler-times "vpmaxub" 1 } } */ > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > > > index 4aa536ba86c..fd2f054af4a 100644 > > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > > > @@ -2,7 +2,7 @@ > > > /* This is a test exercising peeling for alignment for a negative step > > > vector loop. We're forcing atom tuning here because that has a higher > > > unaligned vs aligned cost unlike most other archs. */ > > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect- > details -save-temps" } */ > > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops - > fdump-tree-vect-details -save-temps" } */ > > > > > > float a[1024], b[1024]; > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > > > index 834bf0f770d..62c0db2bb9a 100644 > > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > > > @@ -2,7 +2,7 @@ > > > /* This is a test exercising peeling for alignment for a positive step > > > vector loop. We're forcing atom tuning here because that has a higher > > > unaligned vs aligned cost unlike most other archs. */ > > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect- > details -save-temps" } */ > > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops - > fdump-tree-vect-details -save-temps" } */ > > > > > > float a[1024], b[1024]; > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > > > index 9ee9ba4e736..1ba4be01bea 100644 > > > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > > > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > > > @@ -1,5 +1,5 @@ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } > */ > > > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump- > tree-vect-details" } */ > > > > > > #define N 32 > > > int foo (int *a, int n) > > > -- > > > 2.17.1 > > > > > > Thanks, > > > Lili. > > > > > > > > -- > > BR, > > Hongtao