On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
> > > >
> > > > GCC 12 enables vectorization at O2 with the very cheap cost model,
> > > > which is restricted to constant trip counts. The vectorization
> > > > capability is very limited, with consideration of the codesize impact.
> > > >
> > > > The patch extends the very cheap cost model a little to support
> > > > variable trip counts, but still disables peeling for gaps/alignment,
> > > > runtime alias checking, and epilogue vectorization out of codesize
> > > > considerations.
> > > >
> > > > So there are at most 2 versions of the loop for O2 vectorization:
> > > > one vectorized main loop and one scalar/remainder loop.
> > > >
> > > > i.e.
> > > >
> > > > void
> > > > foo1 (int* __restrict a, int* b, int* c, int n)
> > > > {
> > > >  for (int i = 0; i != n; i++)
> > > >   a[i] = b[i] + c[i];
> > > > }
> > > >
> > > > with -O2 -march=x86-64-v3, will be vectorized to
> > > >
> > > > .L10:
> > > >         vmovdqu (%r8,%rax), %ymm0
> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > > >         vmovdqu %ymm0, (%rdi,%rax)
> > > >         addq    $32, %rax
> > > >         cmpq    %rdx, %rax
> > > >         jne     .L10
> > > >         movl    %ecx, %eax
> > > >         andl    $-8, %eax
> > > >         cmpl    %eax, %ecx
> > > >         je      .L21
> > > >         vzeroupper
> > > > .L12:
> > > >         movl    (%r8,%rax,4), %edx
> > > >         addl    (%rsi,%rax,4), %edx
> > > >         movl    %edx, (%rdi,%rax,4)
> > > >         addq    $1, %rax
> > > >         cmpl    %eax, %ecx
> > > >         jne     .L12
> > > >
> > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves
> > > > performance by 4.11% with an extra 2.8% codesize, while the cheap
> > > > cost model improves performance by 5.74% with an extra 8.88%
> > > > codesize. The details are as below.
> > >
> > > I'm confused by this: are the N-Iter numbers on top of the cheap cost
> > > model numbers?
> > No, it's N-Iter vs. base (very cheap cost model), and cheap vs. base.
> > >
> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > > >
> > > >                     N-Iter      cheap cost model
> > > > 500.perlbench_r     -0.12%      -0.12%
> > > > 502.gcc_r           0.44%       -0.11%
> > > > 505.mcf_r           0.17%       4.46%
> > > > 520.omnetpp_r       0.28%       -0.27%
> > > > 523.xalancbmk_r     0.00%       5.93%
> > > > 525.x264_r          -0.09%      23.53%
> > > > 531.deepsjeng_r     0.19%       0.00%
> > > > 541.leela_r         0.22%       0.00%
> > > > 548.exchange2_r     -11.54%     -22.34%
> > > > 557.xz_r            0.74%       0.49%
> > > > GEOMEAN INT         -1.04%      0.60%
> > > >
> > > > 503.bwaves_r        3.13%       4.72%
> > > > 507.cactuBSSN_r     1.17%       0.29%
> > > > 508.namd_r          0.39%       6.87%
> > > > 510.parest_r        3.14%       8.52%
> > > > 511.povray_r        0.10%       -0.20%
> > > > 519.lbm_r           -0.68%      10.14%
> > > > 521.wrf_r           68.20%      76.73%
> > >
> > > So this seems to regress as well?
> > N-Iter improves performance less than the cheap cost model; that's
> > expected, and it is not a regression.
> > >
> > > > 526.blender_r       0.12%       0.12%
> > > > 527.cam4_r          19.67%      23.21%
> > > > 538.imagick_r       0.12%       0.24%
> > > > 544.nab_r           0.63%       0.53%
> > > > 549.fotonik3d_r     14.44%      9.43%
> > > > 554.roms_r          12.39%      0.00%
> > > > GEOMEAN FP          8.26%       9.41%
> > > > GEOMEAN ALL         4.11%       5.74%
>
> I've tested the patch on aarch64; it shows a similar improvement with
> little codesize increase.
> I haven't tested it on other backends, but I think it would show
> similarly good improvements.

I think overall this is expected since a constant niter divisible by
the VF isn't a common situation.  So the question is mostly whether
we want to pay the size penalty or not.

Looking only at the docs, the proposed change would make the very-cheap
cost model nearly(?) equivalent to the cheap one, so maybe the answer
is to default to cheap rather than very-cheap?  One difference seems to
be that cheap allows alias versioning.

Richard.

> > > >
> > > > Code size impact
> > > >                     N-Iter      cheap cost model
> > > > 500.perlbench_r     0.22%       1.03%
> > > > 502.gcc_r           0.25%       0.60%
> > > > 505.mcf_r           0.00%       32.07%
> > > > 520.omnetpp_r       0.09%       0.31%
> > > > 523.xalancbmk_r     0.08%       1.86%
> > > > 525.x264_r          0.75%       7.96%
> > > > 531.deepsjeng_r     0.72%       3.28%
> > > > 541.leela_r         0.18%       0.75%
> > > > 548.exchange2_r     8.29%       12.19%
> > > > 557.xz_r            0.40%       0.60%
> > > > GEOMEAN INT         1.07%       5.71%
> > > >
> > > > 503.bwaves_r        12.89%      21.59%
> > > > 507.cactuBSSN_r     0.90%       20.19%
> > > > 508.namd_r          0.77%       14.75%
> > > > 510.parest_r        0.91%       3.91%
> > > > 511.povray_r        0.45%       4.08%
> > > > 519.lbm_r           0.00%       0.00%
> > > > 521.wrf_r           5.97%       12.79%
> > > > 526.blender_r       0.49%       3.84%
> > > > 527.cam4_r          1.39%       3.28%
> > > > 538.imagick_r       1.86%       7.78%
> > > > 544.nab_r           0.41%       3.00%
> > > > 549.fotonik3d_r     25.50%      47.47%
> > > > 554.roms_r          5.17%       13.01%
> > > > GEOMEAN FP          4.14%       11.38%
> > > > GEOMEAN ALL         2.80%       8.88%
> > > >
> > > >
> > > > The only regression is in 548.exchange2_r: vectorizing the inner
> > > > loop in each layer of the 9-layer loop nest increases register
> > > > pressure and causes more spills.
> > > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > > >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > >     .....
> > > >         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > > >     ...
> > > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > > >
> > > > It looks like aarch64 doesn't have the issue because aarch64 has 32
> > > > GPRs while x86 only has 16. I have an extra patch that prevents loop
> > > > vectorization in deeply nested loops for the x86 backend, which can
> > > > bring the performance back.
> > > >
> > > > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap
> > > > cost model increases codesize a lot but doesn't improve performance
> > > > at all, and N-Iter is much better there for codesize.
> > > >
> > > >
> > > > Any comments?
> > > >
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> > > >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> > > >         cost model.
> > > >         (vect_analyze_loop): Disable epilogue vectorization in very
> > > >         cheap cost model.
> > > > ---
> > > >  gcc/tree-vect-loop.cc | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index 242d5e2d916..06afd8cae79 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > >       a copy of the scalar code (even if we might be able to vectorize it).  */
> > > >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> > > >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > > > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
> > >
> > > I notice that we should probably not call
> > > vect_enhance_data_refs_alignment, because when alignment peeling is
> > > optional we should avoid it rather than disable vectorization
> > > completely.
> > >
> > > Also, if you allow peeling for niter then there's no good reason not
> > > to allow peeling for gaps (or any other epilogue peeling).
> > Maybe, I just want to be conservative.
> > >
> > > The extra cost for niter peeling is a runtime check before the loop,
> > > which would also happen (plus keeping the scalar copy) when there's a
> > > runtime cost check.  That also means versioning for alias/alignment
> > > could be allowed if it shares the scalar loop with the epilogue (I
> > > don't remember the constraints we set in place for the sharing).
> > Yes, but in current GCC the alias runtime check creates a separate
> > scalar loop
> > https://godbolt.org/z/9seoWePKK
> > And enabling the alias runtime check could increase codesize too much
> > without any performance improvement.
> >
> > >
> > > Richard.
> > >
> > > >      {
> > > >        if (dump_enabled_p ())
> > > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > > >                            /* No code motion support for multiple epilogues so for now
> > > >                               not supported when multiple exits.  */
> > > >                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > > > -                        && !loop->simduid);
> > > > +                        && !loop->simduid
> > > > +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> > > >    if (!vect_epilogues)
> > > >      return first_loop_vinfo;
> > > >
> > > > --
> > > > 2.31.1
> > > >
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao
