On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
> > > >
> > > > GCC 12 enables vectorization at -O2 with the very-cheap cost model,
> > > > which is restricted to constant trip counts. The vectorization
> > > > capability is very limited, with consideration of the codesize impact.
> > > >
> > > > The patch extends the very-cheap cost model a little bit to support
> > > > variable trip counts, but still disables peeling for gaps/alignment,
> > > > runtime aliasing checks and epilogue vectorization out of codesize
> > > > concerns.
> > > >
> > > > So there are at most 2 versions of a loop for -O2 vectorization: one
> > > > vectorized main loop and one scalar/remainder loop.
> > > >
> > > > I.e.
> > > >
> > > > void
> > > > foo1 (int* __restrict a, int* b, int* c, int n)
> > > > {
> > > >   for (int i = 0; i != n; i++)
> > > >     a[i] = b[i] + c[i];
> > > > }
> > > >
> > > > with -O2 -march=x86-64-v3 will be vectorized to
> > > >
> > > > .L10:
> > > >         vmovdqu (%r8,%rax), %ymm0
> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > > >         vmovdqu %ymm0, (%rdi,%rax)
> > > >         addq    $32, %rax
> > > >         cmpq    %rdx, %rax
> > > >         jne     .L10
> > > >         movl    %ecx, %eax
> > > >         andl    $-8, %eax
> > > >         cmpl    %eax, %ecx
> > > >         je      .L21
> > > >         vzeroupper
> > > > .L12:
> > > >         movl    (%r8,%rax,4), %edx
> > > >         addl    (%rsi,%rax,4), %edx
> > > >         movl    %edx, (%rdi,%rax,4)
> > > >         addq    $1, %rax
> > > >         cmpl    %eax, %ecx
> > > >         jne     .L12
> > > >
> > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves
> > > > performance by 4.11% with an extra 2.8% codesize, and the cheap cost
> > > > model improves performance by 5.74% with an extra 8.88% codesize.
The details are as below.
> > >
> > > I'm confused by this, are the N-Iter numbers on top of the cheap cost
> > > model numbers?
> > No, it's N-Iter vs. base (very-cheap cost model), and cheap vs. base.
> > >
> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > > >
> > > >                       N-Iter     cheap cost model
> > > > 500.perlbench_r       -0.12%     -0.12%
> > > > 502.gcc_r              0.44%     -0.11%
> > > > 505.mcf_r              0.17%      4.46%
> > > > 520.omnetpp_r          0.28%     -0.27%
> > > > 523.xalancbmk_r        0.00%      5.93%
> > > > 525.x264_r            -0.09%     23.53%
> > > > 531.deepsjeng_r        0.19%      0.00%
> > > > 541.leela_r            0.22%      0.00%
> > > > 548.exchange2_r      -11.54%    -22.34%
> > > > 557.xz_r               0.74%      0.49%
> > > > GEOMEAN INT           -1.04%      0.60%
> > > >
> > > > 503.bwaves_r           3.13%      4.72%
> > > > 507.cactuBSSN_r        1.17%      0.29%
> > > > 508.namd_r             0.39%      6.87%
> > > > 510.parest_r           3.14%      8.52%
> > > > 511.povray_r           0.10%     -0.20%
> > > > 519.lbm_r             -0.68%     10.14%
> > > > 521.wrf_r             68.20%     76.73%
> > >
> > > So this seems to regress as well?
> > N-Iter increases performance less than the cheap cost model; that's
> > expected, it is not a regression.
> > >
> > > > 526.blender_r          0.12%      0.12%
> > > > 527.cam4_r            19.67%     23.21%
> > > > 538.imagick_r          0.12%      0.24%
> > > > 544.nab_r              0.63%      0.53%
> > > > 549.fotonik3d_r       14.44%      9.43%
> > > > 554.roms_r            12.39%      0.00%
> > > > GEOMEAN FP             8.26%      9.41%
> > > > GEOMEAN ALL            4.11%      5.74%
> I've tested the patch on aarch64; it shows a similar improvement with
> little codesize increase.
> I haven't tested it on other backends, but I think it would show
> similarly good improvements.
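[Editorial note: a minimal sketch, not compiler output. The two-loop shape the patch produces for foo1 above — one vectorized main loop over n & -8 elements (cf. "andl $-8, %eax" in the asm) plus one scalar remainder loop — can be written in plain C. The name foo1_shape is illustrative, and the inner 8-wide loop stands in for one AVX2 iteration.]

```c
#include <stddef.h>

/* Sketch of the two-loop structure: a main loop handling n & -8
   elements (8 ints per "vector" iteration, like .L10 above) and a
   scalar remainder loop for the tail (like .L12 above).  */
static void
foo1_shape (int *restrict a, const int *b, const int *c, int n)
{
  int main_n = n & -8;          /* elements covered by the vector loop */
  int i = 0;
  for (; i < main_n; i += 8)    /* "vectorized" main loop */
    for (int j = 0; j < 8; j++)
      a[i + j] = b[i + j] + c[i + j];
  for (; i < n; i++)            /* scalar remainder loop */
    a[i] = b[i] + c[i];
}
```

Note how a variable n simply falls through to the remainder loop, which is why the patch no longer needs a constant trip count divisible by the vectorization factor.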
I think overall this is expected, since a constant niter divisible by
the VF isn't a common situation. So the question is mostly whether we
want to pay the size penalty or not. Looking only at the docs, the
proposed change would make the very-cheap cost model nearly(?)
equivalent to the cheap one, so maybe the answer is to default to cheap
rather than very-cheap? One difference seems to be that cheap allows
alias versioning.

Richard.

> > > > Code size impact
> > > >
> > > >                       N-Iter     cheap cost model
> > > > 500.perlbench_r        0.22%      1.03%
> > > > 502.gcc_r              0.25%      0.60%
> > > > 505.mcf_r              0.00%     32.07%
> > > > 520.omnetpp_r          0.09%      0.31%
> > > > 523.xalancbmk_r        0.08%      1.86%
> > > > 525.x264_r             0.75%      7.96%
> > > > 531.deepsjeng_r        0.72%      3.28%
> > > > 541.leela_r            0.18%      0.75%
> > > > 548.exchange2_r        8.29%     12.19%
> > > > 557.xz_r               0.40%      0.60%
> > > > GEOMEAN INT            1.07%      5.71%
> > > >
> > > > 503.bwaves_r          12.89%     21.59%
> > > > 507.cactuBSSN_r        0.90%     20.19%
> > > > 508.namd_r             0.77%     14.75%
> > > > 510.parest_r           0.91%      3.91%
> > > > 511.povray_r           0.45%      4.08%
> > > > 519.lbm_r              0.00%      0.00%
> > > > 521.wrf_r              5.97%     12.79%
> > > > 526.blender_r          0.49%      3.84%
> > > > 527.cam4_r             1.39%      3.28%
> > > > 538.imagick_r          1.86%      7.78%
> > > > 544.nab_r              0.41%      3.00%
> > > > 549.fotonik3d_r       25.50%     47.47%
> > > > 554.roms_r             5.17%     13.01%
> > > > GEOMEAN FP             4.14%     11.38%
> > > > GEOMEAN ALL            2.80%      8.88%
> > > >
> > > > The only regression is from 548.exchange2_r: the vectorization of the
> > > > inner loop in each layer of the 9-layer loop nest increases register
> > > > pressure and causes more spills.
> > > >
> > > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > > .....
> > > > - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > > > ...
> > > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > > >
> > > > It looks like aarch64 doesn't have the issue because aarch64 has 32
> > > > GPRs, but x86 only has 16. I have an extra patch to prevent loop
> > > > vectorization in deeply nested loops for the x86 backend which can
> > > > bring the performance back.
> > > >
> > > > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost
> > > > model increases codesize a lot but doesn't improve performance at
> > > > all, and N-Iter is much better than it for codesize.
> > > >
> > > > Any comments?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> > > >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> > > >         cost model.
> > > >         (vect_analyze_loop): Disable epilogue vectorization in very
> > > >         cheap cost model.
> > > > ---
> > > >  gcc/tree-vect-loop.cc | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index 242d5e2d916..06afd8cae79 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > >       a copy of the scalar code (even if we might be able to vectorize it).  */
> > > >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> > > >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > > > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
> > >
> > > I notice that we should probably not call vect_enhance_data_refs_alignment,
> > > because when alignment peeling is optional we should avoid it rather than
> > > disabling the vectorization completely.
> > >
> > > Also if you allow peeling for niter then there's no good reason to not
> > > allow peeling for gaps (or any other epilogue peeling).
> > Maybe; I just want to be conservative.
> > >
> > > The extra cost for niter peeling is a runtime check before the loop,
> > > which would also happen (plus keeping the scalar copy) when there's a
> > > runtime cost check. That also means versioning for alias/alignment
> > > could be allowed if it shares the scalar loop with the epilogue (I
> > > don't remember the constraints we set in place for the sharing).
> > Yes, but in current GCC the alias runtime check creates a separate
> > scalar loop:
> > https://godbolt.org/z/9seoWePKK
> > And enabling the runtime alias check could increase codesize too much
> > without any performance improvement.
> > >
> > > Richard.
> > >
> > > >      {
> > > >        if (dump_enabled_p ())
> > > >          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > > >                         /* No code motion support for multiple epilogues so for now
> > > >                            not supported when multiple exits.  */
> > > >                         && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > > > -                       && !loop->simduid);
> > > > +                       && !loop->simduid
> > > > +                       && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> > > >    if (!vect_epilogues)
> > > >      return first_loop_vinfo;
> > > >
> > > > --
> > > > 2.31.1
> >
> >
> > --
> > BR,
> > Hongtao
>
> --
> BR,
> Hongtao
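[Editorial note: a rough sketch of the runtime alias versioning discussed above, which the cheap cost model allows but the patch keeps disabled for very-cheap. The function name and the exact overlap test are illustrative, not GCC's actual generated code; the point is that the versioned loop body is duplicated, which is where much of the extra codesize comes from.]

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of loop versioning for aliasing: without __restrict the
   compiler guards the vectorizable version with a runtime overlap
   check and keeps a full scalar copy of the loop for the other arm.  */
static void
add_versioned (int *a, const int *b, const int *c, int n)
{
  uintptr_t ap = (uintptr_t) a, bp = (uintptr_t) b, cp = (uintptr_t) c;
  size_t bytes = (size_t) n * sizeof (int);

  /* Runtime alias check: can the stores to a overlap either input?  */
  if ((ap + bytes <= bp || bp + bytes <= ap)
      && (ap + bytes <= cp || cp + bytes <= ap))
    {
      /* Versioned loop: safe to vectorize (the compiler would emit
         SIMD here; plain C stands in for it).  */
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    }
  else
    {
      /* Separate scalar loop kept for the possibly-aliasing case.  */
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    }
}
```

Both arms compute the same result; the duplication exists only so the fast arm can assume non-overlap, which matches the godbolt observation that the alias check produces a separate scalar loop.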