Re: [PATCH 2/2] [i386] Adjust costing of emulated vectorized gather/scatter

Jan Hubicka via Gcc-patches Fri, 24 Mar 2023 06:12:40 -0700

> Emulated gather/scatter behave similarly to strided elementwise
> accesses in that they need to decompose the offset vector
> and construct or decompose the data vector, so handle them
> the same way, pessimizing the cases with many elements.
> 
> For pr88531-2c.c instead of
> 
> .L4:
>         leaq    (%r15,%rcx), %rdx
>         incl    %edi
>         movl    16(%rdx), %r13d
>         movl    24(%rdx), %r14d
>         movl    (%rdx), %r10d
>         movl    4(%rdx), %r9d
>         movl    8(%rdx), %ebx
>         movl    12(%rdx), %r11d
>         movl    20(%rdx), %r12d
>         vmovss  (%rax,%r14,4), %xmm2
>         movl    28(%rdx), %edx
>         vmovss  (%rax,%r13,4), %xmm1
>         vmovss  (%rax,%r10,4), %xmm0
>         vinsertps       $0x10, (%rax,%rdx,4), %xmm2, %xmm2
>         vinsertps       $0x10, (%rax,%r12,4), %xmm1, %xmm1
>         vinsertps       $0x10, (%rax,%r9,4), %xmm0, %xmm0
>         vmovlhps        %xmm2, %xmm1, %xmm1
>         vmovss  (%rax,%rbx,4), %xmm2
>         vinsertps       $0x10, (%rax,%r11,4), %xmm2, %xmm2
>         vmovlhps        %xmm2, %xmm0, %xmm0
>         vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
>         vmulps  %ymm3, %ymm0, %ymm0
>         vmovups %ymm0, (%r8,%rcx)
>         addq    $32, %rcx
>         cmpl    %esi, %edi
>         jb      .L4
> 
> we now prefer
> 
> .L4:
>         leaq    0(%rbp,%rdx,8), %rcx
>         movl    (%rcx), %r10d
>         movl    4(%rcx), %ecx
>         vmovss  (%rsi,%r10,4), %xmm0
>         vinsertps       $0x10, (%rsi,%rcx,4), %xmm0, %xmm0
>         vmulps  %xmm1, %xmm0, %xmm0
>         vmovlps %xmm0, (%rbx,%rdx,8)
>         incq    %rdx
>         cmpl    %edi, %edx
>         jb      .L4
> 
> which vectorizes with SSE instead of AVX2, which looks like an
> improvement.
> 
> When testing this on SPEC CPU 2017 with -Ofast -flto -march=znver4
> there are quite a few cases where we now prefer SSE vectorization
> over AVX512 + AVX2 epilogue and some cases where we now reject
> vectorization.  At runtime the changes are within noise, with the
> off-noise candidates better after the patch.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> OK for stage1?
> 
> Thanks,
> Richard.
> 
>       * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>       Tame down element extracts and scalar loads for gather/scatter
>       similar to elementwise strided accesses.
> 
>       * gcc.target/i386/pr89618-2.c: New testcase.
>       * gcc.target/i386/pr88531-2b.c: Adjust.
>       * gcc.target/i386/pr88531-2c.c: Likewise.
OK.


Honza
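
For readers unfamiliar with the term, the "emulated gather" in the quoted
description can be sketched in plain C.  This is only an illustration: the
function names and the loop shape are assumptions and are not taken from
pr88531-2c.c or from the patch.  The second function spells out, for two
lanes, the work the quoted text attributes to the emulation: decompose the
offset vector into scalars, do one scalar load per element, and construct the
data vector from the results.

```c
#include <stddef.h>

/* Hypothetical indexed-load loop (an assumption, not the testcase source).
   Vectorizing it requires a gather for data[idx[i]].  */
void scale_gather(float *out, const float *data, const unsigned *idx,
                  float scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = scale * data[idx[i]];
}

/* Hand-written two-lane equivalent of an emulated gather:
   extract each offset, load each element separately, build the
   two-element data vector, then do the arithmetic on both lanes.  */
void scale_gather_emulated2(float *out, const float *data,
                            const unsigned *idx, float scale, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        unsigned off0 = idx[i];                    /* decompose offset vector */
        unsigned off1 = idx[i + 1];
        float lane[2] = { data[off0], data[off1] }; /* scalar loads + construct */
        out[i]     = scale * lane[0];
        out[i + 1] = scale * lane[1];
    }
    for (; i < n; i++)                             /* scalar tail */
        out[i] = scale * data[idx[i]];
}
```

The per-element extract and load steps grow with the number of lanes, which
is the overhead the patch description says it now charges more heavily for
wide vectors.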

Re: [PATCH 2/2] [i386] Adjust costing of emulated vectorized gather/scatter
