On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei <b...@broadcom.com> wrote:
> Hi,
> I was looking at some loops that can be vectorized by LLVM but not by GCC. One
> such type of loop has a store with a negative step.
>
> void test1(short * __restrict__ x, short * __restrict__ y, short * 
> __restrict__ z)
> {
>     int i;
>     for (i=127; i>=0; i--) {
>         x[i] = y[127-i] + z[127-i];
>     }
> }
>
> I don't know why GCC implements negative step only for loads and not for stores. I
> implemented a patch that is very similar to the code in vectorizable_load.
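>
> To illustrate the transformation (a conceptual scalar sketch only, assuming a
> vectorization factor of 8 shorts per iteration; this is neither the generated
> code nor the patch itself): the loads from y and z stay contiguous with a
> positive step, each result chunk is element-reversed, and the chunk is then
> stored at a descending address in x.
>
> void test1_sketch(short * __restrict__ x, short * __restrict__ y,
>                   short * __restrict__ z)
> {
>     int j, k;
>     for (j = 0; j < 128; j += 8) {        /* one "vector" iteration   */
>         short tmp[8];
>         for (k = 0; k < 8; k++)           /* contiguous loads + add   */
>             tmp[k] = y[j + k] + z[j + k];
>         for (k = 0; k < 8; k++)           /* reverse elements, store  */
>             x[127 - (j + k)] = tmp[k];    /* at a descending address  */
>     }
> }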
>
> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>
> Without patch:
> test1:
> .LFB0:
>         addq    $254, %rdi
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         movzwl  (%rsi,%rax), %ecx
>         subq    $2, %rdi
>         addw    (%rdx,%rax), %cx
>         addq    $2, %rax
>         movw    %cx, 2(%rdi)
>         cmpq    $256, %rax
>         jne     .L2
>         rep; ret
>
> With patch:
> test1:
> .LFB0:
>         vmovdqa .LC0(%rip), %xmm1
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vmovdqu (%rsi,%rax), %xmm0
>         movq    %rax, %rcx
>         negq    %rcx
>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>         vpshufb %xmm1, %xmm0, %xmm0
>         addq    $16, %rax
>         cmpq    $256, %rax
>         vmovups %xmm0, 240(%rdi,%rcx)
>         jne     .L2
>         rep; ret
>
> Performance is clearly improved here. The patch bootstraps on
> x86_64-unknown-linux-gnu and shows no additional regressions on my machine.
>
> For reference, LLVM uses different instructions and generates slightly worse
> code, though I am not very familiar with x86 assembly. The patch was originally
> written for our private port.
> test1:                                  # @test1
>         .cfi_startproc
> # BB#0:                                 # %entry
>         addq    $240, %rdi
>         xorl    %eax, %eax
>         .align  16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
>         movdqu  (%rsi,%rax,2), %xmm0
>         movdqu  (%rdx,%rax,2), %xmm1
>         paddw   %xmm0, %xmm1
>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>         movdqu  %xmm0, (%rdi)
>         addq    $8, %rax
>         addq    $-16, %rdi
>         cmpq    $128, %rax
>         jne     .LBB0_1
> # BB#2:                                 # %for.end
>         ret
>
> Any comments?

Looks good to me.  One of the various TODOs in vectorizable_store, I presume.

The patch needs a testcase and, at this stage, a bug report that it fixes.
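
A minimal sketch of what such a gcc.dg/vect testcase could look like (the file
placement, the effective-target requirement, and the exact dump scans are
assumptions to be adapted to the testsuite conventions):

/* { dg-require-effective-target vect_int } */

#define N 128

void __attribute__ ((noinline))
test1 (short *__restrict__ x, short *__restrict__ y, short *__restrict__ z)
{
  int i;
  /* Negative-step store into x, positive-step loads from y and z.  */
  for (i = N - 1; i >= 0; i--)
    x[i] = y[N - 1 - i] + z[N - 1 - i];
}

/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
/* { dg-final { cleanup-tree-dump "vect" } } */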

Thanks,
Richard.

> Bingfeng Mei
> Broadcom UK
>
>
