Hi,
I created PR59544 and here is the patch. OK to commit?

Thanks,
Bingfeng
2013-12-18  Bingfeng Mei  <b...@broadcom.com>

	PR tree-optimization/59544
	* tree-vect-stmts.c (perm_mask_for_reverse): Move before
	vectorizable_store.
	(vectorizable_store): Handle negative step.

2013-12-18  Bingfeng Mei  <b...@broadcom.com>

	PR tree-optimization/59544
	* gcc.target/i386/pr59544.c: New test.

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Biener
Sent: 18 December 2013 11:47
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei <b...@broadcom.com> wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch.
> For the perm_mask_for_reverse function, should I move it before
> vectorizable_store or add a declaration?

Move it.

Richard.

> Bingfeng
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 18 December 2013 11:26
> To: Bingfeng Mei
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: Vectorization for store with negative step
>
> On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei <b...@broadcom.com> wrote:
>> Hi,
>> I was looking at some loops that can be vectorized by LLVM, but not by
>> GCC. One kind is a loop whose store has a negative step:
>>
>> void test1(short * __restrict__ x, short * __restrict__ y,
>>            short * __restrict__ z)
>> {
>>   int i;
>>   for (i = 127; i >= 0; i--) {
>>     x[i] = y[127-i] + z[127-i];
>>   }
>> }
>>
>> I don't know why GCC only implements negative step for loads, not for
>> stores. I implemented a patch, very similar to the code in
>> vectorizable_load.
>>
>> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>>
>> Without the patch:
>> test1:
>> .LFB0:
>> 	addq	$254, %rdi
>> 	xorl	%eax, %eax
>> 	.p2align 4,,10
>> 	.p2align 3
>> .L2:
>> 	movzwl	(%rsi,%rax), %ecx
>> 	subq	$2, %rdi
>> 	addw	(%rdx,%rax), %cx
>> 	addq	$2, %rax
>> 	movw	%cx, 2(%rdi)
>> 	cmpq	$256, %rax
>> 	jne	.L2
>> 	rep; ret
>>
>> With the patch:
>> test1:
>> .LFB0:
>> 	vmovdqa	.LC0(%rip), %xmm1
>> 	xorl	%eax, %eax
>> 	.p2align 4,,10
>> 	.p2align 3
>> .L2:
>> 	vmovdqu	(%rsi,%rax), %xmm0
>> 	movq	%rax, %rcx
>> 	negq	%rcx
>> 	vpaddw	(%rdx,%rax), %xmm0, %xmm0
>> 	vpshufb	%xmm1, %xmm0, %xmm0
>> 	addq	$16, %rax
>> 	cmpq	$256, %rax
>> 	vmovups	%xmm0, 240(%rdi,%rcx)
>> 	jne	.L2
>> 	rep; ret
>>
>> Performance is clearly improved here. The patch is bootstrapped for
>> x86_64-unknown-linux-gnu and introduces no additional regressions on
>> my machine.
>>
>> For reference, LLVM uses different instructions and generates slightly
>> worse code. (I am not very familiar with x86 assembly code; the patch
>> was originally written for our private port.)
>>
>> test1:                                  # @test1
>> 	.cfi_startproc
>> # BB#0:                                 # %entry
>> 	addq	$240, %rdi
>> 	xorl	%eax, %eax
>> 	.align	16, 0x90
>> .LBB0_1:                                # %vector.body
>>                                         # =>This Inner Loop Header: Depth=1
>> 	movdqu	(%rsi,%rax,2), %xmm0
>> 	movdqu	(%rdx,%rax,2), %xmm1
>> 	paddw	%xmm0, %xmm1
>> 	shufpd	$1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>> 	pshuflw	$27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>> 	pshufhw	$27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>> 	movdqu	%xmm0, (%rdi)
>> 	addq	$8, %rax
>> 	addq	$-16, %rdi
>> 	cmpq	$128, %rax
>> 	jne	.LBB0_1
>> # BB#2:                                 # %for.end
>> 	ret
>>
>> Any comment?
>
> Looks good to me. One of the various TODOs in vectorizable_store, I
> presume.
>
> Needs a testcase, and at this stage a bugreport that is fixed by it.
>
> Thanks,
> Richard.
>
>> Bingfeng Mei
>> Broadcom UK
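[Editor's note: conceptually, what vectorizable_store gains here is a reverse
permutation of each computed vector (the role of perm_mask_for_reverse)
followed by a store at a descending address. Below is a minimal hand-written
sketch of that idea using GCC's generic vector extensions; it is illustrative
only, not the patch itself — the function name is made up, and 16-byte
alignment of the arrays is assumed for brevity.]

#include <stdint.h>

/* Eight 16-bit lanes, matching the vpshufb/vmovups loop above.  */
typedef int16_t v8hi __attribute__ ((vector_size (16)));

/* Hand-vectorized equivalent of test1: x[i] = y[127-i] + z[127-i]
   for i = 127..0.  Load forward, add, reverse the lanes, then store
   at the mirrored address.  Assumes x, y and z are 16-byte aligned.  */
void
test1_by_hand (int16_t *__restrict__ x, int16_t *__restrict__ y,
               int16_t *__restrict__ z)
{
  const v8hi reverse = { 7, 6, 5, 4, 3, 2, 1, 0 };
  for (int i = 0; i < 128; i += 8)
    {
      v8hi sum = *(v8hi *) (y + i) + *(v8hi *) (z + i);
      /* rev[k] = sum[7-k], i.e. y[i+7-k] + z[i+7-k].  */
      v8hi rev = __builtin_shuffle (sum, reverse);
      /* Lane k belongs at x[120-i+k], so one vector store covers
         x[120-i] .. x[127-i].  */
      *(v8hi *) (x + 120 - i) = rev;
    }
}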
[Attachment: patch_vec_store]
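[Editor's note: the new test file is only named in the ChangeLog; its
contents are in the attached patch, not shown in the thread. As a rough,
hypothetical sketch of its likely shape, a DejaGnu test for this fix would
typically compile the loop and scan the vectorizer dump — the real
gcc.target/i386/pr59544.c may differ.]

/* Hypothetical sketch only; see the attached patch for the real test.  */
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -mavx -fdump-tree-vect-details" } */

void
test1 (short *__restrict__ x, short *__restrict__ y, short *__restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--)
    x[i] = y[127 - i] + z[127 - i];
}

/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */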