http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544
Bug ID: 59544
Summary: Vectorizing store with negative step
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

Created attachment 31467
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31467&action=edit
The patch against r206016

I was looking at some loops that can be vectorized by LLVM but not by GCC.
One type is a loop whose store has a negative step:

void test1(short * __restrict__ x, short * __restrict__ y,
           short * __restrict__ z)
{
  int i;

  for (i = 127; i >= 0; i--) {
    x[i] = y[127-i] + z[127-i];
  }
}

I don't know why GCC only implements negative step for loads but not for
stores. I implemented a patch (attached), very similar to the code in
vectorizable_load.

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without the patch:

test1:
.LFB0:
        addq    $254, %rdi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzwl  (%rsi,%rax), %ecx
        subq    $2, %rdi
        addw    (%rdx,%rax), %cx
        addq    $2, %rax
        movw    %cx, 2(%rdi)
        cmpq    $256, %rax
        jne     .L2
        rep; ret

With the patch:

test1:
.LFB0:
        vmovdqa .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovdqu (%rsi,%rax), %xmm0
        movq    %rax, %rcx
        negq    %rcx
        vpaddw  (%rdx,%rax), %xmm0, %xmm0
        vpshufb %xmm1, %xmm0, %xmm0
        addq    $16, %rax
        cmpq    $256, %rax
        vmovups %xmm0, 240(%rdi,%rcx)
        jne     .L2
        rep; ret

Performance is clearly improved here. The patch bootstraps on
x86_64-unknown-linux-gnu and shows no additional regressions on my machine.
The patch was originally written for our private port.

For reference, LLVM uses different instructions and generates slightly worse
code (I am not very familiar with x86 assembly):

test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret
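
To make the intended transformation concrete, here is a hand-written
intrinsics sketch of what the patched code generation amounts to for test1.
This is only an illustration of the idea, not code from the attached patch,
and the function name test1_by_hand is made up: add the contiguous loads from
y and z, reverse the eight 16-bit lanes with a byte shuffle (the vpshufb in
the patched output), and store at the mirrored position in x, which walks
backwards by one vector per iteration.

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

void test1_by_hand(short * __restrict__ x, short * __restrict__ y,
                   short * __restrict__ z)
{
  /* Byte shuffle mask that reverses the order of the eight 16-bit
     elements in a 128-bit register. */
  const __m128i rev = _mm_set_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                   9, 8, 11, 10, 13, 12, 15, 14);
  int j;

  for (j = 0; j < 128; j += 8) {
    __m128i a = _mm_loadu_si128((const __m128i *)(y + j));   /* y[j..j+7] */
    __m128i b = _mm_loadu_si128((const __m128i *)(z + j));   /* z[j..j+7] */
    __m128i s = _mm_add_epi16(a, b);
    s = _mm_shuffle_epi8(s, rev);             /* reverse element order */
    /* Destination is x[120-j .. 127-j], i.e. it steps backwards. */
    _mm_storeu_si128((__m128i *)(x + 120 - j), s);
  }
}

Compiled with -O2 -mssse3 this should produce essentially the same
reversed-store loop as the patched GCC output above (the -mavx version just
uses the VEX-encoded forms of the same instructions).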