Re: [PATCH] Vectorization for store with negative step

2013-12-31 Thread Tejas Belagod

Bingfeng Mei wrote:

Hi,
I created PR59544 and here is the patch. OK to commit? 


Thanks,
Bingfeng


2013-12-18  Bingfeng Mei  

PR tree-optimization/59544
 * tree-vect-stmts.c (perm_mask_for_reverse): Move before
   vectorizable_store. (vectorizable_store): Handle negative step.

2013-12-18  Bingfeng Mei  

PR tree-optimization/59544
* gcc.target/i386/pr59544.c: New test



Hi Bingfeng,

Your patch seems to have a dependence calculation bug (I think) due to which 
gcc.dg/torture/pr52943.c regresses on aarch64. I've raised 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651.
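
The failure is presumably in the dependence distance computed for the
reversed store: with a negative-step store the vectorizer must still prove
that the store does not overlap other accesses in the loop. A hypothetical
reduction of that shape (assumed for illustration, not pr52943 itself):

/* Hypothetical example: the store to a[] runs backwards while the loads
   run forwards, so a wrong dependence distance for the negative-step
   store lets the vectorized loop read already-clobbered elements.  */
void f (int *a, int n)
{
  int i;
  for (i = n - 1; i >= 0; i--)
    a[i] = a[n - 1 - i] + 1;
}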


Do you think you could have a look?

Thanks,
Tejas.


-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Richard Biener
Sent: 18 December 2013 11:47
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei  wrote:

Thanks, Richard. I will file a bug report and prepare a complete patch. For 
the perm_mask_for_reverse function, should I move it before vectorizable_store 
or add a declaration?


Move it.

Richard.


Bingfeng
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 18 December 2013 11:26
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei  wrote:

Hi,
I was looking at some loops that can be vectorized by LLVM, but not GCC. One 
such type is a loop containing a store with a negative step.

void test1(short * __restrict__ x, short * __restrict__ y,
           short * __restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--) {
    x[i] = y[127-i] + z[127-i];
  }
}

I don't know why GCC implements negative step only for loads, not for stores.
I implemented a patch, very similar to the code in vectorizable_load.
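
Conceptually, the store side mirrors the existing load side: compute each
vector in forward order, reverse its elements with a permute, and store the
result at the low end of the addressed range. A minimal sketch of that
transformation using GCC vector extensions (illustrative only -- the fixed
V8HI width and the helper name are assumptions, and the alignment checks
the patch performs are omitted):

typedef short v8hi __attribute__ ((vector_size (16)));

void test1_sketch (short *__restrict__ x, short *__restrict__ y,
                   short *__restrict__ z)
{
  int i;
  for (i = 0; i < 128; i += 8)
    {
      /* Forward vector loads, as for any positive-step access.  */
      v8hi vy = *(v8hi *) (y + i);
      v8hi vz = *(v8hi *) (z + i);
      v8hi sum = vy + vz;
      /* Reverse the element order -- the mask perm_mask_for_reverse
         builds for the VEC_PERM_EXPR.  */
      v8hi rev = __builtin_shuffle (sum, (v8hi) { 7, 6, 5, 4, 3, 2, 1, 0 });
      /* Store at the low end of the reversed group, x[127-i-7..127-i],
         i.e. the scalar address biased by -(nunits - 1) elements.  */
      *(v8hi *) (x + 127 - i - 7) = rev;
    }
}

The vpshufb in the "with patch" listing below is exactly this reverse
permute.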

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
        addq    $254, %rdi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzwl  (%rsi,%rax), %ecx
        subq    $2, %rdi
        addw    (%rdx,%rax), %cx
        addq    $2, %rax
        movw    %cx, 2(%rdi)
        cmpq    $256, %rax
        jne     .L2
        rep; ret

With patch:
test1:
.LFB0:
        vmovdqa .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovdqu (%rsi,%rax), %xmm0
        movq    %rax, %rcx
        negq    %rcx
        vpaddw  (%rdx,%rax), %xmm0, %xmm0
        vpshufb %xmm1, %xmm0, %xmm0
        addq    $16, %rax
        cmpq    $256, %rax
        vmovups %xmm0, 240(%rdi,%rcx)
        jne     .L2
        rep; ret

Performance is definitely improved here. The patch bootstraps on 
x86_64-unknown-linux-gnu and introduces no additional regressions on my machine.

For reference, LLVM seems to use different instructions and produce slightly 
worse code. I am not so familiar with x86 assembly code. The patch was 
originally written for our private port.
test1:  # @test1
        .cfi_startproc
# BB#0: # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:        # %vector.body
        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2: # %for.end
        ret

Any comment?

Looks good to me.  One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.


Bingfeng Mei
Broadcom UK



Re: [PING] RE: [PATCH] Vectorization for store with negative step

2013-12-20 Thread H.J. Lu
On Fri, Dec 20, 2013 at 2:09 AM, Bingfeng Mei  wrote:
> OK to commit?
>
> Thanks,
> Bingfeng
> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
> Behalf Of Bingfeng Mei
> Sent: 18 December 2013 16:25
> To: Jakub Jelinek
> Cc: Richard Biener; gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH] Vectorization for store with negative step
>
> Hi, Jakub,
> Sorry for all the formatting issues. I haven't submitted a patch in a while :-).
> Please find the updated patch.
>

It caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59569

-- 
H.J.


Re: [PING] RE: [PATCH] Vectorization for store with negative step

2013-12-20 Thread Richard Biener
On Fri, Dec 20, 2013 at 11:09 AM, Bingfeng Mei  wrote:
> OK to commit?

Ok.

Thanks,
Richard.

> Thanks,
> Bingfeng
> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
> Behalf Of Bingfeng Mei
> Sent: 18 December 2013 16:25
> To: Jakub Jelinek
> Cc: Richard Biener; gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH] Vectorization for store with negative step
>
> Hi, Jakub,
> Sorry for all the formatting issues. I haven't submitted a patch in a while :-).
> Please find the updated patch.
>
> Thanks,
> Bingfeng
>
> -Original Message-
> From: Jakub Jelinek [mailto:ja...@redhat.com]
> Sent: 18 December 2013 13:38
> To: Bingfeng Mei
> Cc: Richard Biener; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Vectorization for store with negative step
>
> On Wed, Dec 18, 2013 at 01:31:05PM +, Bingfeng Mei wrote:
> Index: gcc/ChangeLog
> ===
> --- gcc/ChangeLog   (revision 206016)
> +++ gcc/ChangeLog   (working copy)
> @@ -1,3 +1,9 @@
> +2013-12-18  Bingfeng Mei  
> +
> +   PR tree-optimization/59544
> +* tree-vect-stmts.c (perm_mask_for_reverse): Move before
>
> This should be a tab instead of 8 spaces.
>
> +   vectorizable_store. (vectorizable_store): Handle negative step.
>
> Newline and tab after "store.", rather than space.
>
> Property changes on: gcc/testsuite/gcc.target/i386/pr59544.c
> ___
> Added: svn:executable
>+ *
>
> Please don't add such bogus property.  Testcases aren't executable.
>
> Index: gcc/testsuite/ChangeLog
> ===
> --- gcc/testsuite/ChangeLog (revision 206016)
> +++ gcc/testsuite/ChangeLog (working copy)
> @@ -1,3 +1,8 @@
> +2013-12-18  Bingfeng Mei  
> +
> +   PR tree-optimization/59544
> +   * gcc.target/i386/pr59544.c: New test
>
> Missing dot at the end of line.
> +
>  2013-12-16  Jakub Jelinek  
>
> PR middle-end/58956
> Index: gcc/tree-vect-stmts.c
> ===
> --- gcc/tree-vect-stmts.c   (revision 206016)
> +++ gcc/tree-vect-stmts.c   (working copy)
> @@ -4859,6 +4859,25 @@ ensure_base_align (stmt_vec_info stmt_in
>  }
>
>
> +/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
> +   reversal of the vector elements.  If that is impossible to do,
> +   returns NULL.  */
> +
> +static tree
> +perm_mask_for_reverse (tree vectype)
> +{
> +  int i, nunits;
> +  unsigned char *sel;
> +
> +  nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  sel = XALLOCAVEC (unsigned char, nunits);
> +
> +  for (i = 0; i < nunits; ++i)
> +    sel[i] = nunits - 1 - i;
> +
> +  return vect_gen_perm_mask (vectype, sel);
> +}
> +
>  /* Function vectorizable_store.
>
> Check if STMT defines a non scalar data-ref (array/pointer/structure) that
> @@ -4902,6 +4921,8 @@ vectorizable_store (gimple stmt, gimple_
>   vec<tree> oprnds = vNULL;
>   vec<tree> result_chain = vNULL;
>   bool inv_p;
> +  bool negative = false;
> +  tree offset = NULL_TREE;
>   vec<tree> vec_oprnds = vNULL;
>bool slp = (slp_node != NULL);
>unsigned int vec_num;
> @@ -4976,16 +4997,38 @@ vectorizable_store (gimple stmt, gimple_
>if (!STMT_VINFO_DATA_REF (stmt_info))
>  return false;
>
> -  if (tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
> -   ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
> -   size_zero_node) < 0)
> +  negative = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
> +? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
> +size_zero_node) < 0;
>
> The formatting looks wrong, do:
>   negative
> = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
> ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
> size_zero_node) < 0;
> instead.
>
> +  if (negative && ncopies > 1)
>  {
>if (dump_enabled_p ())
>  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> - "negative step for store.\n");
> + "multiple types with negative step.");
>return false;
>  }
>
> +  if (negative)
> +{
> +  gcc_assert (!grouped_store);
> +  alignment_support_scheme = vect

[PING] RE: [PATCH] Vectorization for store with negative step

2013-12-20 Thread Bingfeng Mei
OK to commit? 

Thanks,
Bingfeng
-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Bingfeng Mei
Sent: 18 December 2013 16:25
To: Jakub Jelinek
Cc: Richard Biener; gcc-patches@gcc.gnu.org
Subject: RE: [PATCH] Vectorization for store with negative step

Hi, Jakub,
Sorry for all the formatting issues. I haven't submitted a patch in a while :-).
Please find the updated patch. 

Thanks,
Bingfeng

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: 18 December 2013 13:38
To: Bingfeng Mei
Cc: Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Vectorization for store with negative step

On Wed, Dec 18, 2013 at 01:31:05PM +, Bingfeng Mei wrote:
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 206016)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,9 @@
+2013-12-18  Bingfeng Mei  
+
+   PR tree-optimization/59544
+* tree-vect-stmts.c (perm_mask_for_reverse): Move before

This should be a tab instead of 8 spaces.

+   vectorizable_store. (vectorizable_store): Handle negative step.

Newline and tab after "store.", rather than space.

Property changes on: gcc/testsuite/gcc.target/i386/pr59544.c
___
Added: svn:executable
   + *

Please don't add such bogus property.  Testcases aren't executable.

Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 206016)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-12-18  Bingfeng Mei  
+
+   PR tree-optimization/59544
+   * gcc.target/i386/pr59544.c: New test

Missing dot at the end of line.
+
 2013-12-16  Jakub Jelinek  
 
PR middle-end/58956
Index: gcc/tree-vect-stmts.c
===
--- gcc/tree-vect-stmts.c   (revision 206016)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -4859,6 +4859,25 @@ ensure_base_align (stmt_vec_info stmt_in
 }
 
 
+/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
+   reversal of the vector elements.  If that is impossible to do,
+   returns NULL.  */
+
+static tree
+perm_mask_for_reverse (tree vectype)
+{
+  int i, nunits;
+  unsigned char *sel;
+
+  nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  sel = XALLOCAVEC (unsigned char, nunits);
+
+  for (i = 0; i < nunits; ++i)
+    sel[i] = nunits - 1 - i;
+
+  return vect_gen_perm_mask (vectype, sel);
+}
+
 /* Function vectorizable_store.
 
Check if STMT defines a non scalar data-ref (array/pointer/structure) that
@@ -4902,6 +4921,8 @@ vectorizable_store (gimple stmt, gimple_
   vec<tree> oprnds = vNULL;
   vec<tree> result_chain = vNULL;
   bool inv_p;
+  bool negative = false;
+  tree offset = NULL_TREE;
   vec<tree> vec_oprnds = vNULL;
   bool slp = (slp_node != NULL);
   unsigned int vec_num;
@@ -4976,16 +4997,38 @@ vectorizable_store (gimple stmt, gimple_
   if (!STMT_VINFO_DATA_REF (stmt_info))
 return false;
 
-  if (tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
-   ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
-   size_zero_node) < 0)
+  negative = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
+? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
+size_zero_node) < 0;

The formatting looks wrong, do:
  negative
= tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
size_zero_node) < 0;
instead.

+  if (negative && ncopies > 1)
 {
   if (dump_enabled_p ())
 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "negative step for store.\n");
+ "multiple types with negative step.");
   return false;
 }
 
+  if (negative)
+{
+  gcc_assert (!grouped_store);
+  alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
+  if (alignment_support_scheme != dr_aligned
+  && alignment_support_scheme != dr_unaligned_supported)

Lots of places where you use 8 spaces instead of tab, please fix.
+offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);
+
   if (store_lanes_p)
 aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
   else
@@ -5200,7 +5246,7 @@ vectorizable_store (gimple stmt, gimple_
dataref_ptr
  = vect_create_data_ref_ptr (first_stmt, aggr_type,
  simd_lane_access_p ? loop : NULL,
- NULL_TR

RE: [PATCH] Vectorization for store with negative step

2013-12-18 Thread Bingfeng Mei
Hi, Jakub,
Sorry for all the formatting issues. I haven't submitted a patch in a while :-).
Please find the updated patch. 

Thanks,
Bingfeng

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: 18 December 2013 13:38
To: Bingfeng Mei
Cc: Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Vectorization for store with negative step

On Wed, Dec 18, 2013 at 01:31:05PM +, Bingfeng Mei wrote:
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 206016)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,9 @@
+2013-12-18  Bingfeng Mei  
+
+   PR tree-optimization/59544
+* tree-vect-stmts.c (perm_mask_for_reverse): Move before

This should be a tab instead of 8 spaces.

+   vectorizable_store. (vectorizable_store): Handle negative step.

Newline and tab after "store.", rather than space.

Property changes on: gcc/testsuite/gcc.target/i386/pr59544.c
___
Added: svn:executable
   + *

Please don't add such bogus property.  Testcases aren't executable.

Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 206016)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-12-18  Bingfeng Mei  
+
+   PR tree-optimization/59544
+   * gcc.target/i386/pr59544.c: New test

Missing dot at the end of line.
+
 2013-12-16  Jakub Jelinek  
 
PR middle-end/58956
Index: gcc/tree-vect-stmts.c
===
--- gcc/tree-vect-stmts.c   (revision 206016)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -4859,6 +4859,25 @@ ensure_base_align (stmt_vec_info stmt_in
 }
 
 
+/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
+   reversal of the vector elements.  If that is impossible to do,
+   returns NULL.  */
+
+static tree
+perm_mask_for_reverse (tree vectype)
+{
+  int i, nunits;
+  unsigned char *sel;
+
+  nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  sel = XALLOCAVEC (unsigned char, nunits);
+
+  for (i = 0; i < nunits; ++i)
+    sel[i] = nunits - 1 - i;
+
+  return vect_gen_perm_mask (vectype, sel);
+}
+
 /* Function vectorizable_store.
 
Check if STMT defines a non scalar data-ref (array/pointer/structure) that
@@ -4902,6 +4921,8 @@ vectorizable_store (gimple stmt, gimple_
   vec<tree> oprnds = vNULL;
   vec<tree> result_chain = vNULL;
   bool inv_p;
+  bool negative = false;
+  tree offset = NULL_TREE;
   vec<tree> vec_oprnds = vNULL;
   bool slp = (slp_node != NULL);
   unsigned int vec_num;
@@ -4976,16 +4997,38 @@ vectorizable_store (gimple stmt, gimple_
   if (!STMT_VINFO_DATA_REF (stmt_info))
 return false;
 
-  if (tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
-   ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
-   size_zero_node) < 0)
+  negative = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
+? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
+size_zero_node) < 0;

The formatting looks wrong, do:
  negative
= tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
size_zero_node) < 0;
instead.

+  if (negative && ncopies > 1)
 {
   if (dump_enabled_p ())
 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "negative step for store.\n");
+ "multiple types with negative step.");
   return false;
 }
 
+  if (negative)
+{
+  gcc_assert (!grouped_store);
+  alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
+  if (alignment_support_scheme != dr_aligned
+  && alignment_support_scheme != dr_unaligned_supported)

Lots of places where you use 8 spaces instead of tab, please fix.
+offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);
+
   if (store_lanes_p)
 aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
   else
@@ -5200,7 +5246,7 @@ vectorizable_store (gimple stmt, gimple_
dataref_ptr
  = vect_create_data_ref_ptr (first_stmt, aggr_type,
  simd_lane_access_p ? loop : NULL,
- NULL_TREE, &dummy, gsi, &ptr_incr,
+ offset, &dummy, gsi, &ptr_incr,
  simd_lane_access_p, &inv_p);
  gcc_assert (bb_vinfo || !inv_p);
}
@@ -5306,6 +5352,21 @@ vectorizable_store (gimple stmt, gimple_
set_ptr_info_align

Re: [PATCH] Vectorization for store with negative step

2013-12-18 Thread Jakub Jelinek
On Wed, Dec 18, 2013 at 01:31:05PM +, Bingfeng Mei wrote:
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 206016)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,9 @@
+2013-12-18  Bingfeng Mei  
+
+   PR tree-optimization/59544
+* tree-vect-stmts.c (perm_mask_for_reverse): Move before

This should be a tab instead of 8 spaces.

+   vectorizable_store. (vectorizable_store): Handle negative step.

Newline and tab after "store.", rather than space.

Property changes on: gcc/testsuite/gcc.target/i386/pr59544.c
___
Added: svn:executable
   + *

Please don't add such bogus property.  Testcases aren't executable.

Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 206016)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-12-18  Bingfeng Mei  
+
+   PR tree-optimization/59544
+   * gcc.target/i386/pr59544.c: New test

Missing dot at the end of line.
+
 2013-12-16  Jakub Jelinek  
 
PR middle-end/58956
Index: gcc/tree-vect-stmts.c
===
--- gcc/tree-vect-stmts.c   (revision 206016)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -4859,6 +4859,25 @@ ensure_base_align (stmt_vec_info stmt_in
 }
 
 
+/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
+   reversal of the vector elements.  If that is impossible to do,
+   returns NULL.  */
+
+static tree
+perm_mask_for_reverse (tree vectype)
+{
+  int i, nunits;
+  unsigned char *sel;
+
+  nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  sel = XALLOCAVEC (unsigned char, nunits);
+
+  for (i = 0; i < nunits; ++i)
+    sel[i] = nunits - 1 - i;
+
+  return vect_gen_perm_mask (vectype, sel);
+}
+
 /* Function vectorizable_store.
 
Check if STMT defines a non scalar data-ref (array/pointer/structure) that
@@ -4902,6 +4921,8 @@ vectorizable_store (gimple stmt, gimple_
   vec<tree> oprnds = vNULL;
   vec<tree> result_chain = vNULL;
   bool inv_p;
+  bool negative = false;
+  tree offset = NULL_TREE;
   vec<tree> vec_oprnds = vNULL;
   bool slp = (slp_node != NULL);
   unsigned int vec_num;
@@ -4976,16 +4997,38 @@ vectorizable_store (gimple stmt, gimple_
   if (!STMT_VINFO_DATA_REF (stmt_info))
 return false;
 
-  if (tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
-   ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
-   size_zero_node) < 0)
+  negative = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
+? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
+size_zero_node) < 0;

The formatting looks wrong, do:
  negative
= tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
size_zero_node) < 0;
instead.

+  if (negative && ncopies > 1)
 {
   if (dump_enabled_p ())
 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "negative step for store.\n");
+ "multiple types with negative step.");
   return false;
 }
 
+  if (negative)
+{
+  gcc_assert (!grouped_store);
+  alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
+  if (alignment_support_scheme != dr_aligned
+  && alignment_support_scheme != dr_unaligned_supported)

Lots of places where you use 8 spaces instead of tab, please fix.
+offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);
+
   if (store_lanes_p)
 aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
   else
@@ -5200,7 +5246,7 @@ vectorizable_store (gimple stmt, gimple_
dataref_ptr
  = vect_create_data_ref_ptr (first_stmt, aggr_type,
  simd_lane_access_p ? loop : NULL,
- NULL_TREE, &dummy, gsi, &ptr_incr,
+ offset, &dummy, gsi, &ptr_incr,
  simd_lane_access_p, &inv_p);
  gcc_assert (bb_vinfo || !inv_p);
}
@@ -5306,6 +5352,21 @@ vectorizable_store (gimple stmt, gimple_
set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
misalign);
 
+ if (negative)
+{
+  tree perm_mask = perm_mask_for_reverse (vectype);
+  tree perm_dest = vect_create_destination_var (gimple_assign_rhs1 (stmt),
+						 vectype);
+  tree new_temp = make_ssa_name (perm_dest, NULL);
+
+  /* Generate the permute statement.  */
+  gimple perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR,
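
To make the two new pieces above concrete: for V8HI, TYPE_VECTOR_SUBPARTS is
8, so the offset passed down to vect_create_data_ref_ptr is -(8 - 1) = -7
elements, and the selector built by perm_mask_for_reverse is
{ 7, 6, 5, 4, 3, 2, 1, 0 }. A small self-contained check of that arithmetic
(illustrative values, not code from the patch):

#include <assert.h>

int main (void)
{
  int nunits = 8;             /* TYPE_VECTOR_SUBPARTS (V8HI)    */
  int offset = -nunits + 1;   /* as in the patch: -(nunits - 1) */
  unsigned char sel[8];
  int i;

  /* The reversal selector perm_mask_for_reverse computes.  */
  for (i = 0; i < nunits; ++i)
    sel[i] = nunits - 1 - i;
  assert (sel[0] == 7 && sel[7] == 0);

  /* A scalar store to x[127] becomes a vector store covering
     x[120..127], so the biased pointer starts at element 120.  */
  assert (127 + offset == 120);
  return 0;
}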

[PATCH] Vectorization for store with negative step

2013-12-18 Thread Bingfeng Mei
Hi,
I created PR59544 and here is the patch. OK to commit? 

Thanks,
Bingfeng


2013-12-18  Bingfeng Mei  

PR tree-optimization/59544
 * tree-vect-stmts.c (perm_mask_for_reverse): Move before
   vectorizable_store. (vectorizable_store): Handle negative step.

2013-12-18  Bingfeng Mei  

PR tree-optimization/59544
* gcc.target/i386/pr59544.c: New test

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Richard Biener
Sent: 18 December 2013 11:47
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei  wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch. For 
> the perm_mask_for_reverse function, should I move it before 
> vectorizable_store or add a declaration?

Move it.

Richard.

>
> Bingfeng
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 18 December 2013 11:26
> To: Bingfeng Mei
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: Vectorization for store with negative step
>
> On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei  wrote:
>> Hi,
>> I was looking at some loops that can be vectorized by LLVM, but not GCC. One 
>> such type is a loop containing a store with a negative step.
>>
>> void test1(short * __restrict__ x, short * __restrict__ y,
>>            short * __restrict__ z)
>> {
>>   int i;
>>   for (i = 127; i >= 0; i--) {
>>     x[i] = y[127-i] + z[127-i];
>>   }
>> }
>>
>> I don't know why GCC implements negative step only for loads, not for stores.
>> I implemented a patch, very similar to the code in vectorizable_load.
>>
>> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>>
>> Without patch:
>> test1:
>> .LFB0:
>>         addq    $254, %rdi
>>         xorl    %eax, %eax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         movzwl  (%rsi,%rax), %ecx
>>         subq    $2, %rdi
>>         addw    (%rdx,%rax), %cx
>>         addq    $2, %rax
>>         movw    %cx, 2(%rdi)
>>         cmpq    $256, %rax
>>         jne     .L2
>>         rep; ret
>>
>> With patch:
>> test1:
>> .LFB0:
>>         vmovdqa .LC0(%rip), %xmm1
>>         xorl    %eax, %eax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vmovdqu (%rsi,%rax), %xmm0
>>         movq    %rax, %rcx
>>         negq    %rcx
>>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>>         vpshufb %xmm1, %xmm0, %xmm0
>>         addq    $16, %rax
>>         cmpq    $256, %rax
>>         vmovups %xmm0, 240(%rdi,%rcx)
>>         jne     .L2
>>         rep; ret
>>
>> Performance is definitely improved here. The patch bootstraps on 
>> x86_64-unknown-linux-gnu and introduces no additional regressions on my machine.
>>
>> For reference, LLVM seems to use different instructions and produce slightly 
>> worse code. I am not so familiar with x86 assembly code. The patch was 
>> originally written for our private port.
>> test1:  # @test1
>>         .cfi_startproc
>> # BB#0: # %entry
>>         addq    $240, %rdi
>>         xorl    %eax, %eax
>>         .align  16, 0x90
>> .LBB0_1:        # %vector.body
>>         # =>This Inner Loop Header: Depth=1
>>         movdqu  (%rsi,%rax,2), %xmm0
>>         movdqu  (%rdx,%rax,2), %xmm1
>>         paddw   %xmm0, %xmm1
>>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>>         movdqu  %xmm0, (%rdi)
>>         addq    $8, %rax
>>         addq    $-16, %rdi
>>         cmpq    $128, %rax
>>         jne     .LBB0_1
>> # BB#2: # %for.end
>>         ret
>>
>> Any comment?
>
> Looks good to me.  One of the various TODOs in vectorizable_store I presume.
>
> Needs a testcase and at this stage a bugreport that is fixed by it.
>
> Thanks,
> Richard.
>
>> Bingfeng Mei
>> Broadcom UK
>>
>>


Attachment: patch_vec_store


Re: Vectorization for store with negative step

2013-12-18 Thread Richard Biener
On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei  wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch. For 
> the perm_mask_for_reverse function, should I move it before 
> vectorizable_store or add a declaration?

Move it.

Richard.

>
> Bingfeng
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 18 December 2013 11:26
> To: Bingfeng Mei
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: Vectorization for store with negative step
>
> On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei  wrote:
>> Hi,
>> I was looking at some loops that can be vectorized by LLVM, but not GCC. One 
>> such type is a loop containing a store with a negative step.
>>
>> void test1(short * __restrict__ x, short * __restrict__ y,
>>            short * __restrict__ z)
>> {
>>   int i;
>>   for (i = 127; i >= 0; i--) {
>>     x[i] = y[127-i] + z[127-i];
>>   }
>> }
>>
>> I don't know why GCC implements negative step only for loads, not for stores.
>> I implemented a patch, very similar to the code in vectorizable_load.
>>
>> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>>
>> Without patch:
>> test1:
>> .LFB0:
>>         addq    $254, %rdi
>>         xorl    %eax, %eax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         movzwl  (%rsi,%rax), %ecx
>>         subq    $2, %rdi
>>         addw    (%rdx,%rax), %cx
>>         addq    $2, %rax
>>         movw    %cx, 2(%rdi)
>>         cmpq    $256, %rax
>>         jne     .L2
>>         rep; ret
>>
>> With patch:
>> test1:
>> .LFB0:
>>         vmovdqa .LC0(%rip), %xmm1
>>         xorl    %eax, %eax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vmovdqu (%rsi,%rax), %xmm0
>>         movq    %rax, %rcx
>>         negq    %rcx
>>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>>         vpshufb %xmm1, %xmm0, %xmm0
>>         addq    $16, %rax
>>         cmpq    $256, %rax
>>         vmovups %xmm0, 240(%rdi,%rcx)
>>         jne     .L2
>>         rep; ret
>>
>> Performance is definitely improved here. The patch bootstraps on 
>> x86_64-unknown-linux-gnu and introduces no additional regressions on my machine.
>>
>> For reference, LLVM seems to use different instructions and produce slightly 
>> worse code. I am not so familiar with x86 assembly code. The patch was 
>> originally written for our private port.
>> test1:  # @test1
>>         .cfi_startproc
>> # BB#0: # %entry
>>         addq    $240, %rdi
>>         xorl    %eax, %eax
>>         .align  16, 0x90
>> .LBB0_1:        # %vector.body
>>         # =>This Inner Loop Header: Depth=1
>>         movdqu  (%rsi,%rax,2), %xmm0
>>         movdqu  (%rdx,%rax,2), %xmm1
>>         paddw   %xmm0, %xmm1
>>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>>         movdqu  %xmm0, (%rdi)
>>         addq    $8, %rax
>>         addq    $-16, %rdi
>>         cmpq    $128, %rax
>>         jne     .LBB0_1
>> # BB#2: # %for.end
>>         ret
>>
>> Any comment?
>
> Looks good to me.  One of the various TODOs in vectorizable_store I presume.
>
> Needs a testcase and at this stage a bugreport that is fixed by it.
>
> Thanks,
> Richard.
>
>> Bingfeng Mei
>> Broadcom UK
>>
>>


RE: Vectorization for store with negative step

2013-12-18 Thread Bingfeng Mei
Thanks, Richard. I will file a bug report and prepare a complete patch. For 
the perm_mask_for_reverse function, should I move it before vectorizable_store 
or add a declaration?


Bingfeng
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: 18 December 2013 11:26
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei  wrote:
> Hi,
> I was looking at some loops that can be vectorized by LLVM, but not GCC. One 
> such type is a loop containing a store with a negative step.
>
> void test1(short * __restrict__ x, short * __restrict__ y,
>            short * __restrict__ z)
> {
>   int i;
>   for (i = 127; i >= 0; i--) {
>     x[i] = y[127-i] + z[127-i];
>   }
> }
>
> I don't know why GCC implements negative step only for loads, not for stores.
> I implemented a patch, very similar to the code in vectorizable_load.
>
> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>
> Without patch:
> test1:
> .LFB0:
>         addq    $254, %rdi
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         movzwl  (%rsi,%rax), %ecx
>         subq    $2, %rdi
>         addw    (%rdx,%rax), %cx
>         addq    $2, %rax
>         movw    %cx, 2(%rdi)
>         cmpq    $256, %rax
>         jne     .L2
>         rep; ret
>
> With patch:
> test1:
> .LFB0:
>         vmovdqa .LC0(%rip), %xmm1
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vmovdqu (%rsi,%rax), %xmm0
>         movq    %rax, %rcx
>         negq    %rcx
>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>         vpshufb %xmm1, %xmm0, %xmm0
>         addq    $16, %rax
>         cmpq    $256, %rax
>         vmovups %xmm0, 240(%rdi,%rcx)
>         jne     .L2
>         rep; ret
>
> Performance is definitely improved here. The patch bootstraps on 
> x86_64-unknown-linux-gnu and introduces no additional regressions on my machine.
>
> For reference, LLVM seems to use different instructions and produce slightly 
> worse code. I am not so familiar with x86 assembly code. The patch was 
> originally written for our private port.
> test1:  # @test1
>         .cfi_startproc
> # BB#0: # %entry
>         addq    $240, %rdi
>         xorl    %eax, %eax
>         .align  16, 0x90
> .LBB0_1:        # %vector.body
>         # =>This Inner Loop Header: Depth=1
>         movdqu  (%rsi,%rax,2), %xmm0
>         movdqu  (%rdx,%rax,2), %xmm1
>         paddw   %xmm0, %xmm1
>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>         movdqu  %xmm0, (%rdi)
>         addq    $8, %rax
>         addq    $-16, %rdi
>         cmpq    $128, %rax
>         jne     .LBB0_1
> # BB#2: # %for.end
>         ret
>
> Any comment?

Looks good to me.  One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.

> Bingfeng Mei
> Broadcom UK
>
>


Re: Vectorization for store with negative step

2013-12-18 Thread Richard Biener
On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei  wrote:
> Hi,
> I was looking at some loops that can be vectorized by LLVM, but not GCC. One 
> such type is a loop containing a store with a negative step.
>
> void test1(short * __restrict__ x, short * __restrict__ y,
>            short * __restrict__ z)
> {
>   int i;
>   for (i = 127; i >= 0; i--) {
>     x[i] = y[127-i] + z[127-i];
>   }
> }
>
> I don't know why GCC implements negative step only for loads, not for stores.
> I implemented a patch, very similar to the code in vectorizable_load.
>
> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>
> Without patch:
> test1:
> .LFB0:
>         addq    $254, %rdi
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         movzwl  (%rsi,%rax), %ecx
>         subq    $2, %rdi
>         addw    (%rdx,%rax), %cx
>         addq    $2, %rax
>         movw    %cx, 2(%rdi)
>         cmpq    $256, %rax
>         jne     .L2
>         rep; ret
>
> With patch:
> test1:
> .LFB0:
>         vmovdqa .LC0(%rip), %xmm1
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vmovdqu (%rsi,%rax), %xmm0
>         movq    %rax, %rcx
>         negq    %rcx
>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>         vpshufb %xmm1, %xmm0, %xmm0
>         addq    $16, %rax
>         cmpq    $256, %rax
>         vmovups %xmm0, 240(%rdi,%rcx)
>         jne     .L2
>         rep; ret
>
> Performance is definitely improved here. The patch bootstraps on 
> x86_64-unknown-linux-gnu and introduces no additional regressions on my machine.
>
> For reference, LLVM seems to use different instructions and produce slightly 
> worse code. I am not so familiar with x86 assembly code. The patch was 
> originally written for our private port.
> test1:  # @test1
>         .cfi_startproc
> # BB#0: # %entry
>         addq    $240, %rdi
>         xorl    %eax, %eax
>         .align  16, 0x90
> .LBB0_1:        # %vector.body
>         # =>This Inner Loop Header: Depth=1
>         movdqu  (%rsi,%rax,2), %xmm0
>         movdqu  (%rdx,%rax,2), %xmm1
>         paddw   %xmm0, %xmm1
>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>         movdqu  %xmm0, (%rdi)
>         addq    $8, %rax
>         addq    $-16, %rdi
>         cmpq    $128, %rax
>         jne     .LBB0_1
> # BB#2: # %for.end
>         ret
>
> Any comment?

Looks good to me.  One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.

> Bingfeng Mei
> Broadcom UK
>
>


Vectorization for store with negative step

2013-12-16 Thread Bingfeng Mei
Hi, 
I was looking at some loops that can be vectorized by LLVM, but not GCC. One 
such type is a loop containing a store with a negative step.

void test1(short * __restrict__ x, short * __restrict__ y,
           short * __restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--) {
    x[i] = y[127-i] + z[127-i];
  }
}

I don't know why GCC implements negative step only for loads, not for stores.
I implemented a patch, very similar to the code in vectorizable_load.

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
        addq    $254, %rdi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzwl  (%rsi,%rax), %ecx
        subq    $2, %rdi
        addw    (%rdx,%rax), %cx
        addq    $2, %rax
        movw    %cx, 2(%rdi)
        cmpq    $256, %rax
        jne     .L2
        rep; ret

With patch:
test1:
.LFB0:
        vmovdqa .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovdqu (%rsi,%rax), %xmm0
        movq    %rax, %rcx
        negq    %rcx
        vpaddw  (%rdx,%rax), %xmm0, %xmm0
        vpshufb %xmm1, %xmm0, %xmm0
        addq    $16, %rax
        cmpq    $256, %rax
        vmovups %xmm0, 240(%rdi,%rcx)
        jne     .L2
        rep; ret

Performance is definitely improved here. The patch bootstraps on 
x86_64-unknown-linux-gnu and introduces no additional regressions on my machine.

For reference, LLVM seems to use different instructions and produce slightly 
worse code. I am not so familiar with x86 assembly code. The patch was 
originally written for our private port.
test1:  # @test1
        .cfi_startproc
# BB#0: # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:        # %vector.body
        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2: # %for.end
        ret

Any comment? 

Bingfeng Mei
Broadcom UK




Attachment: patch