Re: [PATCH] Vectorization for store with negative step
Bingfeng Mei wrote:

Hi, I created PR59544 and here is the patch. OK to commit?

Thanks,
Bingfeng

2013-12-18  Bingfeng Mei

	PR tree-optimization/59544
	* tree-vect-stmts.c (perm_mask_for_reverse): Move before
	vectorizable_store.
	(vectorizable_store): Handle negative step.

2013-12-18  Bingfeng Mei

	PR tree-optimization/59544
	* gcc.target/i386/pr59544.c: New test.

Hi Bingfeng,

Your patch seems to have a dependence calculation bug (I think) due to which
gcc.dg/torture/pr52943.c regresses on aarch64. I've raised
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651. Do you think you could
have a look?

Thanks,
Tejas.

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Biener
Sent: 18 December 2013 11:47
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei wrote:

Thanks, Richard. I will file a bug report and prepare a complete patch. For
the perm_mask_for_reverse function, should I move it before
vectorizable_store or add a declaration?

Move it.

Richard.

Bingfeng
-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 18 December 2013 11:26
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei wrote:

Hi,
I was looking at some loops that can be vectorized by LLVM, but not GCC. One
type of loop is a store with negative step.

void test1 (short *__restrict__ x, short *__restrict__ y,
            short *__restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--)
    x[i] = y[127-i] + z[127-i];
}

I don't know why GCC only implements negative step for loads, but not
stores. I implemented a patch, very similar to the code in
vectorizable_load.

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
	addq	$254, %rdi
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	movzwl	(%rsi,%rax), %ecx
	subq	$2, %rdi
	addw	(%rdx,%rax), %cx
	addq	$2, %rax
	movw	%cx, 2(%rdi)
	cmpq	$256, %rax
	jne	.L2
	rep; ret

With patch:
test1:
.LFB0:
	vmovdqa	.LC0(%rip), %xmm1
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	vmovdqu	(%rsi,%rax), %xmm0
	movq	%rax, %rcx
	negq	%rcx
	vpaddw	(%rdx,%rax), %xmm0, %xmm0
	vpshufb	%xmm1, %xmm0, %xmm0
	addq	$16, %rax
	cmpq	$256, %rax
	vmovups	%xmm0, 240(%rdi,%rcx)
	jne	.L2
	rep; ret

Performance is definitely improved here. It is bootstrapped for
x86_64-unknown-linux-gnu, and has no additional regressions on my machine.

For reference, LLVM seems to use different instructions and slightly worse
code. I am not so familiar with x86 assembly code. The patch is originally
for our private port.

test1:                                  # @test1
	.cfi_startproc
# BB#0:                                 # %entry
	addq	$240, %rdi
	xorl	%eax, %eax
	.align	16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	movdqu	(%rsi,%rax,2), %xmm0
	movdqu	(%rdx,%rax,2), %xmm1
	paddw	%xmm0, %xmm1
	shufpd	$1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
	pshuflw	$27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
	pshufhw	$27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
	movdqu	%xmm0, (%rdi)
	addq	$8, %rax
	addq	$-16, %rdi
	cmpq	$128, %rax
	jne	.LBB0_1
# BB#2:                                 # %for.end
	ret

Any comment?

Looks good to me. One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.

Bingfeng Mei
Broadcom UK
Re: [PING] RE: [PATCH] Vectorization for store with negative step
On Fri, Dec 20, 2013 at 2:09 AM, Bingfeng Mei wrote:
> OK to commit?
>
> Thanks,
> Bingfeng
> -----Original Message-----
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On
> Behalf Of Bingfeng Mei
> Sent: 18 December 2013 16:25
> To: Jakub Jelinek
> Cc: Richard Biener; gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH] Vectorization for store with negative step
>
> Hi, Jakub,
> Sorry for all the formatting issues. Haven't submitted a patch for a while :-).
> Please find the updated patch.

It caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59569

--
H.J.
Re: [PING] RE: [PATCH] Vectorization for store with negative step
On Fri, Dec 20, 2013 at 11:09 AM, Bingfeng Mei wrote:
> OK to commit?

Ok.

Thanks,
Richard.

> Thanks,
> Bingfeng
> -----Original Message-----
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On
> Behalf Of Bingfeng Mei
> Sent: 18 December 2013 16:25
> To: Jakub Jelinek
> Cc: Richard Biener; gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH] Vectorization for store with negative step
>
> Hi, Jakub,
> Sorry for all the formatting issues. Haven't submitted a patch for a while :-).
> Please find the updated patch.
> [...]
[PING] RE: [PATCH] Vectorization for store with negative step
OK to commit?

Thanks,
Bingfeng

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Bingfeng Mei
Sent: 18 December 2013 16:25
To: Jakub Jelinek
Cc: Richard Biener; gcc-patches@gcc.gnu.org
Subject: RE: [PATCH] Vectorization for store with negative step

Hi, Jakub,
Sorry for all the formatting issues. Haven't submitted a patch for a while :-).
Please find the updated patch.

Thanks,
Bingfeng
[...]
RE: [PATCH] Vectorization for store with negative step
Hi, Jakub,
Sorry for all the formatting issues. Haven't submitted a patch for a while :-).
Please find the updated patch.

Thanks,
Bingfeng

-----Original Message-----
From: Jakub Jelinek [mailto:ja...@redhat.com]
Sent: 18 December 2013 13:38
To: Bingfeng Mei
Cc: Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Vectorization for store with negative step

On Wed, Dec 18, 2013 at 01:31:05PM +, Bingfeng Mei wrote:
[...]
Re: [PATCH] Vectorization for store with negative step
On Wed, Dec 18, 2013 at 01:31:05PM +, Bingfeng Mei wrote:

> Index: gcc/ChangeLog
> ===================================================================
> --- gcc/ChangeLog	(revision 206016)
> +++ gcc/ChangeLog	(working copy)
> @@ -1,3 +1,9 @@
> +2013-12-18  Bingfeng Mei
> +
> +	PR tree-optimization/59544
> +        * tree-vect-stmts.c (perm_mask_for_reverse): Move before

This should be a tab instead of 8 spaces.

> +	vectorizable_store. (vectorizable_store): Handle negative step.

Newline and tab after "store.", rather than space.

> Property changes on: gcc/testsuite/gcc.target/i386/pr59544.c
> ___________________________________________________________________
> Added: svn:executable
>    + *

Please don't add such bogus property. Testcases aren't executable.

> Index: gcc/testsuite/ChangeLog
> ===================================================================
> --- gcc/testsuite/ChangeLog	(revision 206016)
> +++ gcc/testsuite/ChangeLog	(working copy)
> @@ -1,3 +1,8 @@
> +2013-12-18  Bingfeng Mei
> +
> +	PR tree-optimization/59544
> +	* gcc.target/i386/pr59544.c: New test

Missing dot at the end of line.

> +
>  2013-12-16  Jakub Jelinek
>
>  	PR middle-end/58956
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c	(revision 206016)
> +++ gcc/tree-vect-stmts.c	(working copy)
> @@ -4859,6 +4859,25 @@ ensure_base_align (stmt_vec_info stmt_in
>  }
>
>
> +/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
> +   reversal of the vector elements.  If that is impossible to do,
> +   returns NULL.  */
> +
> +static tree
> +perm_mask_for_reverse (tree vectype)
> +{
> +  int i, nunits;
> +  unsigned char *sel;
> +
> +  nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  sel = XALLOCAVEC (unsigned char, nunits);
> +
> +  for (i = 0; i < nunits; ++i)
> +    sel[i] = nunits - 1 - i;
> +
> +  return vect_gen_perm_mask (vectype, sel);
> +}
> +
>  /* Function vectorizable_store.
>
>     Check if STMT defines a non scalar data-ref (array/pointer/structure) that
> @@ -4902,6 +4921,8 @@ vectorizable_store (gimple stmt, gimple_
>    vec oprnds = vNULL;
>    vec result_chain = vNULL;
>    bool inv_p;
> +  bool negative = false;
> +  tree offset = NULL_TREE;
>    vec vec_oprnds = vNULL;
>    bool slp = (slp_node != NULL);
>    unsigned int vec_num;
> @@ -4976,16 +4997,38 @@ vectorizable_store (gimple stmt, gimple_
>    if (!STMT_VINFO_DATA_REF (stmt_info))
>      return false;
>
> -  if (tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
> -			    ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
> -			    size_zero_node) < 0)
> +  negative = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
> +				   ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
> +				   size_zero_node) < 0;

The formatting looks wrong, do:

  negative
    = tree_int_cst_compare (loop && nested_in_vect_loop_p (loop, stmt)
			    ? STMT_VINFO_DR_STEP (stmt_info) : DR_STEP (dr),
			    size_zero_node) < 0;

instead.

> +  if (negative && ncopies > 1)
>      {
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                         "negative step for store.\n");
> +                         "multiple types with negative step.");
>        return false;
>      }
>
> +  if (negative)
> +    {
> +      gcc_assert (!grouped_store);
> +      alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
> +      if (alignment_support_scheme != dr_aligned
> +          && alignment_support_scheme != dr_unaligned_supported)

Lots of places where you use 8 spaces instead of tab, please fix.

> +    offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);
> +
>    if (store_lanes_p)
>      aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
>    else
> @@ -5200,7 +5246,7 @@ vectorizable_store (gimple stmt, gimple_
>  	  dataref_ptr
>  	    = vect_create_data_ref_ptr (first_stmt, aggr_type,
>  					simd_lane_access_p ? loop : NULL,
> -					NULL_TREE, &dummy, gsi, &ptr_incr,
> +					offset, &dummy, gsi, &ptr_incr,
>  					simd_lane_access_p, &inv_p);
>  	  gcc_assert (bb_vinfo || !inv_p);
>  	}
> @@ -5306,6 +5352,21 @@ vectorizable_store (gimple stmt, gimple_
>  	    set_ptr_info_alignment (get_ptr_info (dataref_ptr), align, misalign);
> +      if (negative)
> +	{
> +	  tree perm_mask = perm_mask_for_reverse (vectype);
> +	  tree perm_dest
> +	    = vect_create_destination_var (gimple_assign_rhs1 (stmt), vectype);
> +	  tree new_temp = make_ssa_name (perm_dest, NULL);
> +
> +	  /* Generate the permute statement.  */
> +	  gimple perm_stmt
> +	    = gimple_build_assign_with_ops (VEC_PERM_EXPR,
[PATCH] Vectorization for store with negative step
Hi, I created PR59544 and here is the patch. OK to commit?

Thanks,
Bingfeng

2013-12-18  Bingfeng Mei

	PR tree-optimization/59544
	* tree-vect-stmts.c (perm_mask_for_reverse): Move before
	vectorizable_store.
	(vectorizable_store): Handle negative step.

2013-12-18  Bingfeng Mei

	PR tree-optimization/59544
	* gcc.target/i386/pr59544.c: New test.

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Biener
Sent: 18 December 2013 11:47
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch. For
> the perm_mask_for_reverse function, should I move it before
> vectorizable_store or add a declaration?

Move it.

Richard.
[...]

patch_vec_store
Description: patch_vec_store
Re: Vectorization for store with negative step
On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch. For
> the perm_mask_for_reverse function, should I move it before
> vectorizable_store or add a declaration?

Move it.

Richard.

> Bingfeng
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 18 December 2013 11:26
> To: Bingfeng Mei
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: Vectorization for store with negative step
> [...]
RE: Vectorization for store with negative step
Thanks, Richard. I will file a bug report and prepare a complete patch. For
the perm_mask_for_reverse function, should I move it before vectorizable_store
or add a declaration?

Bingfeng

-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 18 December 2013 11:26
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei wrote:
> [...]
> Any comment?

Looks good to me. One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.
Re: Vectorization for store with negative step
On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei wrote:
> Hi,
> I was looking at some loops that can be vectorized by LLVM, but not GCC. One
> type of loop is a store with negative step.
> [...]
> Any comment?

Looks good to me. One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.

> Bingfeng Mei
> Broadcom UK
Vectorization for store with negative step
Hi,
I was looking at some loops that can be vectorized by LLVM, but not GCC. One
type of loop is a store with negative step.

void test1 (short *__restrict__ x, short *__restrict__ y,
            short *__restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--)
    x[i] = y[127-i] + z[127-i];
}

I don't know why GCC only implements negative step for loads, but not stores.
I implemented a patch, very similar to the code in vectorizable_load.

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
	addq	$254, %rdi
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	movzwl	(%rsi,%rax), %ecx
	subq	$2, %rdi
	addw	(%rdx,%rax), %cx
	addq	$2, %rax
	movw	%cx, 2(%rdi)
	cmpq	$256, %rax
	jne	.L2
	rep; ret

With patch:
test1:
.LFB0:
	vmovdqa	.LC0(%rip), %xmm1
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	vmovdqu	(%rsi,%rax), %xmm0
	movq	%rax, %rcx
	negq	%rcx
	vpaddw	(%rdx,%rax), %xmm0, %xmm0
	vpshufb	%xmm1, %xmm0, %xmm0
	addq	$16, %rax
	cmpq	$256, %rax
	vmovups	%xmm0, 240(%rdi,%rcx)
	jne	.L2
	rep; ret

Performance is definitely improved here. It is bootstrapped for
x86_64-unknown-linux-gnu, and has no additional regressions on my machine.

For reference, LLVM seems to use different instructions and slightly worse
code. I am not so familiar with x86 assembly code. The patch is originally
for our private port.

test1:                                  # @test1
	.cfi_startproc
# BB#0:                                 # %entry
	addq	$240, %rdi
	xorl	%eax, %eax
	.align	16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	movdqu	(%rsi,%rax,2), %xmm0
	movdqu	(%rdx,%rax,2), %xmm1
	paddw	%xmm0, %xmm1
	shufpd	$1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
	pshuflw	$27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
	pshufhw	$27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
	movdqu	%xmm0, (%rdi)
	addq	$8, %rax
	addq	$-16, %rdi
	cmpq	$128, %rax
	jne	.LBB0_1
# BB#2:                                 # %for.end
	ret

Any comment?

Bingfeng Mei
Broadcom UK

patch
Description: patch