Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2014-07-09 Thread Cong Hou
Ping?

thanks,
Cong


On Tue, Jul 8, 2014 at 8:23 PM, Xinliang David Li davi...@google.com wrote:
 Cong, can you ping this patch again? There do not seem to be any
 pending comments left.

 David

 On Tue, Dec 17, 2013 at 10:05 AM, Cong Hou co...@google.com wrote:
 Ping?


 thanks,
 Cong


 On Mon, Dec 2, 2013 at 5:02 PM, Cong Hou co...@google.com wrote:
 Any comment on this patch?


 thanks,
 Cong


 On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou co...@google.com wrote:
 On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 21 Nov 2013, Cong Hou wrote:

 On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr 
 wrote:

 On Thu, 21 Nov 2013, Cong Hou wrote:

 While I added the new define_insn_and_split for vec_merge, a bug is
 exposed: in config/i386/sse.md, the define_expand "xop_vmfrcz<mode>2"
 only takes one input, but the corresponding builtin functions have two
 inputs, which are shown in i386.c:

  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
 "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
 (int)MULTI_ARG_2_SF },
  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
 "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
 (int)MULTI_ARG_2_DF },

 In consequence, the ix86_expand_multi_arg_builtin() function tries to
 check two args, but based on the define_expand of xop_vmfrcz<mode>2,
 the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
 incorrect (because it only needs one input).

 The patch below fixes this issue.

 Bootstrapped and tested on an x86-64 machine. Note that this patch
 should be applied before the one I sent earlier (sorry for sending
 them in the wrong order).



 This is PR 56788. Your patch seems strange to me and I don't think it
 fixes the real issue, but I'll let more knowledgeable people answer.



 Thank you for pointing out the bug report. This patch is not intended
 to fix PR56788.


 IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
 associated builtin, which would solve your issue as well.


 I agree. Then I will wait until your patch is merged into the trunk;
 otherwise my patch cannot pass the tests.




 For your function:

 #include <x86intrin.h>
 __m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
 }

 Note that the second parameter is ignored intentionally, but the
 prototype of this function contains two parameters. My fix
 explicitly tells GCC that the optab xop_vmfrczv4sf3 should have
 three operands instead of two, to let it have the correct information
 in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
 match the type of the second parameter in the builtin function in
 ix86_expand_multi_arg_builtin().


 I disagree that this is intentional; it is a bug. AFAIK there is no AMD
 documentation that could be used as a reference for what _mm_frcz_sd is
 supposed to do. The only existing documentations are by Microsoft (which
 does *not* ignore the second argument) and by LLVM (which has a single
 argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
 single argument, and if necessary we'll use 2 builtins to implement
 _mm_frcz_sd.



 I also only found the one by Microsoft. If the second argument is
 ignored, we could just remove it, as long as there is no standard
 that requires two arguments. Hopefully it won't break current projects
 using _mm_frcz_sd.

 Thank you for your comments!


 Cong


 --
 Marc Glisse


Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-06-25 Thread Cong Hou
On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sat, May 3, 2014 at 2:39 AM, Cong Hou co...@google.com wrote:
 On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:

 int a[N];
 short b[N*2];

 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];


 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).

 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):

 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)


 With this patch, the generated assembly is shown below:

 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)


 Bootstrapped and tested on x86-64. OK for trunk?

 This is an odd place to implement such a transform.  Also if it
 is faster or not depends on the exact ISA you target - for
 example ppc has constraints on the maximum number of shifts
 carried out in parallel and the above has 4 in very short
 succession.  Esp. for the sign-extend path.

 Thank you for the information about ppc. If this is an issue, I think
 we can do it in a target-dependent way.



 So this looks more like an opportunity for a post-vectorizer
 transform on RTL or for the vectorizer special-casing
 widening loads with a vectorizer pattern.

 I am not sure if the RTL transform is more difficult to implement. I
 prefer the widening loads method, which can be detected in a pattern
 recognizer. The target related issue will be resolved by only
 expanding the widening load on those targets where this pattern is
 beneficial. But this requires new tree operations to be defined. What
 is your suggestion?

 I apologize for the delayed reply.

 Likewise ;)

 I suggest to implement this optimization in vector lowering in
 tree-vect-generic.c.  This sees for your example

   vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
   vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
   vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34,
 { 0, 2, 4, 6, 8, 10, 12, 14 }>;
   vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
   vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

 where you can apply the pattern matching and transform (after checking
 with the target, of course).

This sounds good to me! I'll try to make a patch following your suggestion.

Thank you!


Cong


 Richard.


 thanks,
 Cong


 Richard.


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2014-06-24 Thread Cong Hou
OK. Thank you very much for your review, Richard!

thanks,
Cong


On Tue, Jun 24, 2014 at 4:19 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On Tue, Dec 3, 2013 at 2:06 AM, Cong Hou co...@google.com wrote:
 Hi Richard

 Could you please take a look at this patch and see if it is ready for
 the trunk? The patch is pasted as a text file here again.

 (found it)

 The patch is ok for trunk.  (please consider re-testing before you commit)

 Thanks,
 Richard.

 Thank you very much!


 Cong


 On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
 Hi James

 Sorry for the late reply.


 On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  of the loop (if one instruction can do this), but you have to make
  sure other elements in the vector should remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior to usad. Actually I copied the text from there and made some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning
 a value of the same (widened) type as arg3.



 I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
 mentioned it in tree.def).


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


 It has been 8 years... I have no idea why this patch was never
 accepted; there is not even a reply in that thread. But I believe
 recognizing the SAD pattern is very important. ARM also provides
 instructions for it.


 Thank you for your comment again!


 thanks,
 Cong



 Thanks,
 James



Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2014-06-23 Thread Cong Hou
It has been 8 months since this patch was posted. I have addressed all
comments on it.

The SAD pattern is very useful for multimedia software like
ffmpeg. This patch will greatly improve the performance of such
algorithms. Could you please have a look again and check if it is OK
for the trunk? If necessary, I can re-post this patch in a new
thread.

Thank you!


Cong


On Tue, Dec 17, 2013 at 10:04 AM, Cong Hou co...@google.com wrote:

 Ping?


 thanks,
 Cong


 On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou co...@google.com wrote:
  Hi Richard
 
  Could you please take a look at this patch and see if it is ready for
  the trunk? The patch is pasted as a text file here again.
 
  Thank you very much!
 
 
  Cong
 
 
  On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
  Hi James
 
  Sorry for the late reply.
 
 
  On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
  james.greenha...@arm.com wrote:
  On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
   Thank you for your detailed explanation.
  
   Once GCC detects a reduction operation, it will automatically
   accumulate all elements in the vector after the loop. In the loop the
   reduction variable is always a vector whose elements are reductions of
   corresponding values from other vectors. Therefore in your case the
   only instruction you need to generate is:
  
   VABAL   ops[3], ops[1], ops[2]
  
   It is OK if you accumulate the elements into one in the vector inside
   of the loop (if one instruction can do this), but you have to make
   sure other elements in the vector should remain zero so that the final
   result is correct.
  
   If you are confused about the documentation, check the one for
   udot_prod (just above usad in md.texi), as it has very similar
   behavior to usad. Actually I copied the text from there and made some
   changes. As those two instruction patterns are both for vectorization,
   their behavior should not be difficult to explain.
  
   If you have more questions or think that the documentation is still
   improper please let me know.
 
  Hi Cong,
 
  Thanks for your reply.
 
  I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
  DOT_PROD_EXPR and I see that the same ambiguity exists for
  DOT_PROD_EXPR. Can you please add a note in your tree.def
  that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
 
tmp = WIDEN_MINUS_EXPR (arg1, arg2)
tmp2 = ABS_EXPR (tmp)
arg3 = PLUS_EXPR (tmp2, arg3)
 
  or:
 
tmp = WIDEN_MINUS_EXPR (arg1, arg2)
tmp2 = ABS_EXPR (tmp)
arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
 
  Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning
  a value of the same (widened) type as arg3.
 
 
 
  I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
  mentioned it in tree.def).
 
 
  Also, while looking for the history of DOT_PROD_EXPR I spotted this
  patch:
 
[autovect] [patch] detect mult-hi and sad patterns
http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
 
  I wonder what the reason was for that patch to be dropped?
 
 
  It has been 8 years... I have no idea why this patch was never
  accepted; there is not even a reply in that thread. But I believe
  recognizing the SAD pattern is very important. ARM also provides
  instructions for it.
 
 
  Thank you for your comment again!
 
 
  thanks,
  Cong
 
 
 
  Thanks,
  James
 


Re: [PATCH] A new reload-rewrite pattern recognizer for GCC vectorizer.

2014-05-22 Thread Cong Hou
Ping?


thanks,
Cong


On Wed, Apr 30, 2014 at 1:28 PM, Cong Hou co...@google.com wrote:
 Thank you for reminding me of the omp possibility. Yes, in this case my
 pattern is incorrect, because I assume all aliases will be resolved by
 alias checks, which may not be true with omp.

 LOOP_VINFO_NO_DATA_DEPENDENCIES or LOOP_REQUIRES_VERSIONING_FOR_ALIAS
 may not help here because vect_pattern_recog() is called prior to
 vect_analyze_data_ref_dependences() in vect_analyze_loop_2().

 So can we detect if omp is used in the pattern recognizer? Like
 checking loop->force_vectorize? Is there any other case in which my
 assumption does not hold?


 thanks,
 Cong


 On Sat, Apr 26, 2014 at 12:54 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Thu, Apr 24, 2014 at 05:32:54PM -0700, Cong Hou wrote:
 In this patch a new reload-rewrite pattern detector is composed to
 handle the following pattern in the loop being vectorized:

x = *p;
...
y = *p;

or

*p = x;
...
y = *p;

 In both cases, *p is reloaded because there may exist other defs to
 another memref that may alias with p.  However, aliasing is eliminated
 with alias checks.  Then we can safely replace the last statement in
 above cases by y = x.

 Not safely, at least not for #pragma omp simd/#pragma simd/#pragma ivdep
 loops if alias analysis hasn't proven there is no aliasing.

 So, IMNSHO you need to guard this with LOOP_VINFO_NO_DATA_DEPENDENCIES,
 assuming it has been computed at that point already (otherwise you need to
 do it elsewhere).

 Consider:

 void
 foo (int *p, int *q)
 {
   int i;
   #pragma omp simd safelen(16)
   for (i = 0; i < 128; i++)
 {
   int x = *p;
   *q += 8;
   *p = *p + x;
   p++;
   q++;
 }
 }

 It is valid to call the above with completely unrelated p and q, but
 also e.g. p == q, or q == p + 16 or p == q + 16.
 Your patch would certainly break it e.g. for p == q.

 Jakub


Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-05-02 Thread Cong Hou
On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:

 int a[N];
 short b[N*2];

 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];


 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).

 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):

 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)


 With this patch, the generated assembly is shown below:

 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)


 Bootstrapped and tested on x86-64. OK for trunk?

 This is an odd place to implement such a transform.  Also if it
 is faster or not depends on the exact ISA you target - for
 example ppc has constraints on the maximum number of shifts
 carried out in parallel and the above has 4 in very short
 succession.  Esp. for the sign-extend path.

Thank you for the information about ppc. If this is an issue, I think
we can do it in a target-dependent way.



 So this looks more like an opportunity for a post-vectorizer
 transform on RTL or for the vectorizer special-casing
 widening loads with a vectorizer pattern.

I am not sure if the RTL transform is more difficult to implement. I
prefer the widening loads method, which can be detected in a pattern
recognizer. The target related issue will be resolved by only
expanding the widening load on those targets where this pattern is
beneficial. But this requires new tree operations to be defined. What
is your suggestion?

I apologize for the delayed reply.


thanks,
Cong


 Richard.


Re: [PATCH] A new reload-rewrite pattern recognizer for GCC vectorizer.

2014-04-30 Thread Cong Hou
Thank you for reminding me of the omp possibility. Yes, in this case my
pattern is incorrect, because I assume all aliases will be resolved by
alias checks, which may not be true with omp.

LOOP_VINFO_NO_DATA_DEPENDENCIES or LOOP_REQUIRES_VERSIONING_FOR_ALIAS
may not help here because vect_pattern_recog() is called prior to
vect_analyze_data_ref_dependences() in vect_analyze_loop_2().

So can we detect if omp is used in the pattern recognizer? Like
checking loop->force_vectorize? Is there any other case in which my
assumption does not hold?


thanks,
Cong


On Sat, Apr 26, 2014 at 12:54 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Thu, Apr 24, 2014 at 05:32:54PM -0700, Cong Hou wrote:
 In this patch a new reload-rewrite pattern detector is composed to
 handle the following pattern in the loop being vectorized:

x = *p;
...
y = *p;

or

*p = x;
...
y = *p;

 In both cases, *p is reloaded because there may exist other defs to
 another memref that may alias with p.  However, aliasing is eliminated
 with alias checks.  Then we can safely replace the last statement in
 above cases by y = x.

 Not safely, at least not for #pragma omp simd/#pragma simd/#pragma ivdep
 loops if alias analysis hasn't proven there is no aliasing.

 So, IMNSHO you need to guard this with LOOP_VINFO_NO_DATA_DEPENDENCIES,
 assuming it has been computed at that point already (otherwise you need to
 do it elsewhere).

 Consider:

 void
 foo (int *p, int *q)
 {
   int i;
   #pragma omp simd safelen(16)
   for (i = 0; i < 128; i++)
 {
   int x = *p;
   *q += 8;
   *p = *p + x;
   p++;
   q++;
 }
 }

 It is valid to call the above with completely unrelated p and q, but
 also e.g. p == q, or q == p + 16 or p == q + 16.
 Your patch would certainly break it e.g. for p == q.

 Jakub


[PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-04-24 Thread Cong Hou
Given the following loop:

int a[N];
short b[N*2];

for (int i = 0; i < N; ++i)
  a[i] = b[i*2];


After being vectorized, the access to b[i*2] will be compiled into
several packing statements, while the type promotion from short to int
will be compiled into several unpacking statements. With this patch,
each pair of pack/unpack statements will be replaced by less expensive
statements (with shift or bit-and operations).

On x86_64, the loop above will be compiled into the following assembly
(with -O2 -ftree-vectorize):

movdqu 0x10(%rcx),%xmm3
movdqu -0x20(%rcx),%xmm0
movdqa %xmm0,%xmm2
punpcklwd %xmm3,%xmm0
punpckhwd %xmm3,%xmm2
movdqa %xmm0,%xmm3
punpcklwd %xmm2,%xmm0
punpckhwd %xmm2,%xmm3
movdqa %xmm1,%xmm2
punpcklwd %xmm3,%xmm0
pcmpgtw %xmm0,%xmm2
movdqa %xmm0,%xmm3
punpckhwd %xmm2,%xmm0
punpcklwd %xmm2,%xmm3
movups %xmm0,-0x10(%rdx)
movups %xmm3,-0x20(%rdx)


With this patch, the generated assembly is shown below:

movdqu 0x10(%rcx),%xmm0
movdqu -0x20(%rcx),%xmm1
pslld  $0x10,%xmm0
psrad  $0x10,%xmm0
pslld  $0x10,%xmm1
movups %xmm0,-0x10(%rdx)
psrad  $0x10,%xmm1
movups %xmm1,-0x20(%rdx)


Bootstrapped and tested on x86-64. OK for trunk?


thanks,
Cong
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..e7143f1 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-04-23  Cong Hou  co...@google.com
+
+   * tree-vect-stmts.c (detect_pack_unpack_pattern): New function.
+   (vect_gen_widened_results_half): Call detect_pack_unpack_pattern.
+
 2014-04-23  David Malcolm  dmalc...@redhat.com
 
* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..a8755b3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2014-04-23  Cong Hou  co...@google.com
+
+   * gcc.dg/vect/vect-125.c: New test.
+
 2014-04-23  Jeff Law  l...@redhat.com
 
PR tree-optimization/60902
diff --git a/gcc/testsuite/gcc.dg/vect/vect-125.c b/gcc/testsuite/gcc.dg/vect/vect-125.c
new file mode 100644
index 000..988dea6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-125.c
@@ -0,0 +1,122 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <limits.h>
+#include "tree-vect.h"
+
+#define N 64
+
+char b[N];
+unsigned char c[N];
+short d[N];
+unsigned short e[N];
+
+__attribute__((noinline)) void
+test1 ()
+{
+  int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = b[i*2];
+  a[i+N/2] = b[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+  abort ();
+
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = c[i*2];
+  a[i+N/2] = c[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+  abort ();
+
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = d[i*2];
+  a[i+N/2] = d[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+  abort ();
+
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = e[i*2];
+  a[i+N/2] = e[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+  abort ();
+}
+
+__attribute__((noinline)) void
+test2 ()
+{
+  unsigned int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = b[i*2];
+  a[i+N/2] = b[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+  abort ();
+
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = c[i*2];
+  a[i+N/2] = c[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+  abort ();
+
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = d[i*2];
+  a[i+N/2] = d[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+  abort ();
+
+  for (i = 0; i < N/2; i++)
+{
+  a[i] = e[i*2];
+  a[i+N/2] = e[i*2+1];
+}
+  for (i = 0; i < N/2; i++)
+if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+  abort ();
+}
+
+int
+main ()
+{
+  b[0] = CHAR_MIN;
+  c[0] = UCHAR_MAX;
+  d[0] = SHRT_MIN;
+  e[0] = USHRT_MAX;
+
+  int i;
+  for (i = 1; i < N; i++)
+{
+  b[i] = b[i-1] + 1;
+  c[i] = c[i-1] - 1;
+  d[i] = d[i-1] + 1;
+  e[i] = e[i-1] - 1;
+}
+
+  test1 ();
+  test2 ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "A pack-unpack pattern is recognized" 32 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..d0cf1f4 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -3191,6 +3191,174 @@ vectorizable_simd_clone_call (gimple stmt, gimple_stmt_iterator *gsi,
 }
 
 
+/* Function detect_pack_unpack_pattern
+
+   Detect the following pattern:
+
+   S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 0, 2, 4, ... }>;
+   or
+   S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 1, 3, 5, ... }>;
+
+   S2  vect4 = [vec_unpack_lo_expr

[PATCH] A new reload-rewrite pattern recognizer for GCC vectorizer.

2014-04-24 Thread Cong Hou
In this patch a new reload-rewrite pattern detector is composed to
handle the following pattern in the loop being vectorized:

   x = *p;
   ...
   y = *p;

   or

   *p = x;
   ...
   y = *p;

In both cases, *p is reloaded because there may exist other defs to
another memref that may alias with p.  However, aliasing is eliminated
with alias checks.  Then we can safely replace the last statement in
above cases by y = x.

The following rewrite pattern is also detected:

   *p = x;
   ...
   *p = y;

The first write is redundant due to the fact that there is no aliasing
between p and other pointers.  In this case we don't need to vectorize
this write.  Here we replace it with a dummy statement z = x.

Bootstrapped and tested on x86-64.

OK for trunk?


thanks,
Cong
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..59a4388 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2014-04-23  Cong Hou  co...@google.com
+
+   * tree-vect-patterns.c (vect_recog_reload_rewrite_pattern):
+   New function.
+   (vect_vect_recog_func_ptrs): Add new pattern.
+   * tree-vectorizer.h (NUM_PATTERNS):  Update the pattern count.
+
 2014-04-23  David Malcolm  dmalc...@redhat.com
 
* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..2116cd3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2014-04-23  Cong Hou  co...@google.com
+
+   * gcc.dg/vect/vect-reload-rewrite-pattern.c: New test.
+
 2014-04-23  Jeff Law  l...@redhat.com
 
PR tree-optimization/60902
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reload-rewrite-pattern.c b/gcc/testsuite/gcc.dg/vect/vect-reload-rewrite-pattern.c
new file mode 100644
index 000..e75f969
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reload-rewrite-pattern.c
@@ -0,0 +1,61 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+#define N 1000
+int a[N];
+
+void test1 (int *b, int *c)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+{
+  a[i] = c[i];
+  /* Reload of c[i].  */
+  b[i] = c[i];
+}
+}
+
+void test2 (int *b, int *c)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+{
+  c[i] = a[i] + 10;
+  /* Reload of a[i].  */
+  a[i]++;
+  /* Reload of c[i].  */
+  b[i] = c[i];
+}
+}
+
+void test3 (int *b, int *c)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+{
+  c[i] = a[i] & 63;
+  /* Reload of a[i].  */
+  a[i]++;
+  /* Reload of c[i].  */
+  /* Rewrite to c[i].  */
+  c[i]--;
+}
+}
+
+void test4 (_Complex int *b, _Complex int *c, _Complex int *d)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+{
+  b[i] = c[i] + d[i];
+  /* Reload of REALPART_EXPR (c[i]).  */
+  /* Reload of IMAGPART_EXPR (c[i]).  */
+  /* Reload of REALPART_EXPR (d[i]).  */
+  /* Reload of IMAGPART_EXPR (d[i]).  */
+  c[i] = c[i] - d[i];
+}
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_reload_rewrite_pattern: detected" 10 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 5daaf24..38a0fec 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include ssa-iterators.h
 #include stringpool.h
 #include tree-ssanames.h
+#include tree-ssa-sccvn.h
 #include cfgloop.h
 #include expr.h
 #include optabs.h
@@ -70,6 +71,7 @@ static gimple vect_recog_divmod_pattern (vec<gimple> *,
 static gimple vect_recog_mixed_size_cond_pattern (vec<gimple> *,
  tree *, tree *);
 static gimple vect_recog_bool_pattern (vec<gimple> *, tree *, tree *);
+static gimple vect_recog_reload_rewrite_pattern (vec<gimple> *, tree *, tree *);
 static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
vect_recog_widen_mult_pattern,
vect_recog_widen_sum_pattern,
@@ -81,6 +83,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
vect_recog_vector_vector_shift_pattern,
vect_recog_divmod_pattern,
vect_recog_mixed_size_cond_pattern,
+   vect_recog_reload_rewrite_pattern,
vect_recog_bool_pattern};
 
 static inline void
@@ -3019,6 +3022,160 @@ vect_recog_bool_pattern (vec<gimple> *stmts, tree *type_in,
 return NULL;
 }
 
+/* Function vect_recog_reload_rewrite_pattern
+
+   Try to find the following reload pattern:
+
+   x = *p;
+   ...
+   y = *p;
+
+   or
+
+   *p = x;
+   ...
+   y = *p;
+
+   In both cases, *p is reloaded because there may exist other defs to another
+   memref that may alias with p.  However, aliasing is eliminated with alias
+   checks.  Then we can safely replace the last statement in above cases by
+   y = x.
+
+   Also try to detect rewrite pattern:
+
+   *p = x;
+   ...
+   *p = y

[PATCH] Fix PR60896

2014-04-23 Thread Cong Hou
See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60896 for bug report.

The cause of PR60896 is that those statements in PATTERN_DEF_SEQ in
pre-recognized widen-mult pattern are not forwarded to later
recognized dot-product pattern.

Another issue is that the def types of statements in PATTERN_DEF_SEQ
are assigned the def type of the pattern statement. This is
incorrect for a reduction pattern statement, in which case all
statements in PATTERN_DEF_SEQ will be vect_reduction_def, and none
of them will be vectorized later. The def type of a statement in
PATTERN_DEF_SEQ should always be vect_internal_def.

The patch is attached. Bootstrapped and tested on an x86_64 machine.

OK for trunk?


thanks,
Cong
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..0af5e16 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,11 @@
+2014-04-23  Cong Hou  co...@google.com
+
+   PR tree-optimization/60896
+   * tree-vect-patterns.c (vect_recog_dot_prod_pattern): Pick up
+   all statements in PATTERN_DEF_SEQ in recognized widen-mult pattern.
+   (vect_mark_pattern_stmts): Set the def type of all statements in
+   PATTERN_DEF_SEQ as vect_internal_def.
+
 2014-04-23  David Malcolm  dmalc...@redhat.com
 
* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..55bc842 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-04-23  Cong Hou  co...@google.com
+
+   PR tree-optimization/60896
+   * g++.dg/vect/pr60896.cc: New test.
+
 2014-04-23  Jeff Law  l...@redhat.com
 
PR tree-optimization/60902
diff --git a/gcc/testsuite/g++.dg/vect/pr60896.cc b/gcc/testsuite/g++.dg/vect/pr60896.cc
new file mode 100644
index 000..c6ce68b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr60896.cc
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options -O3 } */
+
+struct A
+{
+  int m_fn1 ();
+  short *m_fn2 ();
+};
+
+struct B
+{
+  void *fC;
+};
+
+int a, b;
+unsigned char i;
+void fn1 (unsigned char *p1, A p2)
+{
+  int c = p2.m_fn1 ();
+  for (int d = 0; c; d++)
+{
+  short *e = p2.m_fn2 ();
+  unsigned char *f = &p1[0];
+  for (int g = 0; g < a; g++)
+   {
+ int h = e[0];
+ b += h * f[g];
+   }
+}
+}
+
+void fn2 (A p1, A p2, B p3)
+{
+  int j = p2.m_fn1 ();
+  for (int k = 0; j; k++)
+if (0)
+  ;
+else
+  fn1 (&i, p1);
+  if (p3.fC)
+;
+  else
+;
+}
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 5daaf24..365cf01 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -392,6 +392,8 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
   gcc_assert (STMT_VINFO_DEF_TYPE (stmt_vinfo) == vect_internal_def);
   oprnd00 = gimple_assign_rhs1 (stmt);
   oprnd01 = gimple_assign_rhs2 (stmt);
+  STMT_VINFO_PATTERN_DEF_SEQ (vinfo_for_stmt (last_stmt))
+ = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
 }
   else
 {
@@ -3065,8 +3067,7 @@ vect_mark_pattern_stmts (gimple orig_stmt, gimple pattern_stmt,
}
  gimple_set_bb (def_stmt, gimple_bb (orig_stmt));
  STMT_VINFO_RELATED_STMT (def_stmt_info) = orig_stmt;
- STMT_VINFO_DEF_TYPE (def_stmt_info)
-   = STMT_VINFO_DEF_TYPE (orig_stmt_info);
+ STMT_VINFO_DEF_TYPE (def_stmt_info) = vect_internal_def;
  if (STMT_VINFO_VECTYPE (def_stmt_info) == NULL_TREE)
STMT_VINFO_VECTYPE (def_stmt_info) = pattern_vectype;
}


Re: Fixing PR60773

2014-04-08 Thread Cong Hou
Thanks for the comments, and the attached file is the updated patch.


thanks,
Cong


On Tue, Apr 8, 2014 at 12:28 AM, Rainer Orth
r...@cebitec.uni-bielefeld.de wrote:
 Cong Hou co...@google.com writes:

 In the patch of
 PR60656(http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01668.html), the
 test case requires GCC to vectorize the widen-mult pattern from si to
 di types. This may result in test failures on some platforms that
 don't support this pattern. This patch adds a new target
 vect_widen_mult_si_to_di_pattern to fix this issue.

 Please document the new keyword in gcc/doc/sourcebuild.texi.

 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 414a745..ea860e7 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,11 @@
 +2014-04-07  Cong Hou  co...@google.com
 +
 + PR testsuite/60773
 + * testsuite/lib/target-supports.exp:
 + Add check_effective_target_vect_widen_mult_si_to_di_pattern.
 + * gcc.dg/vect/pr60656.c: Update the test by checking if the targets
 + vect_widen_mult_si_to_di_pattern and vect_long are supported.
 +

 Your mailer is broken: it swallows tabs and breaks long lines.  If you
 can't fix it, please attach patches instead of sending them inline.

 Thanks.

 Rainer

 --
 -
 Rainer Orth, Center for Biotechnology, Bielefeld University
diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index 85ef819..9148608 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -1428,6 +1428,10 @@ Target supports a vector widening multiplication of @code{short} operands
 into @code{int} results, or can promote (unpack) from @code{short} to
 @code{int} and perform non-widening multiplication of @code{int}.
 
+@item vect_widen_mult_si_to_di_pattern
+Target supports a vector widening multiplication of @code{int} operands
+into @code{long} results.
+
 @item vect_sdot_qi
 Target supports a vector dot-product of @code{signed char}.
 
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 414a745..d426e29 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,11 @@
+2014-04-07  Cong Hou  co...@google.com
+
+   PR testsuite/60773
+   * lib/target-supports.exp:
+   Add check_effective_target_vect_widen_mult_si_to_di_pattern.
+   * gcc.dg/vect/pr60656.c: Update the test by checking if the targets
+   vect_widen_mult_si_to_di_pattern and vect_long are supported.
+
 2014-03-28  Cong Hou  co...@google.com
 
PR tree-optimization/60656
diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c
index ebaab62..4950275 100644
--- a/gcc/testsuite/gcc.dg/vect/pr60656.c
+++ b/gcc/testsuite/gcc.dg/vect/pr60656.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_long } */
 
 #include "tree-vect.h"
 
@@ -12,7 +13,7 @@ foo ()
   for(i = 0; i < 4; ++i)
 {
   long P = v[i];
-  s += P*P*P;
+  s += P * P * P;
 }
   return s;
 }
@@ -27,7 +28,7 @@ bar ()
   for(i = 0; i < 4; ++i)
 {
   long P = v[i];
-  s += P*P*P;
+  s += P * P * P;
   __asm__ volatile ("");
 }
   return s;
@@ -35,11 +36,12 @@ bar ()
 
 int main()
 {
+  check_vect ();
+
   if (foo () != bar ())
 abort ();
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_si_to_di_pattern } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
-
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index bee8471..6d9d689 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3732,6 +3732,27 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
 }
 
 # Return 1 if the target plus current options supports a vector
+# widening multiplication of *int* args into *long* result, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_widen_mult_si_to_di_pattern { } {
+    global et_vect_widen_mult_si_to_di_pattern_saved
+
+    if [info exists et_vect_widen_mult_si_to_di_pattern_saved] {
+        verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: using cached result" 2
+    } else {
+        set et_vect_widen_mult_si_to_di_pattern_saved 0
+        if { [istarget ia64-*-*]
+             || [istarget i?86-*-*]
+             || [istarget x86_64-*-*] } {
+            set et_vect_widen_mult_si_to_di_pattern_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: returning $et_vect_widen_mult_si_to_di_pattern_saved" 2
+    return $et_vect_widen_mult_si_to_di_pattern_saved
+}
+
+# Return 1 if the target plus current options supports a vector
 # widening shift, 0 otherwise.
 #
 # This won't change for different subtargets so cache the result.



Re: Fixing PR60773

2014-04-08 Thread Cong Hou
On Tue, Apr 8, 2014 at 12:07 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Mon, Apr 07, 2014 at 12:16:12PM -0700, Cong Hou wrote:
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,11 @@
 +2014-04-07  Cong Hou  co...@google.com
 +
 + PR testsuite/60773
 + * testsuite/lib/target-supports.exp:
 + Add check_effective_target_vect_widen_mult_si_to_di_pattern.

 No testsuite/ prefix here.  Please write it as:
 * lib/target-supports.exp
 (check_effective_target_vect_widen_si_to_di_pattern): New.

Thank you for pointing it out. Corrected.



 --- a/gcc/testsuite/gcc.dg/vect/pr60656.c
 +++ b/gcc/testsuite/gcc.dg/vect/pr60656.c
 @@ -1,5 +1,7 @@
  /* { dg-require-effective-target vect_int } */
 +/* { dg-require-effective-target vect_long } */

 +#include stdarg.h

 I fail to see why you need this include, neither your test nor tree-vect.h
 uses va_*.

I have removed this include.


thanks,
Cong



 Otherwise looks good to me.

 Jakub


Fixing PR60773

2014-04-07 Thread Cong Hou
In the patch of
PR60656(http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01668.html), the
test case requires GCC to vectorize the widen-mult pattern from si to
di types. This may result in test failures on some platforms that
don't support this pattern. This patch adds a new target
vect_widen_mult_si_to_di_pattern to fix this issue.

Bootstrapped and tested on x86_64.

OK for trunk?


thanks,
Cong




diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 414a745..ea860e7 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,11 @@
+2014-04-07  Cong Hou  co...@google.com
+
+ PR testsuite/60773
+ * testsuite/lib/target-supports.exp:
+ Add check_effective_target_vect_widen_mult_si_to_di_pattern.
+ * gcc.dg/vect/pr60656.c: Update the test by checking if the targets
+ vect_widen_mult_si_to_di_pattern and vect_long are supported.
+
 2014-03-28  Cong Hou  co...@google.com

  PR tree-optimization/60656
diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c
index ebaab62..b80e008 100644
--- a/gcc/testsuite/gcc.dg/vect/pr60656.c
+++ b/gcc/testsuite/gcc.dg/vect/pr60656.c
@@ -1,5 +1,7 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_long } */

+#include <stdarg.h>
 #include "tree-vect.h"

 __attribute__ ((noinline)) long
@@ -12,7 +14,7 @@ foo ()
   for(i = 0; i < 4; ++i)
 {
   long P = v[i];
-  s += P*P*P;
+  s += P * P * P;
 }
   return s;
 }
@@ -27,7 +29,7 @@ bar ()
   for(i = 0; i < 4; ++i)
 {
   long P = v[i];
-  s += P*P*P;
+  s += P * P * P;
   __asm__ volatile ("");
 }
   return s;
@@ -35,11 +37,12 @@ bar ()

 int main()
 {
+  check_vect ();
+
   if (foo () != bar ())
 abort ();
   return 0;
 }

-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_si_to_di_pattern } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
-
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index bee8471..6d9d689 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3732,6 +3732,27 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
 }

 # Return 1 if the target plus current options supports a vector
+# widening multiplication of *int* args into *long* result, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_widen_mult_si_to_di_pattern { } {
+    global et_vect_widen_mult_si_to_di_pattern_saved
+
+    if [info exists et_vect_widen_mult_si_to_di_pattern_saved] {
+        verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: using cached result" 2
+    } else {
+        set et_vect_widen_mult_si_to_di_pattern_saved 0
+        if { [istarget ia64-*-*]
+             || [istarget i?86-*-*]
+             || [istarget x86_64-*-*] } {
+            set et_vect_widen_mult_si_to_di_pattern_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: returning $et_vect_widen_mult_si_to_di_pattern_saved" 2
+    return $et_vect_widen_mult_si_to_di_pattern_saved
+}
+
+# Return 1 if the target plus current options supports a vector
 # widening shift, 0 otherwise.
 #
 # This won't change for different subtargets so cache the result.


[PATCH] Fixing PR60656

2014-03-28 Thread Cong Hou
This patch fixes PR60656. Elements in a vector with the
vect_used_by_reduction property cannot be reordered if the use chain
with this property does not have the same operation.

Bootstrapped and tested on an x86-64 machine.

OK for trunk?


thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index e1d8666..d7d5b82 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,11 @@
+2014-03-28  Cong Hou  co...@google.com
+
+ PR tree-optimization/60656
+	* tree-vect-stmts.c (supportable_widening_operation):
+	Fix a bug where elements in a vector with the vect_used_by_reduction
+	property are incorrectly reordered when the operation on them is not
+	consistent with the one in the reduction operation.
+
 2014-03-10  Jakub Jelinek  ja...@redhat.com

  PR ipa/60457
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 41b6875..414a745 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-03-28  Cong Hou  co...@google.com
+
+ PR tree-optimization/60656
+ * gcc.dg/vect/pr60656.c: New test.
+
 2014-03-10  Jakub Jelinek  ja...@redhat.com

  PR ipa/60457
diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c
new file mode 100644
index 0000000..ebaab62
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr60656.c
@@ -0,0 +1,45 @@
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+__attribute__ ((noinline)) long
+foo ()
+{
+  int v[] = {5000, 5001, 5002, 5003};
+  long s = 0;
+  int i;
+
+  for(i = 0; i < 4; ++i)
+{
+  long P = v[i];
+  s += P*P*P;
+}
+  return s;
+}
+
+long
+bar ()
+{
+  int v[] = {5000, 5001, 5002, 5003};
+  long s = 0;
+  int i;
+
+  for(i = 0; i < 4; ++i)
+{
+  long P = v[i];
+  s += P*P*P;
+  __asm__ volatile ("");
+}
+  return s;
+}
+
+int main()
+{
+  if (foo () != bar ())
+abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 70fb411..7442d0c 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -7827,7 +7827,16 @@ supportable_widening_operation (enum tree_code code, gimple stmt,
  stmt, vectype_out, vectype_in,
  code1, code2, multi_step_cvt,
  interm_types))
-  return true;
+  {
+    tree lhs = gimple_assign_lhs (stmt);
+    use_operand_p dummy;
+    gimple use_stmt;
+    stmt_vec_info use_stmt_info = NULL;
+    if (single_imm_use (lhs, &dummy, &use_stmt)
+        && (use_stmt_info = vinfo_for_stmt (use_stmt))
+        && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def)
+      return true;
+  }
   c1 = VEC_WIDEN_MULT_LO_EXPR;
   c2 = VEC_WIDEN_MULT_HI_EXPR;
   break;


Re: [PATCH] Fix PR60505

2014-03-28 Thread Cong Hou
Ping?


thanks,
Cong


On Wed, Mar 19, 2014 at 11:39 AM, Cong Hou co...@google.com wrote:
 On Tue, Mar 18, 2014 at 4:43 AM, Richard Biener rguent...@suse.de wrote:

 On Mon, 17 Mar 2014, Cong Hou wrote:

  On Mon, Mar 17, 2014 at 6:44 AM, Richard Biener rguent...@suse.de wrote:
   On Fri, 14 Mar 2014, Cong Hou wrote:
  
   On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de 
   wrote:
On Fri, 14 Mar 2014, Jakub Jelinek wrote:
   
On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote:
  Consider this fact and if there are alias checks, we can safely 
  remove
  the epilogue if the maximum trip count of the loop is less than 
  or
  equal to the calculated threshold.

 You have to consider n % vf != 0, so an argument on only maximum
 trip count or threshold cannot work.
   
Well, if you only check if maximum trip count is <= vf and you know
that for n < vf the vectorized loop + its epilogue path will not be 
taken,
then perhaps you could, but it is a very special case.
Now, the question is when we are guaranteed we enter the scalar 
versioned
loop instead for n < vf, is that in case of versioning for alias or
versioning for alignment?
   
I think neither - I have plans to do the cost model check together
with the versioning condition but didn't get around to implement that.
That would allow stronger max bounds for the epilogue loop.
  
   In vect_transform_loop(), check_profitability will be set to true if
   th >= VF-1 and the number of iterations is unknown (we only consider
   unknown trip count here), where th is calculated based on the
   parameter PARAM_MIN_VECT_LOOP_BOUND and cost model, with the minimum
   value VF-1. If the loop needs to be versioned, then
   check_profitability with true value will be passed to
   vect_loop_versioning(), in which an enhanced loop bound check
   (considering cost) will be built. So I think if the loop is versioned
   and n < VF, then we must enter the scalar version, and in this case
   removing the epilogue should be safe when the maximum trip count <= th+1.
  
   You mean exactly in the case where the profitability check ensures
   that n % vf == 0?  Thus effectively if n == maximum trip count?
   That's quite a special case, no?
 
 
  Yes, it is a special case. But it is in this special case that those
  warnings are thrown out. Also, I think declaring an array with VF*N as
  length is not unusual.

 Ok, but then for the patch compute the cost model threshold once
 in vect_analyze_loop_2 and store it in a new
 LOOP_VINFO_COST_MODEL_THRESHOLD.


 Done.


 Also you have to check
 the return value from max_stmt_executions_int as that may return
 -1 if the number cannot be computed (or isn't representable in
 a HOST_WIDE_INT).


 It will be converted to unsigned type so that -1 means infinity.


 You also should check for
 LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT which should have the
 same effect on the cost model check.


 Done.




 The existing condition is already complicated enough - adding new
 stuff warrants comments before the (sub-)checks.


 OK. Comments added.

 Below is the revised patch. Bootstrapped and tested on an x86-64 machine.


 Cong



 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index e1d8666..eceefb3 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,18 @@
 +2014-03-11  Cong Hou  co...@google.com
 +
 + PR tree-optimization/60505
 + * tree-vectorizer.h (struct _loop_vec_info): Add th field as the
 + threshold of number of iterations below which no vectorization will be
 + done.
 + * tree-vect-loop.c (new_loop_vec_info):
 + Initialize LOOP_VINFO_COST_MODEL_THRESHOLD.
 + * tree-vect-loop.c (vect_analyze_loop_operations):
 + Set LOOP_VINFO_COST_MODEL_THRESHOLD.
 + * tree-vect-loop.c (vect_transform_loop):
 + Use LOOP_VINFO_COST_MODEL_THRESHOLD.
 + * tree-vect-loop.c (vect_analyze_loop_2): Check the maximum number
 + of iterations of the loop and see if we should build the epilogue.
 +
  2014-03-10  Jakub Jelinek  ja...@redhat.com

   PR ipa/60457
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 41b6875..09ec1c0 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,8 @@
 +2014-03-11  Cong Hou  co...@google.com
 +
 + PR tree-optimization/60505
 + * gcc.dg/vect/pr60505.c: New test.
 +
  2014-03-10  Jakub Jelinek  ja...@redhat.com

   PR ipa/60457
 diff --git a/gcc/testsuite/gcc.dg/vect/pr60505.c b/gcc/testsuite/gcc.dg/vect/pr60505.c
 new file mode 100644
 index 0000000..6940513
 --- /dev/null
 +++ b/gcc/testsuite/gcc.dg/vect/pr60505.c
 @@ -0,0 +1,12 @@
 +/* { dg-do compile } */
 +/* { dg-additional-options "-Wall -Werror" } */
 +
 +void foo(char *in, char *out, int num)
 +{
 +  int i;
 +  char ovec[16] = {0};
 +
 +  for(i = 0; i < num ; ++i)
 +out[i] = (ovec[i] = in[i]);
 +  out[num] = ovec[num/2];
 +}
 diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
 index df6ab6f..1c78e11 100644

Re: [PATCH] Fix PR60505

2014-03-19 Thread Cong Hou
On Tue, Mar 18, 2014 at 4:43 AM, Richard Biener rguent...@suse.de wrote:

 On Mon, 17 Mar 2014, Cong Hou wrote:

  On Mon, Mar 17, 2014 at 6:44 AM, Richard Biener rguent...@suse.de wrote:
   On Fri, 14 Mar 2014, Cong Hou wrote:
  
   On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de 
   wrote:
On Fri, 14 Mar 2014, Jakub Jelinek wrote:
   
On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote:
  Consider this fact and if there are alias checks, we can safely 
  remove
  the epilogue if the maximum trip count of the loop is less than or
  equal to the calculated threshold.

 You have to consider n % vf != 0, so an argument on only maximum
 trip count or threshold cannot work.
   
Well, if you only check if maximum trip count is <= vf and you know
that for n < vf the vectorized loop + its epilogue path will not be 
taken,
then perhaps you could, but it is a very special case.
Now, the question is when we are guaranteed we enter the scalar 
versioned
loop instead for n < vf, is that in case of versioning for alias or
versioning for alignment?
   
I think neither - I have plans to do the cost model check together
with the versioning condition but didn't get around to implement that.
That would allow stronger max bounds for the epilogue loop.
  
   In vect_transform_loop(), check_profitability will be set to true if
   th >= VF-1 and the number of iterations is unknown (we only consider
   unknown trip count here), where th is calculated based on the
   parameter PARAM_MIN_VECT_LOOP_BOUND and cost model, with the minimum
   value VF-1. If the loop needs to be versioned, then
   check_profitability with true value will be passed to
   vect_loop_versioning(), in which an enhanced loop bound check
   (considering cost) will be built. So I think if the loop is versioned
   and n < VF, then we must enter the scalar version, and in this case
   removing the epilogue should be safe when the maximum trip count <= th+1.
  
   You mean exactly in the case where the profitability check ensures
   that n % vf == 0?  Thus effectively if n == maximum trip count?
   That's quite a special case, no?
 
 
  Yes, it is a special case. But it is in this special case that those
  warnings are thrown out. Also, I think declaring an array with VF*N as
  length is not unusual.

 Ok, but then for the patch compute the cost model threshold once
 in vect_analyze_loop_2 and store it in a new
 LOOP_VINFO_COST_MODEL_THRESHOLD.


Done.


 Also you have to check
 the return value from max_stmt_executions_int as that may return
 -1 if the number cannot be computed (or isn't representable in
 a HOST_WIDE_INT).


It will be converted to unsigned type so that -1 means infinity.


 You also should check for
 LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT which should have the
 same effect on the cost model check.


Done.




 The existing condition is already complicated enough - adding new
 stuff warrants comments before the (sub-)checks.


OK. Comments added.

Below is the revised patch. Bootstrapped and tested on an x86-64 machine.


Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index e1d8666..eceefb3 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,18 @@
+2014-03-11  Cong Hou  co...@google.com
+
+ PR tree-optimization/60505
+ * tree-vectorizer.h (struct _loop_vec_info): Add th field as the
+ threshold of number of iterations below which no vectorization will be
+ done.
+ * tree-vect-loop.c (new_loop_vec_info):
+ Initialize LOOP_VINFO_COST_MODEL_THRESHOLD.
+ * tree-vect-loop.c (vect_analyze_loop_operations):
+ Set LOOP_VINFO_COST_MODEL_THRESHOLD.
+ * tree-vect-loop.c (vect_transform_loop):
+ Use LOOP_VINFO_COST_MODEL_THRESHOLD.
+ * tree-vect-loop.c (vect_analyze_loop_2): Check the maximum number
+ of iterations of the loop and see if we should build the epilogue.
+
 2014-03-10  Jakub Jelinek  ja...@redhat.com

  PR ipa/60457
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 41b6875..09ec1c0 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-03-11  Cong Hou  co...@google.com
+
+ PR tree-optimization/60505
+ * gcc.dg/vect/pr60505.c: New test.
+
 2014-03-10  Jakub Jelinek  ja...@redhat.com

  PR ipa/60457
diff --git a/gcc/testsuite/gcc.dg/vect/pr60505.c b/gcc/testsuite/gcc.dg/vect/pr60505.c
new file mode 100644
index 0000000..6940513
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr60505.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Wall -Werror" } */
+
+void foo(char *in, char *out, int num)
+{
+  int i;
+  char ovec[16] = {0};
+
+  for(i = 0; i < num ; ++i)
+out[i] = (ovec[i] = in[i]);
+  out[num] = ovec[num/2];
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index df6ab6f..1c78e11 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -933,6 +933,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_NITERS (res) = NULL

Re: [PATCH] Fix PR60505

2014-03-17 Thread Cong Hou
On Mon, Mar 17, 2014 at 6:44 AM, Richard Biener rguent...@suse.de wrote:
 On Fri, 14 Mar 2014, Cong Hou wrote:

 On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de wrote:
  On Fri, 14 Mar 2014, Jakub Jelinek wrote:
 
  On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote:
Consider this fact and if there are alias checks, we can safely remove
the epilogue if the maximum trip count of the loop is less than or
equal to the calculated threshold.
  
   You have to consider n % vf != 0, so an argument on only maximum
   trip count or threshold cannot work.
 
  Well, if you only check if maximum trip count is <= vf and you know
  that for n < vf the vectorized loop + its epilogue path will not be 
  taken,
  then perhaps you could, but it is a very special case.
  Now, the question is when we are guaranteed we enter the scalar versioned
  loop instead for n < vf, is that in case of versioning for alias or
  versioning for alignment?
 
  I think neither - I have plans to do the cost model check together
  with the versioning condition but didn't get around to implement that.
  That would allow stronger max bounds for the epilogue loop.

 In vect_transform_loop(), check_profitability will be set to true if
 th >= VF-1 and the number of iterations is unknown (we only consider
 unknown trip count here), where th is calculated based on the
 parameter PARAM_MIN_VECT_LOOP_BOUND and cost model, with the minimum
 value VF-1. If the loop needs to be versioned, then
 check_profitability with true value will be passed to
 vect_loop_versioning(), in which an enhanced loop bound check
 (considering cost) will be built. So I think if the loop is versioned
 and n < VF, then we must enter the scalar version, and in this case
 removing the epilogue should be safe when the maximum trip count <= th+1.

 You mean exactly in the case where the profitability check ensures
 that n % vf == 0?  Thus effectively if n == maximum trip count?
 That's quite a special case, no?


Yes, it is a special case. But it is in this special case that those
warnings are thrown out. Also, I think declaring an array with VF*N as
length is not unusual.


thanks,
Cong



 Richard.

 --
 Richard Biener rguent...@suse.de
 SUSE / SUSE Labs
 SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
 GF: Jeff Hawn, Jennifer Guild, Felix Imendorffer


Re: [PATCH] Fix PR60505

2014-03-14 Thread Cong Hou
On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de wrote:
 On Fri, 14 Mar 2014, Jakub Jelinek wrote:

 On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote:
   Consider this fact and if there are alias checks, we can safely remove
   the epilogue if the maximum trip count of the loop is less than or
   equal to the calculated threshold.
 
  You have to consider n % vf != 0, so an argument on only maximum
  trip count or threshold cannot work.

 Well, if you only check if maximum trip count is <= vf and you know
 that for n < vf the vectorized loop + its epilogue path will not be taken,
 then perhaps you could, but it is a very special case.
 Now, the question is when we are guaranteed we enter the scalar versioned
 loop instead for n < vf, is that in case of versioning for alias or
 versioning for alignment?

 I think neither - I have plans to do the cost model check together
 with the versioning condition but didn't get around to implement that.
 That would allow stronger max bounds for the epilogue loop.

In vect_transform_loop(), check_profitability will be set to true if
th >= VF-1 and the number of iterations is unknown (we only consider
unknown trip count here), where th is calculated based on the
parameter PARAM_MIN_VECT_LOOP_BOUND and cost model, with the minimum
value VF-1. If the loop needs to be versioned, then
check_profitability with true value will be passed to
vect_loop_versioning(), in which an enhanced loop bound check
(considering cost) will be built. So I think if the loop is versioned
and n < VF, then we must enter the scalar version, and in this case
removing the epilogue should be safe when the maximum trip count <= th+1.


thanks,
Cong



 Richard.


Re: [PATCH] Fix PR60505

2014-03-13 Thread Cong Hou
On Thu, Mar 13, 2014 at 2:27 AM, Richard Biener rguent...@suse.de wrote:
 On Wed, 12 Mar 2014, Cong Hou wrote:

 Thank you for pointing it out. I didn't realize that alias analysis
 has an influence on this issue.

 The current problem is that the epilogue may be unnecessary if the
 loop bound cannot be larger than the number of iterations of the
 vectorized loop multiplied by VF when the vectorized loop is supposed
 to be executed. My method is incorrect because I assume the vectorized
 loop will be executed which is actually guaranteed by loop bound check
 (and also alias checks). So if the alias checks exist, my method is
 fine as both conditions are met.

 But there is still the loop bound check which, if it fails, uses
 the epilogue loop as fallback, not the scalar versioned loop.

The loop bound check is already performed together with alias checks
(assume we need alias checks). Actually, I did observe that the loop
bound check in the true body of alias checks may be unnecessary. For
example, for the following loop

  for(i=0; i < num ; ++i)
out[i] = (ovec[i] = in[i]);

GCC now generates the following GIMPLE code after vectorization:



  <bb 3>: // loop bound check (with cost model) and alias checks
  _29 = (unsigned int) num_5(D);
  _28 = _29 > 15;
  _24 = in_9(D) + 16;
  _23 = out_7(D) >= _24;
  _2 = out_7(D) + 16;
  _1 = _2 <= in_9(D);
  _32 = _1 | _23;
  _31 = _28 & _32;
  if (_31 != 0)
    goto <bb 4>;
  else
    goto <bb 12>;

  <bb 4>:
  niters.3_44 = (unsigned int) num_5(D);
  _46 = niters.3_44 + 4294967280;
  _47 = _46 >> 4;
  bnd.4_45 = _47 + 1;
  ratio_mult_vf.5_48 = bnd.4_45 << 4;
  _59 = (unsigned int) num_5(D);
  _60 = _59 + 4294967295;
  if (_60 <= 14)   <-- is this necessary?
    goto <bb 10>;
  else
    goto <bb 5>;


The check _60 <= 14 should be unnecessary because it is implied by the
fact _29 > 15 in <bb 3>.

Consider this fact and if there are alias checks, we can safely remove
the epilogue if the maximum trip count of the loop is less than or
equal to the calculated threshold.


Cong




 If there is no alias checks, I must
 consider the possibility that the vectorized loop may not be executed
 at runtime and then the epilogue should not be eliminated. The warning
 appears on epilogue, and with loop bound checks (and without alias
 checks) the warning will be gone. So I think the key is alias checks:
 my method only works if there are alias checks.

 How about adding one more condition that checks if alias checks are
 needed, as the code shown below?

   else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
            || (tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
                < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo))
                && (!LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)
                    || (unsigned HOST_WIDE_INT)max_stmt_executions_int
                         (LOOP_VINFO_LOOP (loop_vinfo)) > (unsigned)th)))
     LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;


 thanks,
 Cong


 On Wed, Mar 12, 2014 at 1:24 AM, Jakub Jelinek ja...@redhat.com wrote:
  On Tue, Mar 11, 2014 at 04:16:13PM -0700, Cong Hou wrote:
  This patch is fixing PR60505 in which the vectorizer may produce
  unnecessary epilogues.
 
  Bootstrapped and tested on a x86_64 machine.
 
  OK for trunk?
 
  That looks wrong.  Consider the case where the loop isn't versioned,
  if you disable generation of the epilogue loop, you end up only with
  a vector loop.
 
  Say:
  unsigned char ovec[16] __attribute__((aligned (16))) = { 0 };
  void
  foo (char *__restrict in, char *__restrict out, int num)
  {
int i;
 
in = __builtin_assume_aligned (in, 16);
out = __builtin_assume_aligned (out, 16);
    for (i = 0; i < num; ++i)
  out[i] = (ovec[i] = in[i]);
out[num] = ovec[num / 2];
  }
  -O2 -ftree-vectorize.  Now, consider if this function is called
  with num != 16 (num > 16 is of course invalid, but num 0 to 15 is
  valid and your patch will cause a wrong-code in this case).
 
  Jakub



 --
 Richard Biener rguent...@suse.de
 SUSE / SUSE Labs
 SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
 GF: Jeff Hawn, Jennifer Guild, Felix Imendorffer


Re: [PATCH] Fix PR60505

2014-03-12 Thread Cong Hou
Thank you for pointing it out. I didn't realize that alias analysis
has an influence on this issue.

The current problem is that the epilogue may be unnecessary if the
loop bound cannot be larger than the number of iterations of the
vectorized loop multiplied by VF when the vectorized loop is supposed
to be executed. My method is incorrect because I assume the vectorized
loop will be executed which is actually guaranteed by loop bound check
(and also alias checks). So if the alias checks exist, my method is
fine as both conditions are met. If there is no alias checks, I must
consider the possibility that the vectorized loop may not be executed
at runtime and then the epilogue should not be eliminated. The warning
appears on epilogue, and with loop bound checks (and without alias
checks) the warning will be gone. So I think the key is alias checks:
my method only works if there are alias checks.

How about adding one more condition that checks if alias checks are
needed, as the code shown below?

  else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
           || (tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
               < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo))
               && (!LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)
                   || (unsigned HOST_WIDE_INT)max_stmt_executions_int
                        (LOOP_VINFO_LOOP (loop_vinfo)) > (unsigned)th)))
    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;


thanks,
Cong


On Wed, Mar 12, 2014 at 1:24 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Tue, Mar 11, 2014 at 04:16:13PM -0700, Cong Hou wrote:
 This patch is fixing PR60505 in which the vectorizer may produce
 unnecessary epilogues.

 Bootstrapped and tested on an x86_64 machine.

 OK for trunk?

 That looks wrong.  Consider the case where the loop isn't versioned,
 if you disable generation of the epilogue loop, you end up only with
 a vector loop.

 Say:
 unsigned char ovec[16] __attribute__((aligned (16))) = { 0 };
 void
 foo (char *__restrict in, char *__restrict out, int num)
 {
   int i;

   in = __builtin_assume_aligned (in, 16);
   out = __builtin_assume_aligned (out, 16);
  for (i = 0; i < num; ++i)
 out[i] = (ovec[i] = in[i]);
   out[num] = ovec[num / 2];
 }
 -O2 -ftree-vectorize.  Now, consider if this function is called
 with num != 16 (num > 16 is of course invalid, but num 0 to 15 is
 valid and your patch will cause a wrong-code in this case).

 Jakub


[PATCH] Fix PR60505

2014-03-11 Thread Cong Hou
This patch is fixing PR60505 in which the vectorizer may produce
unnecessary epilogues.

Bootstrapped and tested on an x86_64 machine.

OK for trunk?


thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index e1d8666..f98e628 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2014-03-11  Cong Hou  co...@google.com
+
+ PR tree-optimization/60505
+ * tree-vect-loop.c (vect_analyze_loop_2): Check the maximum number
+ of iterations of the loop and see if we should build the epilogue.
+
 2014-03-10  Jakub Jelinek  ja...@redhat.com

  PR ipa/60457
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 41b6875..09ec1c0 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-03-11  Cong Hou  co...@google.com
+
+ PR tree-optimization/60505
+ * gcc.dg/vect/pr60505.c: New test.
+
 2014-03-10  Jakub Jelinek  ja...@redhat.com

  PR ipa/60457
diff --git a/gcc/testsuite/gcc.dg/vect/pr60505.c
b/gcc/testsuite/gcc.dg/vect/pr60505.c
new file mode 100644
index 0000000..6940513
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr60505.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Wall -Werror" } */
+
+void foo(char *in, char *out, int num)
+{
+  int i;
+  char ovec[16] = {0};
+
+  for(i = 0; i < num ; ++i)
+out[i] = (ovec[i] = in[i]);
+  out[num] = ovec[num/2];
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index df6ab6f..2156d5f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1625,6 +1625,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)
   bool ok, slp = false;
   int max_vf = MAX_VECTORIZATION_FACTOR;
   int min_vf = 2;
+  int th;

   /* Find all data references in the loop (which correspond to vdefs/vuses)
  and analyze their evolution in the loop.  Also adjust the minimal
@@ -1769,6 +1770,12 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)

   /* Decide whether we need to create an epilogue loop to handle
  remaining scalar iterations.  */
+  th = MAX (PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND), 1)
+   * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
+  th = MAX (th, LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo)) + 1;
+  th = (th / LOOP_VINFO_VECT_FACTOR (loop_vinfo))
+   * LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
      && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
 {
@@ -1779,7 +1786,9 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)
 }
   else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
|| (tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))
-       < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
+       < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo))
+       && (unsigned HOST_WIDE_INT)max_stmt_executions_int
+            (LOOP_VINFO_LOOP (loop_vinfo)) > (unsigned)th))
 LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true;

   /* If an epilogue loop is required make sure we can create one.  */


[GOOGLE] Emit a single unaligned load/store instruction for i386 m_GENERIC

2014-02-11 Thread Cong Hou
This small patch lets GCC emit a single unaligned load/store
instruction for m_GENERIC i386 CPUs.

Bootstrapped and passed regression test.

OK for Google branch?


thanks,
Cong


Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c (revision 207701)
+++ gcc/config/i386/i386.c (working copy)
@@ -1903,10 +1903,10 @@ static unsigned int initial_ix86_tune_fe
   m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM  | m_AMDFAM10 | m_BDVER
| m_GENERIC,

   /* X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL */
-  m_COREI7 | m_COREI7_AVX | m_AMDFAM10 | m_BDVER | m_BTVER,
+  m_COREI7 | m_COREI7_AVX | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC,

   /* X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL */
-  m_COREI7 | m_COREI7_AVX | m_BDVER,
+  m_COREI7 | m_COREI7_AVX | m_BDVER | m_GENERIC,

   /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */
   m_BDVER ,


[GOOGLE] Prevent x_flag_complex_method to be set to 2 for C++.

2014-02-11 Thread Cong Hou
With this patch x_flag_complex_method won't be set to 2 for C++ so
that multiply/divide between std::complex objects won't be replaced by
expensive builtin function calls.
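
For background, here is an illustrative sketch of what is at stake (the function name `naive_cmul` is mine, not from the patch): with `x_flag_complex_method == 2`, GCC expands a complex multiplication through a libgcc helper (`__muldc3` for `double`) that adds the C99 Annex G NaN/Inf recovery steps; with a lower setting it emits the naive formula inline, which is what `std::complex` arithmetic gets after this change.

```c
#include <complex.h>
#include <math.h>

/* Naive complex multiply: (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
   This is roughly the inline expansion used when the C99 Annex G
   recovery path (libgcc's __muldc3) is not required.  */
double _Complex
naive_cmul (double _Complex x, double _Complex y)
{
  double a = creal (x), b = cimag (x);
  double c = creal (y), d = cimag (y);
  return (a * c - b * d) + (a * d + b * c) * I;
}
```

For finite operands the two expansions agree, e.g. `naive_cmul (1+2i, 3+4i)` is `-5+10i` either way; they only diverge on NaN/Inf inputs (for example `naive_cmul (INFINITY + 0i, 1 + 0i)` has a NaN imaginary part, where `__muldc3` would attempt a recovery), which is the trade-off this patch accepts for C++.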

Bootstrapped and passed regression test.

OK for Google branch?


thanks,
Cong



Index: gcc/c-family/c-opts.c
===
--- gcc/c-family/c-opts.c (revision 207701)
+++ gcc/c-family/c-opts.c (working copy)
@@ -204,8 +204,10 @@ c_common_init_options_struct (struct gcc
   opts->x_warn_write_strings = c_dialect_cxx ();
   opts->x_flag_warn_unused_result = true;

-  /* By default, C99-like requirements for complex multiply and divide.  */
-  opts->x_flag_complex_method = 2;
+  /* By default, C99-like requirements for complex multiply and divide.
+ But for C++ this should not be required.  */
+  if (c_language != clk_cxx)
+    opts->x_flag_complex_method = 2;
 }

 /* Common initialization before calling option handlers.  */


Re: [PATCH] Fixing PR60000: A bug in the vectorizer.

2014-01-31 Thread Cong Hou
On Fri, Jan 31, 2014 at 5:06 AM, Jakub Jelinek ja...@redhat.com wrote:

 On Fri, Jan 31, 2014 at 09:41:59AM +0100, Richard Biener wrote:
  Is that because si and pattern_def_si point to the same stmts?  Then
  I'd prefer to do
 
if (is_store)
 {
   ...
   pattern_def_seq = NULL;
 }
   else if (!transform_pattern_stmt && gsi_end_p (pattern_def_si))
 {
   pattern_def_seq = NULL;
    gsi_next (&si);
 }

 Yeah, I think stores can only appear at the end of patterns, so IMHO it 
 should be
 safe to just clear pattern_def_seq always in that case.  Right now the code
 has continue; separately for STMT_VINFO_GROUPED_ACCESS (stmt_info) and
 for !STMT_VINFO_GROUPED_ACCESS (stmt_info) stores, but I guess you should
 just move them at the end of if (is_store) and clear pattern_def_seq there
 before the continue.  Add gcc_assert (!transform_pattern_stmt); too?


I agree. I have updated the patch accordingly. Bootstrapped and tested
on x86_64. OK for the trunk?

thanks,
Cong




diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 95a324c..cabcaf8 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2014-01-30  Cong Hou  co...@google.com
+
+   PR tree-optimization/60000
+   * tree-vect-loop.c (vect_transform_loop): Set pattern_def_seq to NULL
+   if the vectorized statement is a store. A store statement can only
+   appear at the end of pattern statements.
+
 2014-01-27  Jakub Jelinek  ja...@redhat.com

PR bootstrap/59934
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index fa61d5c..f2ce70f 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-01-30  Cong Hou  co...@google.com
+
+   PR tree-optimization/60000
+   * g++.dg/vect/pr60000.cc: New test.
+
 2014-01-27  Christian Bruel  christian.br...@st.com

* gcc.target/sh/torture/strncmp.c: New tests.
diff --git a/gcc/testsuite/g++.dg/vect/pr60000.cc
b/gcc/testsuite/g++.dg/vect/pr60000.cc
new file mode 100644
index 0000000..fe39d6a
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr60000.cc
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fno-tree-vrp" } */
+
+void foo (bool* a, int* b)
+{
+  for (int i = 0; i < 1000; ++i)
+{
+  a[i] = i % 2;
+  b[i] = i % 3;
+}
+}
+
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 69c8d21..0e162cb 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6053,7 +6053,6 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 the chain.  */
  gsi_next (&si);
  vect_remove_stores (GROUP_FIRST_ELEMENT (stmt_info));
- continue;
}
  else
{
@@ -6063,11 +6062,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
  unlink_stmt_vdef (store);
  gsi_remove (&si, true);
  release_defs (store);
- continue;
}
-   }

- if (!transform_pattern_stmt && gsi_end_p (pattern_def_si))
+ /* Stores can only appear at the end of pattern statements.  */
+ gcc_assert (!transform_pattern_stmt);
+ pattern_def_seq = NULL;
+   }
+ else if (!transform_pattern_stmt && gsi_end_p (pattern_def_si))
{
  pattern_def_seq = NULL;
  gsi_next (&si);







 Jakub


Re: [PATCH] Fixing PR60000: A bug in the vectorizer.

2014-01-30 Thread Cong Hou
Wrong format. Sending it again.


On Thu, Jan 30, 2014 at 4:57 PM, Cong Hou co...@google.com wrote:

 Hi

 PR60000 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60000) is caused by the GCC 
 vectorizer. The bug appears when handling vectorization patterns. When a 
 pattern statement has additional new statements stored in pattern_def_seq in 
 vect_transform_loop(), those statements are vectorized before the pattern 
 statement. Once all those statements are handled, pattern_def_seq is set to 
 NULL. However, if the pattern statement is a store, pattern_def_seq will not 
 be set to NULL. In consequence, the next pattern statement will not have the 
 correct pattern_def_seq.

 This bug can be fixed by nullifying pattern_def_seq before checking if the 
 vectorized statement is a store. The patch is pasted below. Bootstrapped and 
 tested on x86_64.


 thanks,
 Cong



 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 95a324c..9df0d34 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,10 @@
 +2014-01-30  Cong Hou  co...@google.com
 +
 +   PR tree-optimization/60000
 +   * tree-vect-loop.c (vect_transform_loop): Set pattern_def_seq to NULL
 +   before checking if the vectorized statement is a store. A store
 +   statement can be a pattern one.
 +
  2014-01-27  Jakub Jelinek  ja...@redhat.com

 PR bootstrap/59934
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index fa61d5c..f2ce70f 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,8 @@
 +2014-01-30  Cong Hou  co...@google.com
 +
 +   PR tree-optimization/60000
 +   * g++.dg/vect/pr60000.cc: New test.
 +
  2014-01-27  Christian Bruel  christian.br...@st.com

 * gcc.target/sh/torture/strncmp.c: New tests.
 diff --git a/gcc/testsuite/g++.dg/vect/pr60000.cc 
 b/gcc/testsuite/g++.dg/vect/pr60000.cc
 new file mode 100644
 index 0000000..8a8bd22
 --- /dev/null
 +++ b/gcc/testsuite/g++.dg/vect/pr60000.cc
 @@ -0,0 +1,13 @@
 +/* { dg-do compile } */
 +/* { dg-additional-options "-fno-tree-vrp" } */
 +
 +void foo (bool* a, int* b)
 +{
 +  for (int i = 0; i < 1000; ++i)
 +{
 +  a[i] = i % 2;
 +  b[i] = i % 3;
 +}
 +}
 +
 +/* { dg-final { cleanup-tree-dump "vect" } } */
 diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
 index 69c8d21..8c8bece 100644
 --- a/gcc/tree-vect-loop.c
 +++ b/gcc/tree-vect-loop.c
 @@ -6044,6 +6044,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)

   grouped_store = false;
   is_store = vect_transform_stmt (stmt, &si, &grouped_store, NULL, 
 NULL);
 +
 + if (!transform_pattern_stmt && gsi_end_p (pattern_def_si))
 +   pattern_def_seq = NULL;
 +
if (is_store)
  {
   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
 @@ -6068,10 +6072,7 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 }

   if (!transform_pattern_stmt && gsi_end_p (pattern_def_si))
 -   {
 - pattern_def_seq = NULL;
 - gsi_next (&si);
 -   }
 +   gsi_next (&si);
 }   /* stmts in BB */
  }  /* BBs in loop */





Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.

2014-01-13 Thread Cong Hou
I noticed that LIM could not hoist vector invariant, and that is why
my first implementation tries to hoist them all.

In addition, there are two disadvantages of hoisting invariant load +
lim method:

First, for some instructions the scalar version is faster than the
vector version, and in this case hoisting scalar instructions before
vectorization is better. Those instructions include data
packing/unpacking, integer multiplication with SSE2, etc.

Second, it may use more SIMD registers.

The following code shows a simple example:

char *a, *b, *c;
for (int i = 0; i < N; ++i)
  a[i] = b[0] * c[0] + a[i];

Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0]
into a vector.
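
To make the trade-off concrete, here is an illustrative source-level sketch (the function name, the `restrict` qualifiers, and the fixed trip count are mine, not from the patch) of the shape the vectorizer prefers: compute the invariant product once as a scalar, so the loop only needs a single broadcast of `t` instead of a vector multiply.

```c
#define N 1024

/* The invariant b[0] * c[0] is computed once as a scalar multiply.
   The vectorizer then only has to broadcast t into a vector before
   the loop, instead of emitting an integer vector multiply (slow on
   plain SSE2) to produce the same invariant value.  */
void
add_scaled (char *restrict a, const char *restrict b, const char *restrict c)
{
  char t = (char) (b[0] * c[0]);
  for (int i = 0; i < N; ++i)
    a[i] = (char) (t + a[i]);
}
```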


thanks,
Cong


On Mon, Jan 13, 2014 at 5:37 AM, Richard Biener rguent...@suse.de wrote:
 On Wed, 27 Nov 2013, Jakub Jelinek wrote:

 On Wed, Nov 27, 2013 at 10:53:56AM +0100, Richard Biener wrote:
  Hmm.  I'm still thinking that we should handle this during the regular
  transform step.

 I wonder if it can't be done instead just in vectorizable_load,
 if LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo) and the load is
 invariant, just emit the (broadcasted) load not inside of the loop, but on
 the loop preheader edge.

 So this implements this suggestion, XFAILing the no longer handled cases.
 For example we get

   _94 = *b_8(D);
   vect_cst_.18_95 = {_94, _94, _94, _94};
   _99 = prolog_loop_adjusted_niters.9_132 * 4;
   vectp_a.22_98 = a_6(D) + _99;
   ivtmp.43_77 = (unsigned long) vectp_a.22_98;

   bb 13:
   # ivtmp.41_67 = PHI <ivtmp.41_70(3), 0(12)>
   # ivtmp.43_71 = PHI <ivtmp.43_69(3), ivtmp.43_77(12)>
   vect__10.19_97 = vect_cst_.18_95 + { 1, 1, 1, 1 };
   _76 = (void *) ivtmp.43_71;
   MEM[base: _76, offset: 0B] = vect__10.19_97;

 ...

 instead of having hoisted *b_8 + 1 as scalar computation.  Not sure
 why LIM doesn't hoist the vector variant later.

 vect__10.19_97 = vect_cst_.18_95 + vect_cst_.20_96;
   invariant up to level 1, cost 1.

 ah, the cost thing.  Should be improved to see that hoisting
 reduces the number of live SSA names in the loop.

 Eventually lower_vector_ssa could optimize vector to scalar
 code again ... (ick).

 Bootstrap / regtest running on x86_64.

 Comments?

 Thanks,
 Richard.

 2014-01-13  Richard Biener  rguent...@suse.de

 PR tree-optimization/58921
 PR tree-optimization/59006
 * tree-vect-loop-manip.c (vect_loop_versioning): Remove code
 hoisting invariant stmts.
 * tree-vect-stmts.c (vectorizable_load): Insert the splat of
 invariant loads on the preheader edge if possible.

 * gcc.dg/torture/pr58921.c: New testcase.
 * gcc.dg/torture/pr59006.c: Likewise.
 * gcc.dg/vect/pr58508.c: XFAIL no longer handled cases.

 Index: gcc/tree-vect-loop-manip.c
 ===
 *** gcc/tree-vect-loop-manip.c  (revision 206576)
 --- gcc/tree-vect-loop-manip.c  (working copy)
 *** vect_loop_versioning (loop_vec_info loop
 *** 2435,2507 
 }
   }

 -
 -   /* Extract load statements on memrefs with zero-stride accesses.  */
 -
 -   if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
 - {
 -   /* In the loop body, we iterate each statement to check if it is a 
 load.
 -Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
 -then we will hoist the load statement to the loop preheader.  */
 -
 -   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
 -   int nbbs = loop->num_nodes;
 -
 -   for (int i = 0; i  nbbs; ++i)
 -   {
 - for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]);
 -  !gsi_end_p (si);)
 -   {
 - gimple stmt = gsi_stmt (si);
 - stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 - struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
 -
 - if (is_gimple_assign (stmt)
 -      && (!dr
 - || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr)))))
 -   {
 - bool hoist = true;
 - ssa_op_iter iter;
 - tree var;
 -
 - /* We hoist a statement if all SSA uses in it are defined
 -outside of the loop.  */
 - FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
 -   {
 - gimple def = SSA_NAME_DEF_STMT (var);
 - if (!gimple_nop_p (def)
 -      && flow_bb_inside_loop_p (loop, gimple_bb (def)))
 -   {
 - hoist = false;
 - break;
 -   }
 -   }
 -
 - if (hoist)
 -   {
 - if (dr)
 -   gimple_set_vuse (stmt, NULL);
 -
 - gsi_remove (&si, false);
 - gsi_insert_on_edge_immediate (loop_preheader_edge (loop), stmt);
 

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-12-17 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou co...@google.com wrote:
 Hi Richard

 Could you please take a look at this patch and see if it is ready for
 the trunk? The patch is pasted as a text file here again.

 Thank you very much!


 Cong


 On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
 Hi James

 Sorry for the late reply.


 On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  of the loop (if one instruction can do this), but you have to make
  sure other elements in the vector should remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior as usad. Actually I copied the text from there and did some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
 a value of the same (widened) type as arg3.



 I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
 mentioned it in tree.def).
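
As a reader's aid, here is a scalar sketch of my own (the name `sad_u8` is not from the patch) of the per-element computation the SAD_EXPR expansion above describes:

```c
/* Scalar semantics of SAD_EXPR: widen each pair of elements, subtract,
   take the absolute value, and accumulate into the wider result type.
   This mirrors the WIDEN_MINUS_EXPR / ABS_EXPR / PLUS_EXPR sequence.  */
int
sad_u8 (const unsigned char *a, const unsigned char *b, int n)
{
  int acc = 0;
  for (int i = 0; i < n; ++i)
    {
      int d = (int) a[i] - (int) b[i];  /* signed widening minus */
      acc += d < 0 ? -d : d;            /* abs, then widened accumulate */
    }
  return acc;
}
```

The vectorized form (e.g. x86 PSADBW or ARM VABAL) computes the same sums lane-wise, with the final cross-lane reduction handled by the vectorizer's usual reduction epilogue.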


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


 It has been 8 years... I have no idea why that patch was never
 accepted. There is not even a reply in that thread. But I believe the
 SAD pattern is very important to recognize. ARM also provides
 instructions for it.


 Thank you for your comment again!


 thanks,
 Cong



 Thanks,
 James



Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-12-17 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Dec 2, 2013 at 5:02 PM, Cong Hou co...@google.com wrote:
 Any comment on this patch?


 thanks,
 Cong


 On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou co...@google.com wrote:
 On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 21 Nov 2013, Cong Hou wrote:

 On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote:

 On Thu, 21 Nov 2013, Cong Hou wrote:

 While I added the new define_insn_and_split for vec_merge, a bug is
 exposed: in config/i386/sse.md, [ define_expand xop_vmfrczmode2 ]
 only takes one input, but the corresponding builtin functions have two
 inputs, which are shown in i386.c:

  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
 __builtin_ia32_vfrczss, IX86_BUILTIN_VFRCZSS, UNKNOWN,
 (int)MULTI_ARG_2_SF },
  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
 __builtin_ia32_vfrczsd, IX86_BUILTIN_VFRCZSD, UNKNOWN,
 (int)MULTI_ARG_2_DF },

 In consequence, the ix86_expand_multi_arg_builtin() function tries to
 check two args but based on the define_expand of xop_vmfrczmode2,
 the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
 incorrect (because it only needs one input).

 The patch below fixed this issue.

 Bootstrapped and tested on an x86-64 machine. Note that this patch
 should be applied before the one I sent earlier (sorry for sending
 them in wrong order).



 This is PR 56788. Your patch seems strange to me and I don't think it
 fixes the real issue, but I'll let more knowledgeable people answer.



 Thank you for pointing out the bug report. This patch is not intended
 to fix PR56788.


 IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
 associated builtin, which would solve your issue as well.


 I agree. Then I will wait until your patch is merged into the trunk;
 otherwise my patch cannot pass the tests.




 For your function:

 #include <x86intrin.h>
 __m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
 }

 Note that the second parameter is ignored intentionally, but the
 prototype of this function contains two parameters. My fix is
 explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
 three operands instead of two, to let it have the correct information
 in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
 match the type of the second parameter in the builtin function in
 ix86_expand_multi_arg_builtin().


 I disagree that this is intentional, it is a bug. AFAIK there is no AMD
 documentation that could be used as a reference for what _mm_frcz_sd is
 supposed to do. The only existing documentations are by Microsoft (which
 does *not* ignore the second argument) and by LLVM (which has a single
 argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
 single argument, and if necessary we'll use 2 builtins to implement
 _mm_frcz_sd.



 I also only found the one by Microsoft. If the second argument is
 ignored, we could just remove it, as long as there is no standard
 that requires two arguments. Hopefully it won't break current projects
 using _mm_frcz_sd.

 Thank you for your comments!


 Cong


 --
 Marc Glisse


Re: [PATCH] Enhancing the widen-mult pattern in vectorization.

2013-12-06 Thread Cong Hou
After further reviewing this patch, I found I don't have to change the
code in tree-vect-stmts.c to allow further type conversion after
widen-mult operation. Instead, I detect the following pattern in
vect_recog_widen_mult_pattern():

T1 a, b;
ai = (T2) a;
bi = (T2) b;
c = ai * bi;

where T2 is more than double the size of T1 (e.g. T1 is char and T2 is int).

In this case I just create a new type T3 whose size is double of the
size of T1, then get an intermediate result of type T3 from
widen-mult. Then I add a new statement to STMT_VINFO_PATTERN_DEF_SEQ
converting the result into type T2.

This strategy makes the patch cleaner.
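
In source terms, the recognized shape becomes something like the following illustrative sketch (the function name and array setup are mine, not the patch's output):

```c
#define N 64
signed char a[N], b[N];
int c[N];

/* What the enhanced pattern conceptually produces for c[i] = a[i] * b[i]:
   a widening multiply char -> short (T3, double the size of T1), followed
   by a separate conversion short -> int (T2).  The product of two signed
   chars always fits in a short, so no bits are lost.  */
void
widen_mult_sketch (void)
{
  for (int i = 0; i < N; ++i)
    {
      short t = (short) a[i] * (short) b[i];  /* widen-mult step */
      c[i] = t;                               /* T3 -> T2 conversion */
    }
}
```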

Bootstrapped and tested on an x86-64 machine.


thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index f298c0b..12990b2 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-12-02  Cong Hou  co...@google.com
+
+ * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance
+ the widen-mult pattern by handling two operands with different
+ sizes, and operands whose size is smaller than half of the result
+ type.
+
 2013-11-22  Jakub Jelinek  ja...@redhat.com

  PR sanitizer/59061
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 12d2c90..611ae1c 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-12-02  Cong Hou  co...@google.com
+
+ * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test.
+ * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test.
+
 2013-11-22  Jakub Jelinek  ja...@redhat.com

  * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
new file mode 100644
index 0000000..9f9081b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
@@ -0,0 +1,48 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+int result[N];
+
+/* unsigned char * short -> int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i<len; i++) {
+result[i] = X[i] * Y[i];
+  }
+}
+
+int main (void)
+{
+  int i;
+
+  check_vect ();
+
+  for (i=0; i<N; i++) {
+X[i] = i;
+Y[i] = 64-i;
+__asm__ volatile ();
+  }
+
+  foo1 (N);
+
+  for (i=0; i<N; i++) {
+if (result[i] != X[i] * Y[i])
+  abort ();
+  }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" {
target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c
b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c
new file mode 100644
index 0000000..12c4692
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c
@@ -0,0 +1,48 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned int result[N];
+
+/* unsigned char -> unsigned int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i<len; i++) {
+result[i] = X[i] * Y[i];
+  }
+}
+
+int main (void)
+{
+  int i;
+
+  check_vect ();
+
+  for (i=0; i<N; i++) {
+X[i] = i;
+Y[i] = 64-i;
+__asm__ volatile ();
+  }
+
+  foo1 (N);
+
+  for (i=0; i<N; i++) {
+if (result[i] != X[i] * Y[i])
+  abort ();
+  }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { vect_widen_mult_qi_to_hi || vect_unpack } } } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
detected" 1 "vect" { target vect_widen_mult_qi_to_hi_pattern } } } */
+/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" {
target vect_widen_mult_qi_to_hi_pattern } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 7823cc3..f412e2d 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -529,7 +529,8 @@ vect_handle_widen_op_by_const (gimple stmt, enum
tree_code code,

Try to find the following pattern:

- type a_t, b_t;
+ type1 a_t;
+ type2 b_t;
  TYPE a_T, b_T, prod_T;

  S1  a_t = ;
@@ -538,11 +539,12 @@ vect_handle_widen_op_by_const (gimple stmt, enum
tree_code code,
  S4  b_T = (TYPE) b_t;
  S5  prod_T = a_T * b_T;

-   where type 'TYPE' is at least double the size of type 'type'.
+   where type 'TYPE' is at least

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-12-05 Thread Cong Hou
Hi Richard

You mentioned that Micha has a patch pending that enables of zero-step
stores. What is the status of this patch? I could not find it through
searching Micha.

Thank you!


Cong


On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener rguent...@suse.de wrote:
 On Tue, 15 Oct 2013, Cong Hou wrote:

 Thank you for your reminder, Jeff! I just noticed Richard's comment. I
 have modified the patch according to that.

 The new patch is attached.

 (posting patches inline is easier for review, now you have to deal
 with no quoting markers ;))

 Comments inline.

 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 8a38316..2637309 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,8 @@
 +2013-10-15  Cong Hou  co...@google.com
 +
 +   * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
 +   statement that contains data refs with zero-step.
 +
  2013-10-14  David Malcolm  dmalc...@redhat.com

 * dumpfile.h (gcc::dump_manager): New class, to hold state
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 075d071..9d0f4a5 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,7 @@
 +2013-10-15  Cong Hou  co...@google.com
 +
 +   * gcc.dg/vect/pr58508.c: New test.
 +
  2013-10-14  Tobias Burnus  bur...@net-b.de

 PR fortran/58658
 diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c 
 b/gcc/testsuite/gcc.dg/vect/pr58508.c
 new file mode 100644
 index 0000000..cb22b50
 --- /dev/null
 +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
 @@ -0,0 +1,20 @@
 +/* { dg-do compile } */
 +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
 +
 +
 +/* The GCC vectorizer generates loop versioning for the following loop
 +   since there may exist aliasing between A and B.  The predicate checks
 +   if A may alias with B across all iterations.  Then for the loop in
 +   the true body, we can assert that *B is a loop invariant so that
 +   we can hoist the load of *B before the loop body.  */
 +
 +void foo (int* a, int* b)
 +{
 +  int i;
 +  for (i = 0; i < 10; ++i)
 +a[i] = *b + 1;
 +}
 +
 +
 +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
 +/* { dg-final { cleanup-tree-dump "vect" } } */
 diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
 index 574446a..f4fdec2 100644
 --- a/gcc/tree-vect-loop-manip.c
 +++ b/gcc/tree-vect-loop-manip.c
 @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
  }


 Note that applying this kind of transform at this point invalidates
 some of the earlier analysis the vectorizer performed (namely the
 def-kind which now effectively gets vect_external_def from
 vect_internal_def).  In this case it doesn't seem to cause any
 issues (we re-compute the def-kind every time we need it (how wasteful)).

 +  /* Extract load and store statements on pointers with zero-stride
 + accesses.  */
 +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
 +{
 +  /* In the loop body, we iterate each statement to check if it is a load
 +or store.  Then we check the DR_STEP of the data reference.  If
 +DR_STEP is zero, then we will hoist the load statement to the loop
 +preheader, and move the store statement to the loop exit.  */

 We don't move the store yet.  Micha has a patch pending that enables
 vectorization of zero-step stores.

 +  for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
 +  !gsi_end_p (si);)

 While technically ok now (vectorized loops contain a single basic block)
 please use LOOP_VINFO_BBS () to get at the vector of basic-blcoks
 and iterate over them like other code does.

 +   {
 + gimple stmt = gsi_stmt (si);
 + stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
 +
 + if (dr && integer_zerop (DR_STEP (dr)))
 +   {
 + if (DR_IS_READ (dr))
 +   {
 + if (dump_enabled_p ())
 +   {
 + dump_printf_loc
 + (MSG_NOTE, vect_location,
 +  "hoist the statement to outside of the loop ");

  "hoisting out of the vectorized loop: "

 + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
 + dump_printf (MSG_NOTE, "\n");
 +   }
 +
 + gsi_remove (si, false);
 + gsi_insert_on_edge_immediate (loop_preheader_edge (loop), 
 stmt);

 Note that this will result in a bogus VUSE on the stmt at this point which
 will be only fixed because of implementation details of loop versioning.
 Either get the correct VUSE from the loop header virtual PHI node
 preheader edge (if there is none then the current VUSE is the correct one
 to use) or clear it.

 +   }
 + /* TODO: We also consider vectorizing loops containing zero-step

[PATCH] Enhancing the widen-mult pattern in vectorization.

2013-12-03 Thread Cong Hou
Hi

The current widen-mult pattern only considers two operands with the
same size. However, operands with different sizes can also benefit
from this pattern. The following loop shows such an example:


char a[N];
short b[N];
int c[N];

for (int i = 0; i < N; ++i)
  c[i] = a[i] * b[i];


In this case, we can convert a[i] into short type then perform
widen-mult on b[i] and the converted value:


for (int i = 0; i < N; ++i) {
  short t = a[i];
  c[i] = t w* b[i];
}


This patch adds such support. In addition, the following loop fails to
be recognized as a widen-mult pattern because the widening operation
from char to int is not directly supported by the target:


char a[N], b[N];
int c[N];

for (int i = 0; i < N; ++i)
  c[i] = a[i] * b[i];


In this case, we can still perform widen-mult on a[i] and b[i], and
get a result of short type, then convert it to int:


char a[N], b[N];
int c[N];

for (int i = 0; i < N; ++i) {
  short t = a[i] w* b[i];
  c[i] = (int) t;
}


Currently GCC does not allow multi-step conversions for binary
widening operations. This patch removes this restriction and uses
VEC_UNPACK_LO_EXPR/VEC_UNPACK_HI_EXPR to arrange data after the
widen-mult is performed for the widen-mult pattern. This can reduce
several unpacking instructions (for this example, the number of
packings/unpackings is reduced from 12 to 8. For SSE2, the inefficient
multiplication between two V4SI vectors can also be avoided).

Bootstrapped and tested on an x86_64 machine.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index f298c0b..44ed204 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@
+2013-12-02  Cong Hou  co...@google.com
+
+ * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance
+ the widen-mult pattern by handling two operands with different
+ sizes.
+ * tree-vect-stmts.c (vectorizable_conversion): Allow multi-step
+ conversions after widening mult operation.
+ (supportable_widening_operation): Likewise.
+
 2013-11-22  Jakub Jelinek  ja...@redhat.com

  PR sanitizer/59061
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 12d2c90..611ae1c 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-12-02  Cong Hou  co...@google.com
+
+ * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test.
+ * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test.
+
 2013-11-22  Jakub Jelinek  ja...@redhat.com

  * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
new file mode 100644
index 000..9f9081b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
@@ -0,0 +1,48 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+int result[N];
+
+/* unsigned char * short -> int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i<len; i++) {
+result[i] = X[i] * Y[i];
+  }
+}
+
+int main (void)
+{
+  int i;
+
+  check_vect ();
+
+  for (i=0; i<N; i++) {
+X[i] = i;
+Y[i] = 64-i;
+__asm__ volatile ("");
+  }
+
+  foo1 (N);
+
+  for (i=0; i<N; i++) {
+if (result[i] != X[i] * Y[i])
+  abort ();
+  }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" {
target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c
b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c
new file mode 100644
index 000..51e9178
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c
@@ -0,0 +1,48 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned int result[N];
+
+/* unsigned char -> unsigned int widening-mult.  */
+__attribute__ ((noinline)) int
+foo1(int len) {
+  int i;
+
+  for (i=0; i<len; i++) {
+result[i] = X[i] * Y[i];
+  }
+}
+
+int main (void)
+{
+  int i;
+
+  check_vect ();
+
+  for (i=0; i<N; i++) {
+X[i] = i;
+Y[i] = 64-i;
+__asm__ volatile ("");
+  }
+
+  foo1 (N);
+
+  for (i=0; i<N; i++) {
+if (result[i] != X[i] * Y[i])
+  abort ();
+  }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { vect_widen_mult_qi_to_hi || vect_unpack } } } } */
+/* { dg-final { scan-tree-dump-times

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-12-02 Thread Cong Hou
Any comment on this patch?


thanks,
Cong


On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou co...@google.com wrote:
 On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 21 Nov 2013, Cong Hou wrote:

 On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote:

 On Thu, 21 Nov 2013, Cong Hou wrote:

 While I added the new define_insn_and_split for vec_merge, a bug is
  exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ]
 only takes one input, but the corresponding builtin functions have two
 inputs, which are shown in i386.c:

  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
  "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
 (int)MULTI_ARG_2_SF },
  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
  "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
 (int)MULTI_ARG_2_DF },

 In consequence, the ix86_expand_multi_arg_builtin() function tries to
  check two args but based on the define_expand of xop_vmfrcz<mode>2,
 the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
 incorrect (because it only needs one input).

 The patch below fixed this issue.

  Bootstrapped and tested on an x86-64 machine. Note that this patch
 should be applied before the one I sent earlier (sorry for sending
 them in wrong order).



 This is PR 56788. Your patch seems strange to me and I don't think it
 fixes the real issue, but I'll let more knowledgeable people answer.



 Thank you for pointing out the bug report. This patch is not intended
 to fix PR56788.


 IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
 associated builtin, which would solve your issue as well.


 I agree. Then I will wait until your patch is merged to the trunk,
 otherwise my patch could not pass the test.




 For your function:

  #include <x86intrin.h>
 __m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
 }

 Note that the second parameter is ignored intentionally, but the
 prototype of this function contains two parameters. My fix is
 explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
 three operands instead of two, to let it have the correct information
 in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
 match the type of the second parameter in the builtin function in
 ix86_expand_multi_arg_builtin().


 I disagree that this is intentional, it is a bug. AFAIK there is no AMD
 documentation that could be used as a reference for what _mm_frcz_sd is
 supposed to do. The only existing documentations are by Microsoft (which
 does *not* ignore the second argument) and by LLVM (which has a single
 argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
 single argument, and if necessary we'll use 2 builtins to implement
 _mm_frcz_sd.



  I also only found the one by Microsoft. If the second argument is
 ignored, we could just remove it, as long as there is no standard
 that requires two arguments. Hopefully it won't break current projects
 using _mm_frcz_sd.

 Thank you for your comments!


 Cong


 --
 Marc Glisse


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-12-02 Thread Cong Hou
Hi Richard

Could you please take a look at this patch and see if it is ready for
the trunk? The patch is pasted as a text file here again.

Thank you very much!


Cong


On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
 Hi James

 Sorry for the late reply.


 On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  of the loop (if one instruction can do this), but you have to make
  sure other elements in the vector should remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior as usad. Actually I copied the text from there and did some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
  value of the same (widened) type as arg3.



 I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
 mentioned it in tree.def).


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


  It has been 8 years... I have no idea why this patch was never
  accepted; there is not even a reply in that thread. But I believe the SAD
 pattern is very important to be recognized. ARM also provides
 instructions for it.


 Thank you for your comment again!


 thanks,
 Cong



 Thanks,
 James

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6bdaa31..37ff6c4 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,24 @@
-2013-11-01  Trevor Saunders  tsaund...@mozilla.com
+2013-10-29  Cong Hou  co...@google.com
+
+   * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+   pattern recognition.
+   (type_conversion_p): PROMOTION is true if it's a type promotion
+   conversion, and false otherwise.  Return true if the given expression
+   is a type conversion one.
+   * tree-vectorizer.h: Adjust the number of patterns.
+   * tree.def: Add SAD_EXPR.
+   * optabs.def: Add sad_optab.
+   * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+   * expr.c (expand_expr_real_2): Likewise.
+   * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+   * gimple.c (get_gimple_rhs_num_ops): Likewise.
+   * optabs.c (optab_for_tree_code): Likewise.
+   * tree-cfg.c (estimate_operator_cost): Likewise.
+   * tree-ssa-operands.c (get_expr_operands): Likewise.
+   * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+   * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+   * doc/generic.texi: Add document for SAD_EXPR.
+   * doc/md.texi: Add document for ssad and usad.
 
* function.c (reorder_blocks): Convert block_stack to a stack_vec.
* gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index fb05ce7..1f824fb 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp)
{
case COND_EXPR:
case DOT_PROD_EXPR:
+   case SAD_EXPR:
case WIDEN_MULT_PLUS_EXPR:
case WIDEN_MULT_MINUS_EXPR:
case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9094a1c..af73817 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7278,6 +7278,36 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2

Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.

2013-11-27 Thread Cong Hou
On Wed, Nov 27, 2013 at 1:53 AM, Richard Biener rguent...@suse.de wrote:
 On Fri, 22 Nov 2013, Cong Hou wrote:

 Hi

 Currently in GCC vectorization, some loop invariant may be detected
 after aliasing checks, which can be hoisted outside of the loop. The
 current method in GCC may break the information built during the
 analysis phase, causing some crash (see PR59006 and PR58921).

 This patch improves the loop invariant hoisting by delaying it until
 all statements are vectorized, thereby keeping all built information.
 But those loop invariant statements won't be vectorized, and if a
 variable is defined by one of those loop invariant, it is treated as
 an external definition.

 Bootstrapped and tested on an x86-64 machine.

 Hmm.  I'm still thinking that we should handle this during the regular
 transform step.

 Like with the following incomplete patch.  Missing is adjusting
 the rest of the vectorizable_* functions to handle the case where all defs
 are dt_external or constant by setting their own STMT_VINFO_DEF_TYPE to
 dt_external.  From the gcc.dg/vect/pr58508.c we get only 4 hoists
 instead of 8 because of this (I think).

 Also gcc.dg/vect/pr52298.c ICEs for yet unanalyzed reason.

 I can take over the bug if you like.

 Thanks,
 Richard.

 Index: gcc/tree-vect-data-refs.c
 ===
 *** gcc/tree-vect-data-refs.c   (revision 205435)
 --- gcc/tree-vect-data-refs.c   (working copy)
 *** again:
 *** 3668,3673 
 --- 3668,3682 
 }
   STMT_VINFO_STRIDE_LOAD_P (stmt_info) = true;
 }
 +   else if (loop_vinfo
 +   && integer_zerop (DR_STEP (dr)))
 +   {
 + /* All loads from a non-varying address will be disambiguated
 +by data-ref analysis or via a runtime alias check and thus
 +they will become invariant.  Force them to be vectorized
 +as external.  */
 + STMT_VINFO_DEF_TYPE (stmt_info) = vect_external_def;
 +   }
   }

 /* If we stopped analysis at the first dataref we could not analyze


I agree that setting the statement that loads a data-ref with zero
step as vect_external_def early at this point is a good idea. This
avoids two loop analyses seeing inconsistent def-info if we do this
later. Note with this change the following loop in PR59006 will not be
vectorized:


int a[8], b;

void fn1(void) {
  int c;
  for (; b; b++) {
int d = a[b];
c = a[0] ? d : 0;
a[b] = c;
  }
}

This is because the load to a[0] is now treated as an external def, in
which case vectype cannot be found for the condition of the
conditional expression, while vectorizable_condition requires that
comp_vectype should be set properly. We can treat it as a missed
optimization.



 Index: gcc/tree-vect-loop-manip.c
 ===
 *** gcc/tree-vect-loop-manip.c  (revision 205435)
 --- gcc/tree-vect-loop-manip.c  (working copy)
 *** vect_loop_versioning (loop_vec_info loop
 *** 2269,2275 

 /* Extract load statements on memrefs with zero-stride accesses.  */

 !   if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
   {
 /* In the loop body, we iterate each statement to check if it is a 
 load.
  Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
 --- 2269,2275 

 /* Extract load statements on memrefs with zero-stride accesses.  */

 !   if (0 && LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
   {
 /* In the loop body, we iterate each statement to check if it is a 
 load.
  Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
 Index: gcc/tree-vect-loop.c
 ===
 *** gcc/tree-vect-loop.c(revision 205435)
 --- gcc/tree-vect-loop.c(working copy)
 *** vect_transform_loop (loop_vec_info loop_
 *** 5995,6000 
 --- 5995,6020 
 }
 }

 + /* If the stmt is loop invariant simply move it.  */
 + if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_external_def)
 +   {
 + if (dump_enabled_p ())
 +   {
 + dump_printf_loc (MSG_NOTE, vect_location,
 +  "hoisting out of the vectorized loop: ");
 + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
 + dump_printf (MSG_NOTE, "\n");
 +   }
 + gsi_remove (si, false);
 + if (gimple_vuse (stmt))
 +   gimple_set_vuse (stmt, NULL);
 + basic_block new_bb;
 + new_bb = gsi_insert_on_edge_immediate (loop_preheader_edge 
 (loop),
 +stmt);
 + gcc_assert (!new_bb);
 + continue;
 +   }
 +
   /*  vectorize statement  */
   if (dump_enabled_p

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-22 Thread Cong Hou
On Fri, Nov 22, 2013 at 1:32 AM, Uros Bizjak ubiz...@gmail.com wrote:
 Hello!

 In consequence, the ix86_expand_multi_arg_builtin() function tries to
  check two args but based on the define_expand of xop_vmfrcz<mode>2,
 the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
 incorrect (because it only needs one input).

  ;; scalar insns
 -(define_expand "xop_vmfrcz<mode>2"
 +(define_expand "xop_vmfrcz<mode>3"
    [(set (match_operand:VF_128 0 "register_operand")
 (vec_merge:VF_128
   (unspec:VF_128
 [(match_operand:VF_128 1 "nonimmediate_operand")]
 UNSPEC_FRCZ)
 - (match_dup 3)
 + (match_operand:VF_128 2 "register_operand")
   (const_int 1)))]
 "TARGET_XOP"
  {
 -  operands[3] = CONST0_RTX (<MODE>mode);
 +  operands[2] = CONST0_RTX (<MODE>mode);
  })

 No, just use (match_dup 2) in the RTX in addition to operands[2]
 change. Do not rename patterns.


If I use match_dup 2, GCC still thinks this optab has one input
argument instead of two, which won't fix the current issue.

Marc suggested we should remove the second argument. This also works.

Thank you!


Cong



 Uros.


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-22 Thread Cong Hou
On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 21 Nov 2013, Cong Hou wrote:

 On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote:

 On Thu, 21 Nov 2013, Cong Hou wrote:

 While I added the new define_insn_and_split for vec_merge, a bug is
  exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ]
 only takes one input, but the corresponding builtin functions have two
 inputs, which are shown in i386.c:

  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
  "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
 (int)MULTI_ARG_2_SF },
  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
  "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
 (int)MULTI_ARG_2_DF },

 In consequence, the ix86_expand_multi_arg_builtin() function tries to
  check two args but based on the define_expand of xop_vmfrcz<mode>2,
 the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
 incorrect (because it only needs one input).

 The patch below fixed this issue.

  Bootstrapped and tested on an x86-64 machine. Note that this patch
 should be applied before the one I sent earlier (sorry for sending
 them in wrong order).



 This is PR 56788. Your patch seems strange to me and I don't think it
 fixes the real issue, but I'll let more knowledgeable people answer.



 Thank you for pointing out the bug report. This patch is not intended
 to fix PR56788.


 IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788
 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the
 associated builtin, which would solve your issue as well.


I agree. Then I will wait until your patch is merged to the trunk,
otherwise my patch could not pass the test.




 For your function:

  #include <x86intrin.h>
 __m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
 }

 Note that the second parameter is ignored intentionally, but the
 prototype of this function contains two parameters. My fix is
 explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
 three operands instead of two, to let it have the correct information
 in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
 match the type of the second parameter in the builtin function in
 ix86_expand_multi_arg_builtin().


 I disagree that this is intentional, it is a bug. AFAIK there is no AMD
 documentation that could be used as a reference for what _mm_frcz_sd is
 supposed to do. The only existing documentations are by Microsoft (which
 does *not* ignore the second argument) and by LLVM (which has a single
 argument). Whatever we chose for _mm_frcz_sd, the builtin should take a
 single argument, and if necessary we'll use 2 builtins to implement
 _mm_frcz_sd.



I also only found the one by Microsoft. If the second argument is
ignored, we could just remove it, as long as there is no standard
that requires two arguments. Hopefully it won't break current projects
using _mm_frcz_sd.

Thank you for your comments!


Cong


 --
 Marc Glisse


[PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.

2013-11-22 Thread Cong Hou
Hi

Currently in GCC vectorization, some loop invariant may be detected
after aliasing checks, which can be hoisted outside of the loop. The
current method in GCC may break the information built during the
analysis phase, causing some crash (see PR59006 and PR58921).

This patch improves the loop invariant hoisting by delaying it until
all statements are vectorized, thereby keeping all built information.
But those loop invariant statements won't be vectorized, and if a
variable is defined by one of those loop invariant, it is treated as
an external definition.

Bootstrapped and tested on an x86-64 machine.


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..0614bab 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,18 @@
+2013-11-22  Cong Hou  co...@google.com
+
+ PR tree-optimization/58921
+ PR tree-optimization/59006
+ * tree-vectorizer.h (struct _stmt_vec_info): New data member
+ loop_invariant.
+ * tree-vect-loop-manip.c (vect_loop_versioning): Delay hoisting loop
+ invariants until all statements are vectorized.
+ * tree-vect-loop.c (vect_hoist_loop_invariants): New function.
+ (vect_transform_loop): Hoist loop invariants after all statements
+ are vectorized.  Do not vectorize loop invariants stmts.
+ * tree-vect-stmts.c (vect_get_vec_def_for_operand): Treat a loop
+ invariant as an external definition.
+ (new_stmt_vec_info): Initialize new data member.
+
 2013-11-12  Jeff Law  l...@redhat.com

  * tree-ssa-threadedge.c (thread_around_empty_blocks): New
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09c7f20..447625b 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,10 @@
+2013-11-22  Cong Hou  co...@google.com
+
+ PR tree-optimization/58921
+ PR tree-optimization/59006
+ * gcc.dg/vect/pr58921.c: New test.
+ * gcc.dg/vect/pr59006.c: New test.
+
 2013-11-12  Balaji V. Iyer  balaji.v.i...@intel.com

  * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running
diff --git a/gcc/testsuite/gcc.dg/vect/pr58921.c
b/gcc/testsuite/gcc.dg/vect/pr58921.c
new file mode 100644
index 000..ee3694a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58921.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+int a[7];
+int b;
+
+void
+fn1 ()
+{
+  for (; b; b++)
+a[b] = ((a[b] <= 0) == (a[0] != 0));
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr59006.c
b/gcc/testsuite/gcc.dg/vect/pr59006.c
new file mode 100644
index 000..95d90a9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr59006.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+int a[8], b;
+
+void fn1 (void)
+{
+  int c;
+  for (; b; b++)
+{
+  int d = a[b];
+  c = a[0] ? d : 0;
+  a[b] = c;
+}
+}
+
+void fn2 ()
+{
+  for (; b >= 0; b++)
+a[b] = a[0] || b;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 15227856..3adc73d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2448,8 +2448,12 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
 {
   gimple def = SSA_NAME_DEF_STMT (var);
+  stmt_vec_info def_stmt_info;
+
   if (!gimple_nop_p (def)
-   && flow_bb_inside_loop_p (loop, gimple_bb (def)))
+   && flow_bb_inside_loop_p (loop, gimple_bb (def))
+   && !((def_stmt_info = vinfo_for_stmt (def))
+&& STMT_VINFO_LOOP_INVARIANT_P (def_stmt_info)))
  {
   hoist = false;
   break;
@@ -2458,21 +2462,8 @@ vect_loop_versioning (loop_vec_info loop_vinfo,

   if (hoist)
 {
-  if (dr)
- gimple_set_vuse (stmt, NULL);
-
-  gsi_remove (si, false);
-  gsi_insert_on_edge_immediate (loop_preheader_edge (loop),
-stmt);
-
-  if (dump_enabled_p ())
- {
-  dump_printf_loc
-  (MSG_NOTE, vect_location,
-   "hoisting out of the vectorized loop: ");
-  dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
-  dump_printf (MSG_NOTE, "\n");
- }
+  STMT_VINFO_LOOP_INVARIANT_P (stmt_info) = true;
+  gsi_next (si);
   continue;
 }
  }
@@ -2481,6 +2472,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
  }
 }

+
   /* End loop-exit-fixes after versioning.  */

   if (cond_expr_stmt_list)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 292e771..148f9f1 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5572,6 +5572,49 @@ vect_loop_kill_debug_uses (struct loop *loop,
gimple stmt)
 }
 }

+/* Find all loop invariants detected after alias checks, and hoist them
+   before the loop preheader.  */
+
+static void
+vect_hoist_loop_invariants (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-21 Thread Cong Hou
On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 21 Nov 2013, Cong Hou wrote:

 While I added the new define_insn_and_split for vec_merge, a bug is
 exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ]
 only takes one input, but the corresponding builtin functions have two
 inputs, which are shown in i386.c:

  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2,
 "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN,
 (int)MULTI_ARG_2_SF },
  { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2,
 "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN,
 (int)MULTI_ARG_2_DF },

 In consequence, the ix86_expand_multi_arg_builtin() function tries to
 check two args but based on the define_expand of xop_vmfrcz<mode>2,
 the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be
 incorrect (because it only needs one input).

 The patch below fixed this issue.

 Bootstrapped and tested on an x86-64 machine. Note that this patch
 should be applied before the one I sent earlier (sorry for sending
 them in wrong order).


 This is PR 56788. Your patch seems strange to me and I don't think it
 fixes the real issue, but I'll let more knowledgeable people answer.


Thank you for pointing out the bug report. This patch is not intended
to fix PR56788. For your function:

#include <x86intrin.h>
__m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
}

Note that the second parameter is ignored intentionally, but the
prototype of this function contains two parameters. My fix is
explicitly telling GCC that the optab xop_vmfrczv4sf3 should have
three operands instead of two, to let it have the correct information
in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to
match the type of the second parameter in the builtin function in
ix86_expand_multi_arg_builtin().


thanks,
Cong



 --
 Marc Glisse


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-20 Thread Cong Hou
Ping...


thanks,
Cong


On Fri, Nov 15, 2013 at 9:52 AM, Cong Hou co...@google.com wrote:
 Any more comments?



 thanks,
 Cong


 On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou co...@google.com wrote:
 Ping?


 thanks,
 Cong


 On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
 Hi James

 Sorry for the late reply.


 On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  of the loop (if one instruction can do this), but you have to make
  sure other elements in the vector should remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior as usad. Actually I copied the text from there and did some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
  value of the same (widened) type as arg3.



 I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
 mentioned it in tree.def).


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


  It has been 8 years... I have no idea why this patch was never
  accepted; there is not even a reply in that thread. But I believe the SAD
 pattern is very important to be recognized. ARM also provides
 instructions for it.


 Thank you for your comment again!


 thanks,
 Cong



 Thanks,
 James



Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-19 Thread Cong Hou
On Tue, Nov 19, 2013 at 1:45 AM, Richard Biener rguent...@suse.de wrote:

 On Mon, 18 Nov 2013, Cong Hou wrote:

  I tried your method and it works well for doubles. But for float,
  there is an issue. For the following gimple code:
 
 c1 = a - b;
 c2 = a + b;
 c = VEC_PERM <c1, c2, [0,5,2,7]>
 
  It needs two instructions to implement the VEC_PERM operation in
  SSE2-4, one of which should be shufps, which is represented by
  the following pattern in RTL:
 
 
  (define_insn "sse_shufps_<mode>"
    [(set (match_operand:VI4F_128 0 "register_operand" "=x,x")
          (vec_select:VI4F_128
            (vec_concat:<ssedoublevecmode>
              (match_operand:VI4F_128 1 "register_operand" "0,x")
              (match_operand:VI4F_128 2 "nonimmediate_operand" "xm,xm"))
            (parallel [(match_operand 3 "const_0_to_3_operand")
                       (match_operand 4 "const_0_to_3_operand")
                       (match_operand 5 "const_4_to_7_operand")
                       (match_operand 6 "const_4_to_7_operand")])))]
  ...)
 
  Note that it contains two rtl instructions.

 It's a single instruction as far as combine is concerned (RTL
 instructions have arbitrary complexity).


Even if it is one instruction, we will end up with four RTL statements,
which still cannot be combined, as there are restrictions on combining
four instructions (they must be loads of constants or binary operations
involving a constant). Note that vec_select instead of vec_merge is used
here because currently vec_merge is emitted only if SSE4 is enabled (so
that blend instructions can be used; if you look at
ix86_expand_vec_perm_const_1() in i386.c, you can find that vec_merge
is generated in expand_vec_perm_1() with SSE4). Without SSE4 support,
in most cases a vec_merge statement cannot be translated into one SSE
instruction.



  Together with the minus, the plus,
  and one more shuffling instruction, we have at least five instructions
  for the addsub pattern. I think during the combine pass only four
  instructions are considered for combining, right? So unless we
  compress those five instructions into four or fewer, we cannot use
  this method for float values.

 At the moment addsubv4sf looks like

 (define_insn "sse3_addsubv4sf3"
   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
         (vec_merge:V4SF
           (plus:V4SF
             (match_operand:V4SF 1 "register_operand" "0,x")
             (match_operand:V4SF 2 "nonimmediate_operand" "xm,xm"))
           (minus:V4SF (match_dup 1) (match_dup 2))
           (const_int 10)))]

 to match this it's best to have the VEC_SHUFFLE retained as
 vec_merge and thus support arbitrary(?) vec_merge for the aid
 of combining until reload(?) after which we can split it.



You mean VEC_PERM (this is generated in GIMPLE by your patch)? Note
that, as I mentioned above, without SSE4 it is difficult to translate
VEC_PERM into vec_merge. Even if we can do it, we still need to define
a split to convert one vec_merge into two or more other statements
later. ADDSUB instructions are provided by SSE3, and I think we should
not rely on SSE4 to perform this transformation, right?

To sum up, if we use vec_select instead of vec_merge, we may have four
RTL statements for float types, in which case they cannot be combined.
If we use vec_merge, we need to define the split for it without SSE4
support, and we also need to change the behavior of
ix86_expand_vec_perm_const_1().


  What do you think?

 Besides addsub, are there other instructions that can be expressed
 similarly?  Thus, how far should the combiner pattern go?



I think your method is quite flexible. Besides blending add/sub, we
could blend other combinations of two operations, and even one
operation and a no-op. For example, consider vectorizing the complex
conjugate operation:

for (int i = 0; i < N; i += 2) {
  a[i] = b[i];
  a[i+1] = -b[i+1];
}

This loop is better vectorized by hybrid SLP. The second
statement has a unary minus operation but there is no operation in the
first one. We can improve our SLP grouping algorithm to let GCC SLP
vectorize it.


thanks,
Cong



 Richard.

 
 
 
  thanks,
  Cong
 
 
  On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener rguent...@suse.de wrote:
   On Thu, 14 Nov 2013, Cong Hou wrote:
  
   Hi
  
   This patch adds support for two non-isomorphic operations, addsub
   and subadd, to the SLP vectorizer. More non-isomorphic operations
   can be added later, but the limitation is that operations on
   even/odd elements must still be isomorphic. Once such an operation
   is detected, the opcode to be used in the vectorized code is stored
   and later used during statement transformation. Two new GIMPLE
   operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
   with new optabs for them. They are also documented.

   Target support for SSE/SSE2/SSE3/AVX is added for these two new
   operations on floating-point types. SSE3/AVX provide the ADDSUBPD
   and ADDSUBPS instructions. For SSE/SSE2, the two operations are
   emulated using two instructions (selectively negate, then add).
  
   With this patch the following

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
I tried your method and it works well for doubles. But for float,
there is an issue. For the following gimple code:

   c1 = a - b;
   c2 = a + b;
   c = VEC_PERM <c1, c2, [0,5,2,7]>

It needs two instructions to implement the VEC_PERM operation in
SSE2-4, one of which should be shufps, which is represented by
the following pattern in RTL:


(define_insn "sse_shufps_<mode>"
  [(set (match_operand:VI4F_128 0 "register_operand" "=x,x")
        (vec_select:VI4F_128
          (vec_concat:<ssedoublevecmode>
            (match_operand:VI4F_128 1 "register_operand" "0,x")
            (match_operand:VI4F_128 2 "nonimmediate_operand" "xm,xm"))
          (parallel [(match_operand 3 "const_0_to_3_operand")
                     (match_operand 4 "const_0_to_3_operand")
                     (match_operand 5 "const_4_to_7_operand")
                     (match_operand 6 "const_4_to_7_operand")])))]
...)

Note that it contains two rtl instructions. Together with the minus,
the plus, and one more shuffling instruction, we have at least five
instructions for the addsub pattern. I think during the combine pass
only four instructions are considered for combining, right? So unless
we compress those five instructions into four or fewer, we cannot use
this method for float values.

What do you think?




thanks,
Cong


On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 14 Nov 2013, Cong Hou wrote:

 Hi

 This patch adds support for two non-isomorphic operations, addsub
 and subadd, to the SLP vectorizer. More non-isomorphic operations
 can be added later, but the limitation is that operations on
 even/odd elements must still be isomorphic. Once such an operation
 is detected, the opcode to be used in the vectorized code is stored
 and later used during statement transformation. Two new GIMPLE
 operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
 with new optabs for them. They are also documented.

 Target support for SSE/SSE2/SSE3/AVX is added for these two new
 operations on floating-point types. SSE3/AVX provide the ADDSUBPD
 and ADDSUBPS instructions. For SSE/SSE2, the two operations are
 emulated using two instructions (selectively negate, then add).

 With this patch the following function will be SLP vectorized:


 float a[4], b[4], c[4];  // double also OK.

 void subadd ()
 {
   c[0] = a[0] - b[0];
   c[1] = a[1] + b[1];
   c[2] = a[2] - b[2];
   c[3] = a[3] + b[3];
 }

 void addsub ()
 {
   c[0] = a[0] + b[0];
   c[1] = a[1] - b[1];
   c[2] = a[2] + b[2];
   c[3] = a[3] - b[3];
 }


 Bootstrapped and tested on an x86-64 machine.

 I managed to do this without adding new tree codes or optabs by
 vectorizing the above as

c1 = a + b;
c2 = a - b;
c = VEC_PERM <c1, c2, the-proper-mask>

 which then matches sse3_addsubv4sf3 if you fix that pattern to
 not use vec_merge (or fix PR56766).  Doing it this way also
 means that the code is vectorizable if you don't have a HW
 instruction for that but can do the VEC_PERM efficiently.

 So, I'd like to avoid new tree codes and optabs whenever possible
 and here I've already proved (with a patch) that it is possible.
 Didn't have time to clean it up, and it likely doesn't apply anymore
 (and PR56766 blocks it but it even has a patch).

 Btw, this was PR56902 where I attached my patch.

 Richard.


 thanks,
 Cong





 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 2c0554b..656d5fb 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,31 @@
 +2013-11-14  Cong Hou  co...@google.com
 +
 + * tree-vect-slp.c (vect_create_new_slp_node): Initialize
 + SLP_TREE_OP_CODE.
 + (slp_supported_non_isomorphic_op): New function.  Check if the
 + non-isomorphic operation is supported or not.
 + (vect_build_slp_tree_1): Consider non-isomorphic operations.
 + (vect_build_slp_tree): Change argument.
 + * tree-vect-stmts.c (vectorizable_operation): Consider the opcode
 + for non-isomorphic operations.
 + * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs.
 + * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations.
 + * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and
 + VEC_SUBADD_EXPR.
 + * gimple-pretty-print.c (dump_binary_rhs): Likewise.
 + * optabs.c (optab_for_tree_code): Likewise.
 + * tree-cfg.c (verify_gimple_assign_binary): Likewise.
 + * tree-vectorizer.h (struct _slp_tree): New data member.
 + * config/i386/i386-protos.h (ix86_sse_expand_fp_addsub_operator):
 + New function.  Expand addsub/subadd operations for SSE2.
 + * config/i386/i386.c (ix86_sse_expand_fp_addsub_operator): Likewise.
 + * config/i386/sse.md (UNSPEC_SUBADD, UNSPEC_ADDSUB): New RTL operation.
 + (vec_subadd_v4sf3, vec_subadd_v2df3, vec_subadd_mode3,
 + vec_addsub_v4sf3, vec_addsub_v2df3, vec_addsub_mode3):
 + Expand addsub/subadd operations for SSE/SSE2/SSE3/AVX.
 + * doc/generic.texi (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New doc.
 + * doc/md.texi (vec_addsub_@var{m}3, vec_subadd_@var{m}3): New doc.
 +
  2013-11-12  Jeff Law  l...@redhat.com

   * tree-ssa-threadedge.c (thread_around_empty_blocks): New
 diff --git a/gcc/config/i386/i386-protos.h

Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
On Fri, Nov 15, 2013 at 1:20 AM, Uros Bizjak ubiz...@gmail.com wrote:
 Hello!

 This patch adds support for two non-isomorphic operations, addsub
 and subadd, to the SLP vectorizer. More non-isomorphic operations
 can be added later, but the limitation is that operations on
 even/odd elements must still be isomorphic. Once such an operation
 is detected, the opcode to be used in the vectorized code is stored
 and later used during statement transformation. Two new GIMPLE
 operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
 with new optabs for them. They are also documented.

 Target support for SSE/SSE2/SSE3/AVX is added for these two new
 operations on floating-point types. SSE3/AVX provide the ADDSUBPD
 and ADDSUBPS instructions. For SSE/SSE2, the two operations are
 emulated using two instructions (selectively negate, then add).

;; SSE3
UNSPEC_LDDQU
 +  UNSPEC_SUBADD
 +  UNSPEC_ADDSUB

 No! Please avoid unspecs.


OK, got it.



 +(define_expand "vec_subadd_v4sf3"
 +  [(set (match_operand:V4SF 0 "register_operand")
 +        (unspec:V4SF
 +          [(match_operand:V4SF 1 "register_operand")
 +           (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))]
 +  "TARGET_SSE"
 +{
 +  if (TARGET_SSE3)
 +    emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], operands[2]));
 +  else
 +    ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands);
 +  DONE;
 +})

 Make the expander pattern look like correspondig sse3 insn and:
 ...
 {
   if (!TARGET_SSE3)
 {
   ix86_sse_expand_fp_...();
   DONE;
 }
 }


You mean I should write two expanders for SSE and SSE3 respectively?

Thank you for your comment!



Cong



 Uros.


Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
On Fri, Nov 15, 2013 at 10:18 AM, Richard Earnshaw rearn...@arm.com wrote:
 On 15/11/13 02:06, Cong Hou wrote:
 Hi

 This patch adds support for two non-isomorphic operations, addsub
 and subadd, to the SLP vectorizer. More non-isomorphic operations
 can be added later, but the limitation is that operations on
 even/odd elements must still be isomorphic. Once such an operation
 is detected, the opcode to be used in the vectorized code is stored
 and later used during statement transformation. Two new GIMPLE
 operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
 with new optabs for them. They are also documented.


 Notwithstanding what Richi has already said on this subject, you
 certainly don't need both VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR.  The
 latter can always be formed by vec-negating the second operand and
 passing it to VEC_ADDSUB_EXPR.


Right. But I also considered targets without support for addsub
instructions. There we could still selectively negate the odd/even
elements using masks and then use PLUS_EXPR (at most 2 instructions).
If I implement VEC_SUBADD_EXPR by negating the second operand and then
using VEC_ADDSUB_EXPR, I end up with one more instruction.


thanks,
Cong



 R.




Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-18 Thread Cong Hou
On Mon, Nov 18, 2013 at 12:27 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Mon, Nov 18, 2013 at 9:15 PM, Cong Hou co...@google.com wrote:

 This patch adds support for two non-isomorphic operations, addsub
 and subadd, to the SLP vectorizer. More non-isomorphic operations
 can be added later, but the limitation is that operations on
 even/odd elements must still be isomorphic. Once such an operation
 is detected, the opcode to be used in the vectorized code is stored
 and later used during statement transformation. Two new GIMPLE
 operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
 with new optabs for them. They are also documented.

 Target support for SSE/SSE2/SSE3/AVX is added for these two new
 operations on floating-point types. SSE3/AVX provide the ADDSUBPD
 and ADDSUBPS instructions. For SSE/SSE2, the two operations are
 emulated using two instructions (selectively negate, then add).

 +(define_expand "vec_subadd_v4sf3"
 +  [(set (match_operand:V4SF 0 "register_operand")
 +        (unspec:V4SF
 +          [(match_operand:V4SF 1 "register_operand")
 +           (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))]
 +  "TARGET_SSE"
 +{
 +  if (TARGET_SSE3)
 +    emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], operands[2]));
 +  else
 +    ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands);
 +  DONE;
 +})

 Make the expander pattern look like correspondig sse3 insn and:
 ...
 {
   if (!TARGET_SSE3)
 {
   ix86_sse_expand_fp_...();
   DONE;
 }
 }


 You mean I should write two expanders for SSE and SSE3 respectively?

 No, please use the same approach as you did for the abs<mode>2
 expander.  For !TARGET_SSE3, call the helper function
 (ix86_sse_expand...); otherwise expand through the pattern. Also, it
 looks to me that you should partially expand in the pattern before
 calling the helper function, mainly to avoid a bunch of if (...)
 tests at the beginning of the helper function.



I know what you mean. Then I have to change the pattern being detected
for sse3_addsubv4sf3, so that it can handle ADDSUB_EXPR for SSE3.

Currently I am considering using Richard's method without creating new
tree nodes and optabs, based on pattern matching. I will handle SSE2
and SSE3 separately by define_expand and define_insn. The current
problem is that the pattern may contain more than four instructions
which cannot be processed by the combine pass.

I am considering how to reduce the number of instructions in the
pattern to four.

Thank you very much!


Cong



 Uros.


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-15 Thread Cong Hou
Any more comments?



thanks,
Cong


On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou co...@google.com wrote:
 Ping?


 thanks,
 Cong


 On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
 Hi James

 Sorry for the late reply.


 On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  the loop (if one instruction can do this), but you have to make
  sure the other elements in the vector remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior to usad. Actually I copied the text from there and made some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning
 a value of the same (widened) type as arg3.



 I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
 mentioned it in tree.def).


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


 It has been 8 years... I have no idea why that patch was never
 accepted; there is not even a reply in that thread. But I believe
 recognizing the SAD pattern is very important. ARM also provides
 instructions for it.


 Thank you for your comment again!


 thanks,
 Cong



 Thanks,
 James



Re: [PATCH] Do not set flag_complex_method to 2 for C++ by default.

2013-11-14 Thread Cong Hou
See the following code:


#include <complex>
using std::complex;

template<typename _Tp, typename _Up>
complex<_Tp>
mult_assign (complex<_Tp> __y, const complex<_Up>& __z)
{
  _Up _M_real = __y.real();
  _Up _M_imag = __y.imag();
  const _Tp __r = _M_real * __z.real() - _M_imag * __z.imag();
  _M_imag = _M_real * __z.imag() + _M_imag * __z.real();
  _M_real = __r;
  return __y;
}

void foo (complex<float> c1, complex<float> c2)
{ c1 *= c2; }

void bar (complex<float> c1, complex<float> c2)
{ mult_assign(c1, c2); }


The function mult_assign is written almost by copying the
implementation of operator*= from <complex>. They have exactly the
same behavior from the point of view of the source code. However, the
compiled results of foo() and bar() are different: foo() uses the
builtin function for the multiplication but bar() does not. Just
because of a name change, the final behavior changes? This should not
be how a compiler works.


thanks,
Cong


On Thu, Nov 14, 2013 at 10:17 AM, Andrew Pinski pins...@gmail.com wrote:
 On Thu, Nov 14, 2013 at 8:25 AM, Xinliang David Li davi...@google.com wrote:
 Can we revisit the decision for this? Here are the reasons:

 1) It seems that the motivation to make C++ consistent with c99 is to
 avoid confusing users who build the C source with both C and C++
 compilers. Why should C++'s default behavior be tuned for this niche
 case?

 It is not a niche case.  It is confusing for people who write C++
 code to rewrite it in C99 and find that C is much slower because of
 correctness.  I think you have this backwards: C++ should be
 consistent with C here.

 2) It is very confusing for users who see huge performance difference
 between compiler generated code for Complex multiplication vs manually
 expanded code

 I don't see why this is an issue if they understand how complex
 multiplication works for correctness.  I am sorry, but correctness
 over speed is a good argument for why this should stay this way.

 3) The default setting can also block potential vectorization
 opportunities for complex operations

 Yes so again this is about a correctness issue over a speed issue.

 4) GCC is about the only compiler which has this default -- very few
 user knows about GCC's strict default, and will think GCC performs
 poorly.


 Correctness over speed is better.  I am sorry GCC is the only one
 which gets it correct here.  If people don't like it, there is a flag
 to disable it.

 Thanks,
 Andrew Pinski


 thanks,

 David


 On Wed, Nov 13, 2013 at 9:07 PM, Andrew Pinski pins...@gmail.com wrote:
 On Wed, Nov 13, 2013 at 5:26 PM, Cong Hou co...@google.com wrote:
 This patch is for PR58963.

 In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html,
 the builtin function is used to perform complex multiplication and
 division. This is to comply with C99 standard, but I am wondering if
 C++ also needs this.

 There is no complex keyword in C++, and no content in the C++
 standard about the behavior of operations on complex types. The
 <complex> header is implemented entirely in source code, including
 complex multiplication and division. GCC should not do too much for
 them by using builtin calls by default (although we can set
 -fcx-limited-range to prevent GCC from doing this), which has a big
 impact on performance (there may be missed vectorization
 opportunities).

 In this patch flag_complex_method will not be set to 2 for C++.
 Bootstrapped and tested on an x86-64 machine.

 I think you need to look into this issue more deeply, as the original
 patch only enabled it for C99:
 http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01483.html .

 Digging a little deeper turns up
 http://gcc.gnu.org/ml/gcc/2007-07/msg00124.html which says that yes,
 C++ needs this.

 Thanks,
 Andrew Pinski



 thanks,
 Cong


 Index: gcc/c-family/c-opts.c
 ===
 --- gcc/c-family/c-opts.c (revision 204712)
 +++ gcc/c-family/c-opts.c (working copy)
 @@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc
opts->x_warn_write_strings = c_dialect_cxx ();
opts->x_flag_warn_unused_result = true;

 -  /* By default, C99-like requirements for complex multiply and divide.  */
 -  opts->x_flag_complex_method = 2;
 +  /* By default, C99-like requirements for complex multiply and divide.
 +     But for C++ this should not be required.  */
 +  if (c_language != clk_cxx && c_language != clk_objcxx)
 +    opts->x_flag_complex_method = 2;
  }

  /* Common initialization before calling option handlers.  */
 Index: gcc/c-family/ChangeLog
 ===
 --- gcc/c-family/ChangeLog (revision 204712)
 +++ gcc/c-family/ChangeLog (working copy)
 @@ -1,3 +1,8 @@
 +2013-11-13  Cong Hou  co...@google.com
 +
 + * c-opts.c (c_common_init_options_struct): Don't let C++ comply with
 + C99-like requirements for complex multiply and divide.
 +
  2013-11-12  Joseph Myers  jos...@codesourcery.com

   * c-common.c (c_common_reswords): Add _Thread_local.


[PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.

2013-11-14 Thread Cong Hou
Hi

This patch adds support for two non-isomorphic operations, addsub
and subadd, to the SLP vectorizer. More non-isomorphic operations
can be added later, but the limitation is that operations on
even/odd elements must still be isomorphic. Once such an operation
is detected, the opcode to be used in the vectorized code is stored
and later used during statement transformation. Two new GIMPLE
operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along
with new optabs for them. They are also documented.

Target support for SSE/SSE2/SSE3/AVX is added for these two new
operations on floating-point types. SSE3/AVX provide the ADDSUBPD
and ADDSUBPS instructions. For SSE/SSE2, the two operations are
emulated using two instructions (selectively negate, then add).

With this patch the following function will be SLP vectorized:


float a[4], b[4], c[4];  // double also OK.

void subadd ()
{
  c[0] = a[0] - b[0];
  c[1] = a[1] + b[1];
  c[2] = a[2] - b[2];
  c[3] = a[3] + b[3];
}

void addsub ()
{
  c[0] = a[0] + b[0];
  c[1] = a[1] - b[1];
  c[2] = a[2] + b[2];
  c[3] = a[3] - b[3];
}


Bootstrapped and tested on an x86-64 machine.


thanks,
Cong





diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..656d5fb 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,31 @@
+2013-11-14  Cong Hou  co...@google.com
+
+ * tree-vect-slp.c (vect_create_new_slp_node): Initialize
+ SLP_TREE_OP_CODE.
+ (slp_supported_non_isomorphic_op): New function.  Check if the
+ non-isomorphic operation is supported or not.
+ (vect_build_slp_tree_1): Consider non-isomorphic operations.
+ (vect_build_slp_tree): Change argument.
+ * tree-vect-stmts.c (vectorizable_operation): Consider the opcode
+ for non-isomorphic operations.
+ * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs.
+ * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations.
+ * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and
+ VEC_SUBADD_EXPR.
+ * gimple-pretty-print.c (dump_binary_rhs): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (verify_gimple_assign_binary): Likewise.
+ * tree-vectorizer.h (struct _slp_tree): New data member.
+ * config/i386/i386-protos.h (ix86_sse_expand_fp_addsub_operator):
+ New function.  Expand addsub/subadd operations for SSE2.
+ * config/i386/i386.c (ix86_sse_expand_fp_addsub_operator): Likewise.
+ * config/i386/sse.md (UNSPEC_SUBADD, UNSPEC_ADDSUB): New RTL operation.
+ (vec_subadd_v4sf3, vec_subadd_v2df3, vec_subadd_mode3,
+ vec_addsub_v4sf3, vec_addsub_v2df3, vec_addsub_mode3):
+ Expand addsub/subadd operations for SSE/SSE2/SSE3/AVX.
+ * doc/generic.texi (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New doc.
+ * doc/md.texi (vec_addsub_@var{m}3, vec_subadd_@var{m}3): New doc.
+
 2013-11-12  Jeff Law  l...@redhat.com

  * tree-ssa-threadedge.c (thread_around_empty_blocks): New
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index fdf9d58..b02b757 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -117,6 +117,7 @@ extern rtx ix86_expand_adjust_ufix_to_sfix_si (rtx, rtx *);
 extern enum ix86_fpcmp_strategy ix86_fp_comparison_strategy (enum rtx_code);
 extern void ix86_expand_fp_absneg_operator (enum rtx_code, enum machine_mode,
 rtx[]);
+extern void ix86_sse_expand_fp_addsub_operator (bool, enum machine_mode, rtx[]);
 extern void ix86_expand_copysign (rtx []);
 extern void ix86_split_copysign_const (rtx []);
 extern void ix86_split_copysign_var (rtx []);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 5287b49..76f38f5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18702,6 +18702,51 @@ ix86_expand_fp_absneg_operator (enum rtx_code code, enum machine_mode mode,
 emit_insn (set);
 }

+/* Generate code for addsub or subadd on fp vectors for sse/sse2.  The flag
+   SUBADD indicates if we are generating code for subadd or addsub.  */
+
+void
+ix86_sse_expand_fp_addsub_operator (bool subadd, enum machine_mode mode,
+rtx operands[])
+{
+  rtx mask;
+  rtx neg_mask32 = GEN_INT (0x80000000);
+  rtx neg_mask64 = GEN_INT ((HOST_WIDE_INT) 1 << 63);
+
+  switch (mode)
+{
+case V4SFmode:
+  if (subadd)
+ mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4,
+ neg_mask32, const0_rtx, neg_mask32, const0_rtx));
+  else
+ mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4,
+ const0_rtx, neg_mask32, const0_rtx, neg_mask32));
+  break;
+
+case V2DFmode:
+  if (subadd)
+ mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2,
+ neg_mask64, const0_rtx));
+  else
+ mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2,
+ const0_rtx, neg_mask64));
+  break;
+
+default:
+  gcc_unreachable ();
+}
+
+  rtx tmp = gen_reg_rtx (mode);
+  convert_move (tmp, mask, false);
+
+  rtx tmp2 = gen_reg_rtx (mode);
+  tmp2 = expand_simple_binop (mode, XOR, tmp, operands[2],
+  tmp2, 0, OPTAB_DIRECT);
+  expand_simple_binop (mode, PLUS, operands[1], tmp2

[PATCH] Do not set flag_complex_method to 2 for C++ by default.

2013-11-13 Thread Cong Hou
This patch is for PR58963.

In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html,
the builtin function is used to perform complex multiplication and
division. This is to comply with C99 standard, but I am wondering if
C++ also needs this.

There is no complex keyword in C++, and no content in the C++
standard about the behavior of operations on complex types. The
<complex> header is implemented entirely in source code, including
complex multiplication and division. GCC should not do too much for
them by using builtin calls by default (although we can set
-fcx-limited-range to prevent GCC from doing this), which has a big
impact on performance (there may be missed vectorization
opportunities).

In this patch flag_complex_method will not be set to 2 for C++.
Bootstrapped and tested on an x86-64 machine.


thanks,
Cong


Index: gcc/c-family/c-opts.c
===
--- gcc/c-family/c-opts.c (revision 204712)
+++ gcc/c-family/c-opts.c (working copy)
@@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc
   opts->x_warn_write_strings = c_dialect_cxx ();
   opts->x_flag_warn_unused_result = true;

-  /* By default, C99-like requirements for complex multiply and divide.  */
-  opts->x_flag_complex_method = 2;
+  /* By default, C99-like requirements for complex multiply and divide.
+     But for C++ this should not be required.  */
+  if (c_language != clk_cxx && c_language != clk_objcxx)
+    opts->x_flag_complex_method = 2;
 }

 /* Common initialization before calling option handlers.  */
Index: gcc/c-family/ChangeLog
===
--- gcc/c-family/ChangeLog (revision 204712)
+++ gcc/c-family/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2013-11-13  Cong Hou  co...@google.com
+
+ * c-opts.c (c_common_init_options_struct): Don't let C++ comply with
+ C99-like requirements for complex multiply and divide.
+
 2013-11-12  Joseph Myers  jos...@codesourcery.com

  * c-common.c (c_common_reswords): Add _Thread_local.


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-13 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote:
 Hi James

 Sorry for the late reply.


 On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  the loop (if one instruction can do this), but you have to make
  sure the other elements in the vector remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior to usad. Actually I copied the text from there and made some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
 value of the same (widened) type as arg3.



 I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
 mentioned it in tree.def).


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


 It has been 8 years... I have no idea why that patch was never
 accepted; there was not even a reply in that thread. But I believe it
 is very important that the SAD pattern be recognized. ARM also
 provides instructions for it.


 Thank you for your comment again!


 thanks,
 Cong



 Thanks,
 James



Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-12 Thread Cong Hou
Hi Jakub

Thank you for pointing it out. The updated patch is pasted below. I
will pay attention to it in the future.


thanks,
Cong




diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 3d9916d..32a6ff7 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-11-12  Cong Hou  co...@google.com
+
+   * gcc.dg/vect/pr58508.c: Remove dg-options as vect_int is indicated.
+
 2013-10-29  Cong Hou  co...@google.com

* gcc.dg/vect/pr58508.c: Update.
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
b/gcc/testsuite/gcc.dg/vect/pr58508.c
index fff7a04..c4921bb 100644
--- a/gcc/testsuite/gcc.dg/vect/pr58508.c
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -1,6 +1,5 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */


 /* The GCC vectorizer generates loop versioning for the following loop





On Tue, Nov 12, 2013 at 6:05 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Thu, Nov 07, 2013 at 06:24:55PM -0800, Cong Hou wrote:
 Ping. OK for the trunk?
 On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou co...@google.com wrote:
  It seems that on some platforms the loops in
  testsuite/gcc.dg/vect/pr58508.c may fail to be vectorized. This
  small patch adds { dg-require-effective-target vect_int } to make
  sure all loops can be vectorized.
  diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
  index 9d0f4a5..3d9916d 100644
  --- a/gcc/testsuite/ChangeLog
  +++ b/gcc/testsuite/ChangeLog
  @@ -1,3 +1,7 @@
  +2013-10-29  Cong Hou  co...@google.com
  +
  +   * gcc.dg/vect/pr58508.c: Update.
  +
   2013-10-15  Cong Hou  co...@google.com
 
  * gcc.dg/vect/pr58508.c: New test.
  diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
  b/gcc/testsuite/gcc.dg/vect/pr58508.c
  index 6484a65..fff7a04 100644
  --- a/gcc/testsuite/gcc.dg/vect/pr58508.c
  +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
  @@ -1,3 +1,4 @@
  +/* { dg-require-effective-target vect_int } */
   /* { dg-do compile } */
   /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */

 This isn't the only bug in the testcase.  Another one is using
 dg-options in gcc.dg/vect/, you should just leave that out,
 the default options already include those options, but explicit dg-options
 mean that other required options like -msse2 on i?86 aren't added.

 Jakub


[PATCH] [Vectorization] Fixing a bug in alias checks merger.

2013-11-12 Thread Cong Hou
The current alias check merger does not consider the DR_STEP of
data-refs when sorting them. For the following loop:

for (i = 0; i < N; ++i)
  a[i] = b[0] + b[i] + b[1];

the data refs b[0] and b[i] have the same DR_INIT and DR_OFFSET, and
after sorting the three DR pairs, the following order is a possible
result:

 (a[i], b[0]), (a[i], b[i]), (a[i], b[1])

This prevents the alias checks for (a[i], b[0]) and (a[i], b[1]) from
being merged.

This patch adds a comparison between the DR_STEPs of two data refs
during the sort.

The test case is also updated. The previous one used explicit
dg-options, which blocks the default options implied by the target
selector vect_int. The old test case also assumed a vector can hold at
least four ints, which may not be true on some targets.

The patch is pasted below. Bootstrapped and tested on an x86-64 machine.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2c0554b..5faa5ca 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,14 @@
+2013-11-12  Cong Hou  co...@google.com
+
+ * tree-vectorizer.h (struct dr_with_seg_len): Remove the base
+ address field as it can be obtained from dr.  Rename the struct.
+ * tree-vect-data-refs.c (comp_dr_with_seg_len_pair): Consider
+ steps of data references during sort.
+ (vect_prune_runtime_alias_test_list): Adjust with the change to
+ struct dr_with_seg_len.
+ * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks):
+ Adjust with the change to struct dr_with_seg_len.
+
 2013-11-12  Jeff Law  l...@redhat.com

  * tree-ssa-threadedge.c (thread_around_empty_blocks): New
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09c7f20..8075409 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-11-12  Cong Hou  co...@google.com
+
+ * gcc.dg/vect/vect-alias-check.c: Update.
+
 2013-11-12  Balaji V. Iyer  balaji.v.i...@intel.com

  * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running
diff --git a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
index 64a4e0c..c1bffed 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c
@@ -1,17 +1,17 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize --param=vect-max-version-for-alias-checks=2 -fdump-tree-vect-details" } */
+/* { dg-additional-options "--param=vect-max-version-for-alias-checks=2" } */

-/* A test case showing three potential alias checks between
-   a[i] and b[i], b[i+7], b[i+14]. With alias checks merging
-   enabled, those tree checks can be merged into one, and the
-   loop will be vectorized with vect-max-version-for-alias-checks=2.  */
+/* A test case showing four potential alias checks between a[i] and b[0], b[1],
+   b[i+1] and b[i+2].  With alias check merging enabled, those four checks
+   can be merged into two, and the loop will be vectorized with
+   vect-max-version-for-alias-checks=2.  */

 void foo (int *a, int *b)
 {
   int i;
   for (i = 0; i < 1000; ++i)
-a[i] = b[i] + b[i+7] + b[i+14];
+a[i] = b[0] + b[1] + b[i+1] + b[i+2];
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index c479775..7f0920d 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2620,7 +2620,7 @@ vect_analyze_data_ref_accesses (loop_vec_info
loop_vinfo, bb_vec_info bb_vinfo)
 }


-/* Operator == between two dr_addr_with_seg_len objects.
+/* Operator == between two dr_with_seg_len objects.

This equality operator is used to make sure two data refs
are the same one so that we will consider to combine the
@@ -2628,62 +2628,51 @@ vect_analyze_data_ref_accesses (loop_vec_info
loop_vinfo, bb_vec_info bb_vinfo)
refs.  */

 static bool
-operator == (const dr_addr_with_seg_len &d1,
-	     const dr_addr_with_seg_len &d2)
+operator == (const dr_with_seg_len &d1,
+	     const dr_with_seg_len &d2)
 {
-  return operand_equal_p (d1.basic_addr, d2.basic_addr, 0)
-	 && compare_tree (d1.offset, d2.offset) == 0
-	 && compare_tree (d1.seg_len, d2.seg_len) == 0;
+  return operand_equal_p (DR_BASE_ADDRESS (d1.dr),
+			  DR_BASE_ADDRESS (d2.dr), 0)
+	 && compare_tree (d1.offset, d2.offset) == 0
+	 && compare_tree (d1.seg_len, d2.seg_len) == 0;
 }

-/* Function comp_dr_addr_with_seg_len_pair.
+/* Function comp_dr_with_seg_len_pair.

-   Comparison function for sorting objects of dr_addr_with_seg_len_pair_t
+   Comparison function for sorting objects of dr_with_seg_len_pair_t
so that we can combine aliasing checks in one scan.  */

 static int
-comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_)
+comp_dr_with_seg_len_pair (const void *p1_, const void *p2_)
 {
-  const dr_addr_with_seg_len_pair_t* p1 =
-(const dr_addr_with_seg_len_pair_t *) p1_;
-  const dr_addr_with_seg_len_pair_t* p2 =
-(const dr_addr_with_seg_len_pair_t *) p2_;
-
-  const

Re: [PATCH] Bug fix for PR59050

2013-11-11 Thread Cong Hou
Hi Jeff

I have committed the fix. Please update your repo.

Thank you!


Cong



On Mon, Nov 11, 2013 at 10:32 AM, Jeff Law l...@redhat.com wrote:
 On 11/11/13 02:32, Richard Biener wrote:

 On Fri, 8 Nov 2013, Cong Hou wrote:

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050

  This is my bad. I forgot to check the gfortran test results. With
  this patch the bug should be fixed (tested on x86-64).


 Ok.

 Btw, requirements are to bootstrap and test with all default
 languages enabled (that is, without any --enable-languages or
 --enable-languages=all).  That
  gets you c,c++,objc,java,fortran,lto and misses obj-c++, ada and go.
 I am personally using --enable-languages=all,ada,obj-c++.

 FWIW, I bootstrapped with Cong's patch to keep my own test results clean.
 So it's already been through those tests.

 If Cong doesn't get to it soon, I'll check it in myself.

 jeff



Re: [PATCH] Bug fix for PR59050

2013-11-11 Thread Cong Hou
Thank you for your advice! I will follow this instruction in future.


thanks,
Cong


On Mon, Nov 11, 2013 at 1:32 AM, Richard Biener rguent...@suse.de wrote:
 On Fri, 8 Nov 2013, Cong Hou wrote:

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050

 This is my bad. I forgot to check the gfortran test results. With
 this patch the bug should be fixed (tested on x86-64).

 Ok.

 Btw, requirements are to bootstrap and test with all default
 languages enabled (that is, without any --enable-languages or
 --enable-languages=all).  That
 gets you c,c++,objc,java,fortran,lto and misses obj-c++, ada and go.
 I am personally using --enable-languages=all,ada,obj-c++.

 Thanks,
 Richard.

 thanks,
 Cong


 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 90b01f2..e62c672 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,8 @@
 +2013-11-08  Cong Hou  co...@google.com
 +
 +   PR tree-optimization/59050
 +   * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
 +
  2013-11-07  Cong Hou  co...@google.com

 * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks):
 diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
 index b2a31b1..b7eb926 100644
 --- a/gcc/tree-vect-data-refs.c
 +++ b/gcc/tree-vect-data-refs.c
 @@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
 const void *p2_)
if (comp_res != 0)
 return comp_res;
  }
 -  if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
 +  else if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
  return -1;
 -  if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
 +  else if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
  return 1;
if (TREE_CODE (p12.offset) != INTEGER_CST
|| TREE_CODE (p22.offset) != INTEGER_CST)
 @@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
 const void *p2_)
if (comp_res != 0)
 return comp_res;
  }
 -  if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
 +  else if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
  return -1;
 -  if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
 +  else if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
  return 1;

return 0;



 --
 Richard Biener rguent...@suse.de
 SUSE / SUSE Labs
 SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
 GF: Jeff Hawn, Jennifer Guild, Felix Imend


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-11 Thread Cong Hou
Hi James

Sorry for the late reply.


On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
james.greenha...@arm.com wrote:
 On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
  Thank you for your detailed explanation.
 
  Once GCC detects a reduction operation, it will automatically
  accumulate all elements in the vector after the loop. In the loop the
  reduction variable is always a vector whose elements are reductions of
  corresponding values from other vectors. Therefore in your case the
  only instruction you need to generate is:
 
  VABAL   ops[3], ops[1], ops[2]
 
  It is OK if you accumulate the elements into one in the vector inside
  of the loop (if one instruction can do this), but you have to make
  sure other elements in the vector should remain zero so that the final
  result is correct.
 
  If you are confused about the documentation, check the one for
  udot_prod (just above usad in md.texi), as it has very similar
  behavior as usad. Actually I copied the text from there and did some
  changes. As those two instruction patterns are both for vectorization,
  their behavior should not be difficult to explain.
 
  If you have more questions or think that the documentation is still
  improper please let me know.

 Hi Cong,

 Thanks for your reply.

 I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
 DOT_PROD_EXPR and I see that the same ambiguity exists for
 DOT_PROD_EXPR. Can you please add a note in your tree.def
 that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = PLUS_EXPR (tmp2, arg3)

 or:

   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
   tmp2 = ABS_EXPR (tmp)
   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

 Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
 value of the same (widened) type as arg3.



I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
mentioned it in tree.def).


 Also, while looking for the history of DOT_PROD_EXPR I spotted this
 patch:

   [autovect] [patch] detect mult-hi and sad patterns
   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

 I wonder what the reason was for that patch to be dropped?


It has been 8 years... I have no idea why that patch was never
accepted; there was not even a reply in that thread. But I believe it
is very important that the SAD pattern be recognized. ARM also
provides instructions for it.


Thank you for your comment again!


thanks,
Cong



 Thanks,
 James

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6bdaa31..37ff6c4 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,24 @@
-2013-11-01  Trevor Saunders  tsaund...@mozilla.com
+2013-10-29  Cong Hou  co...@google.com
+
+   * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+   pattern recognition.
+   (type_conversion_p): PROMOTION is true if it's a type promotion
+   conversion, and false otherwise.  Return true if the given expression
+   is a type conversion one.
+   * tree-vectorizer.h: Adjust the number of patterns.
+   * tree.def: Add SAD_EXPR.
+   * optabs.def: Add sad_optab.
+   * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+   * expr.c (expand_expr_real_2): Likewise.
+   * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+   * gimple.c (get_gimple_rhs_num_ops): Likewise.
+   * optabs.c (optab_for_tree_code): Likewise.
+   * tree-cfg.c (estimate_operator_cost): Likewise.
+   * tree-ssa-operands.c (get_expr_operands): Likewise.
+   * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+   * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+   * doc/generic.texi: Add document for SAD_EXPR.
+   * doc/md.texi: Add document for ssad and usad.
 
* function.c (reorder_blocks): Convert block_stack to a stack_vec.
* gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index fb05ce7..1f824fb 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp)
{
case COND_EXPR:
case DOT_PROD_EXPR:
+   case SAD_EXPR:
case WIDEN_MULT_PLUS_EXPR:
case WIDEN_MULT_MINUS_EXPR:
case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9094a1c..af73817 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7278,6 +7278,36 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv4si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0

[PATCH] Bug fix for PR59050

2013-11-08 Thread Cong Hou
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050

This is my bad. I forgot to check the gfortran test results. With
this patch the bug should be fixed (tested on x86-64).


thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 90b01f2..e62c672 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-11-08  Cong Hou  co...@google.com
+
+   PR tree-optimization/59050
+   * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
+
 2013-11-07  Cong Hou  co...@google.com

* tree-vect-loop-manip.c (vect_create_cond_for_alias_checks):
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b2a31b1..b7eb926 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
const void *p2_)
   if (comp_res != 0)
return comp_res;
 }
-  if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
+  else if (tree_int_cst_compare (p11.offset, p21.offset) < 0)
 return -1;
-  if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
+  else if (tree_int_cst_compare (p11.offset, p21.offset) > 0)
 return 1;
   if (TREE_CODE (p12.offset) != INTEGER_CST
   || TREE_CODE (p22.offset) != INTEGER_CST)
@@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_,
const void *p2_)
   if (comp_res != 0)
return comp_res;
 }
-  if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
+  else if (tree_int_cst_compare (p12.offset, p22.offset) < 0)
 return -1;
-  if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
+  else if (tree_int_cst_compare (p12.offset, p22.offset) > 0)
 return 1;

   return 0;


Re: [PATCH] Reducing number of alias checks in vectorization.

2013-11-08 Thread Cong Hou
Thank you for the report. I have submitted a bug fix patch waiting to
be reviewed.



thanks,
Cong


On Fri, Nov 8, 2013 at 5:26 AM, Dominique Dhumieres domi...@lps.ens.fr wrote:
 According to http://gcc.gnu.org/ml/gcc-regression/2013-11/msg00197.html
 revision 204538 is breaking several tests. On x86_64-apple-darwin* the
 failures I have looked at are of the kind

 /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03: In function 
 'nabla2_cart2d':
 /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03:272:0: 
 internal compiler error: tree check: expected integer_cst, have plus_expr in 
 tree_int_cst_lt, at tree.c:7083
function nabla2_cart2d (obj)

 TIA

 Dominique


Re: [PATCH] Bug fix for PR59050

2013-11-08 Thread Cong Hou
Yes, I think so. The bug is that the arguments of
tree_int_cst_compare() may not be constant integers. This patch should
take care of it.



thanks,
Cong


On Fri, Nov 8, 2013 at 12:06 PM, H.J. Lu hjl.to...@gmail.com wrote:
 On Fri, Nov 8, 2013 at 10:34 AM, Cong Hou co...@google.com wrote:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050

 This is my bad. I forgot to check the gfortran test results. With
 this patch the bug should be fixed (tested on x86-64).


 thanks,
 Cong


 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 90b01f2..e62c672 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,8 @@
 +2013-11-08  Cong Hou  co...@google.com
 +
 +   PR tree-optimization/59050
 +   * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
 +

 Many SPEC CPU 2000 tests failed with

 costab.c: In function 'HandleCoinc2':
 costab.c:1565:17: internal compiler error: tree check: expected
 integer_cst, have plus_expr in tree_int_cst_lt, at tree.c:7083
  voidHandleCoinc2 ( cos1, cos2, hdfactor )
  ^
 0xb6e084 tree_check_failed(tree_node const*, char const*, int, char const*, 
 ...)
 ../../src-trunk/gcc/tree.c:9477
 0xb6ffe4 tree_check
 ../../src-trunk/gcc/tree.h:2914
 0xb6ffe4 tree_int_cst_lt(tree_node const*, tree_node const*)
 ../../src-trunk/gcc/tree.c:7083
 0xb70020 tree_int_cst_compare(tree_node const*, tree_node const*)
 ../../src-trunk/gcc/tree.c:7093
 0xe53f1c comp_dr_addr_with_seg_len_pair
 ../../src-trunk/gcc/tree-vect-data-refs.c:2672
 0xe5cbb5 vec<dr_addr_with_seg_len_pair_t, va_heap,
 vl_embed>::qsort(int (*)(void const*, void const*))
 ../../src-trunk/gcc/vec.h:941
 0xe5cbb5 vec<dr_addr_with_seg_len_pair_t, va_heap, vl_ptr>::qsort(int
 (*)(void const*, void const*))
 ../../src-trunk/gcc/vec.h:1620
 0xe5cbb5 vect_prune_runtime_alias_test_list(_loop_vec_info*)
 ../../src-trunk/gcc/tree-vect-data-refs.c:2845
 0xb39382 vect_analyze_loop_2
 ../../src-trunk/gcc/tree-vect-loop.c:1716
 0xb39382 vect_analyze_loop(loop*)
 ../../src-trunk/gcc/tree-vect-loop.c:1807
 0xb4f78f vectorize_loops()
 ../../src-trunk/gcc/tree-vectorizer.c:360
 Please submit a full bug report,
 with preprocessed source if appropriate.
 Please include the complete backtrace with any bug report.
 See http://gcc.gnu.org/bugs.html for instructions.
 specmake[3]: *** [costab.o] Error 1
 specmake[3]: *** Waiting for unfinished jobs

 Will this patch fix them?


 --
 H.J.


Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-07 Thread Cong Hou
Ping. OK for the trunk?




thanks,
Cong


On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou co...@google.com wrote:
 It seems that on some platforms the loops in
 testsuite/gcc.dg/vect/pr58508.c may fail to be vectorized. This
 small patch adds { dg-require-effective-target vect_int } to make
 sure all loops can be vectorized.


 thanks,
 Cong


 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 9d0f4a5..3d9916d 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,7 @@
 +2013-10-29  Cong Hou  co...@google.com
 +
 +   * gcc.dg/vect/pr58508.c: Update.
 +
  2013-10-15  Cong Hou  co...@google.com

 * gcc.dg/vect/pr58508.c: New test.
 diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
 b/gcc/testsuite/gcc.dg/vect/pr58508.c
 index 6484a65..fff7a04 100644
 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c
 +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
 @@ -1,3 +1,4 @@
 +/* { dg-require-effective-target vect_int } */
  /* { dg-do compile } */
  /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-07 Thread Cong Hou
Now is this patch OK for the trunk? Thank you!



thanks,
Cong


On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote:
 Thank you for your detailed explanation.

 Once GCC detects a reduction operation, it will automatically
 accumulate all elements in the vector after the loop. In the loop the
 reduction variable is always a vector whose elements are reductions of
 corresponding values from other vectors. Therefore in your case the
 only instruction you need to generate is:

 VABAL   ops[3], ops[1], ops[2]

 It is OK if you accumulate the elements into one in the vector inside
 of the loop (if one instruction can do this), but you have to make
 sure other elements in the vector should remain zero so that the final
 result is correct.

 If you are confused about the documentation, check the one for
 udot_prod (just above usad in md.texi), as it has very similar
 behavior as usad. Actually I copied the text from there and did some
 changes. As those two instruction patterns are both for vectorization,
 their behavior should not be difficult to explain.

 If you have more questions or think that the documentation is still
 improper please let me know.

 Thank you very much!


 Cong


 On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
 On Mon, Nov 04, 2013 at 06:30:55PM +, Cong Hou wrote:
 On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
  On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote:
  diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
  index 2a5a2e1..8f5d39a 100644
  --- a/gcc/doc/md.texi
  +++ b/gcc/doc/md.texi
  @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
  Operand 3 is of a mode equal or
   wider than the mode of the product. The result is placed in operand 0, 
  which
   is of the same mode as operand 3.
 
  +@cindex @code{ssad@var{m}} instruction pattern
  +@item @samp{ssad@var{m}}
  +@cindex @code{usad@var{m}} instruction pattern
  +@item @samp{usad@var{m}}
  +Compute the sum of absolute differences of two signed/unsigned elements.
  +Operand 1 and operand 2 are of the same mode. Their absolute 
  difference, which
  +is of a wider mode, is computed and added to operand 3. Operand 3 is of 
  a mode
  +equal or wider than the mode of the absolute difference. The result is 
  placed
  +in operand 0, which is of the same mode as operand 3.
  +
   @cindex @code{ssum_widen@var{m3}} instruction pattern
   @item @samp{ssum_widen@var{m3}}
   @cindex @code{usum_widen@var{m3}} instruction pattern
  diff --git a/gcc/expr.c b/gcc/expr.c
  index 4975a64..1db8a49 100644
 
  I'm not sure I follow, and if I do - I don't think it matches what
  you have implemented for i386.
 
  From your text description I would guess the series of operations to be:
 
v1 = widen (operands[1])
v2 = widen (operands[2])
v3 = abs (v1 - v2)
operands[0] = v3 + operands[3]
 
  But if I understand the behaviour of PSADBW correctly, what you have
  actually implemented is:
 
v1 = widen (operands[1])
v2 = widen (operands[2])
v3 = abs (v1 - v2)
v4 = reduce_plus (v3)
operands[0] = v4 + operands[3]
 
  To my mind, synthesizing the reduce_plus step will be wasteful for targets
  who do not get this for free with their Absolute Difference step. Imagine 
  a
  simple loop where we have synthesized the reduce_plus, we compute partial
  sums each loop iteration, though we would be better to leave the 
  reduce_plus
  step until after the loop. REDUC_PLUS_EXPR would be the appropriate
  Tree code for this.

 What do you mean when you use synthesizing here? For each pattern,
 the only synthesized operation is the one being returned from the
 pattern recognizer. In this case, it is USAD_EXPR. The recognition of
 reduce sum is necessary as we need corresponding prolog and epilog for
 reductions, which is already done before pattern recognition. Note
 that reduction is not a pattern but is a type of vector definition. A
 vectorization pattern can still be a reduction operation as long as
 STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
 can check the other two reduction patterns: widen_sum_pattern and
 dot_prod_pattern for reference.

 My apologies for not being clear. What I mean is, for a target which does
 not have a dedicated PSADBW instruction, the individual steps of
 'usadm' must be synthesized in such a way as to match the expected
 behaviour of the tree code.

 So, I must expand 'usadm' to a series of equivalent instructions
 as USAD_EXPR expects.

 If USAD_EXPR requires me to emit a reduction on each loop iteration,
 I think that will be inefficient compared to performing the reduction
 after the loop body.

 To a first approximation on ARM, I would expect from your description
 of 'usadm' that generating,

  VABAL   ops[3], ops[1], ops[2]
  (Vector widening Absolute Difference and Accumulate)

 would fulfil the requirements.

 But to match the behaviour you have

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-05 Thread Cong Hou
Thank you for your detailed explanation.

Once GCC detects a reduction operation, it will automatically
accumulate all elements in the vector after the loop. In the loop the
reduction variable is always a vector whose elements are reductions of
corresponding values from other vectors. Therefore in your case the
only instruction you need to generate is:

VABAL   ops[3], ops[1], ops[2]

It is OK if you accumulate the elements into one in the vector inside
of the loop (if one instruction can do this), but you have to make
sure other elements in the vector should remain zero so that the final
result is correct.

If you are confused about the documentation, check the one for
udot_prod (just above usad in md.texi), as it has very similar
behavior as usad. Actually I copied the text from there and did some
changes. As those two instruction patterns are both for vectorization,
their behavior should not be difficult to explain.

If you have more questions or think that the documentation is still
improper please let me know.

Thank you very much!


Cong


On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
james.greenha...@arm.com wrote:
 On Mon, Nov 04, 2013 at 06:30:55PM +, Cong Hou wrote:
 On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
 james.greenha...@arm.com wrote:
  On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote:
  diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
  index 2a5a2e1..8f5d39a 100644
  --- a/gcc/doc/md.texi
  +++ b/gcc/doc/md.texi
  @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
  Operand 3 is of a mode equal or
   wider than the mode of the product. The result is placed in operand 0, 
  which
   is of the same mode as operand 3.
 
  +@cindex @code{ssad@var{m}} instruction pattern
  +@item @samp{ssad@var{m}}
  +@cindex @code{usad@var{m}} instruction pattern
  +@item @samp{usad@var{m}}
  +Compute the sum of absolute differences of two signed/unsigned elements.
  +Operand 1 and operand 2 are of the same mode. Their absolute difference, 
  which
  +is of a wider mode, is computed and added to operand 3. Operand 3 is of 
  a mode
  +equal or wider than the mode of the absolute difference. The result is 
  placed
  +in operand 0, which is of the same mode as operand 3.
  +
   @cindex @code{ssum_widen@var{m3}} instruction pattern
   @item @samp{ssum_widen@var{m3}}
   @cindex @code{usum_widen@var{m3}} instruction pattern
  diff --git a/gcc/expr.c b/gcc/expr.c
  index 4975a64..1db8a49 100644
 
  I'm not sure I follow, and if I do - I don't think it matches what
  you have implemented for i386.
 
  From your text description I would guess the series of operations to be:
 
v1 = widen (operands[1])
v2 = widen (operands[2])
v3 = abs (v1 - v2)
operands[0] = v3 + operands[3]
 
  But if I understand the behaviour of PSADBW correctly, what you have
  actually implemented is:
 
v1 = widen (operands[1])
v2 = widen (operands[2])
v3 = abs (v1 - v2)
v4 = reduce_plus (v3)
operands[0] = v4 + operands[3]
 
  To my mind, synthesizing the reduce_plus step will be wasteful for targets
  who do not get this for free with their Absolute Difference step. Imagine a
  simple loop where we have synthesized the reduce_plus, we compute partial
  sums each loop iteration, though we would be better to leave the 
  reduce_plus
  step until after the loop. REDUC_PLUS_EXPR would be the appropriate
  Tree code for this.

 What do you mean when you use synthesizing here? For each pattern,
 the only synthesized operation is the one being returned from the
 pattern recognizer. In this case, it is USAD_EXPR. The recognition of
 reduce sum is necessary as we need corresponding prolog and epilog for
 reductions, which is already done before pattern recognition. Note
 that reduction is not a pattern but is a type of vector definition. A
 vectorization pattern can still be a reduction operation as long as
 STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
 can check the other two reduction patterns: widen_sum_pattern and
 dot_prod_pattern for reference.

 My apologies for not being clear. What I mean is, for a target which does
 not have a dedicated PSADBW instruction, the individual steps of
 'usadm' must be synthesized in such a way as to match the expected
 behaviour of the tree code.

 So, I must expand 'usadm' to a series of equivalent instructions
 as USAD_EXPR expects.

 If USAD_EXPR requires me to emit a reduction on each loop iteration,
 I think that will be inefficient compared to performing the reduction
 after the loop body.

 To a first approximation on ARM, I would expect from your description
 of 'usadm' that generating,

  VABAL   ops[3], ops[1], ops[2]
  (Vector widening Absolute Difference and Accumulate)

 would fulfil the requirements.

 But to match the behaviour you have implemented in the i386
 backend I would be required to generate:

 VABAL   ops[3], ops[1], ops[2]
 VPADD   ops[3], ops[3], ops[3] (add one set

Re: [PATCH] Handling == or != comparisons that may affect range test optimization.

2013-11-05 Thread Cong Hou
It seems there have been some changes in GCC. But if you change the type of
n to signed int, the issue appears again:


int foo(int n)
{
   if (n != 0)
   if (n != 1)
   if (n != 2)
   if (n != 3)
   if (n != 4)
 return ++n;
   return n;
}

Also, ifcombine suffers from the same issue here.


thanks,
Cong


On Tue, Nov 5, 2013 at 12:53 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Tue, Nov 05, 2013 at 01:23:00PM -0700, Jeff Law wrote:
 On 10/31/13 18:03, Cong Hou wrote:
 (This patch is for the bug 58728:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728)
 
 As in the bug report, consider the following loop:
 
 int foo(unsigned int n)
 {
if (n != 0)
if (n != 1)
if (n != 2)
if (n != 3)
if (n != 4)
  return ++n;
return n;
 }
 
 The range test optimization should be able to merge all those five
 conditions into one in the reassoc pass, but it fails to do so. The reason
 is that the phi arg of n is replaced by the constant it compares to in
 case of == or != comparisons (in the vrp pass). GCC checks there is no
 side effect on n between any two neighboring conditions by examining
 whether they define the same phi arg in the join node. But as the phi arg
 is replaced by a constant, the check fails.

 I can't reproduce this, at least not on x86_64-linux with -O2,
 the ifcombine pass already merges those.

 Jakub


Re: [PATCH] Handling == or != comparisons that may affect range test optimization.

2013-11-05 Thread Cong Hou
On Tue, Nov 5, 2013 at 12:23 PM, Jeff Law l...@redhat.com wrote:
 On 10/31/13 18:03, Cong Hou wrote:

 (This patch is for the bug 58728:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728)

 As in the bug report, consider the following loop:

 int foo(unsigned int n)
 {
if (n != 0)
if (n != 1)
if (n != 2)
if (n != 3)
if (n != 4)
  return ++n;
return n;
 }

 The range test optimization should be able to merge all those five
 conditions into one in the reassoc pass, but it fails to do so. The reason
 is that the phi arg of n is replaced by the constant it compares to in
 case of == or != comparisons (in the vrp pass). GCC checks there is no
 side effect on n between any two neighboring conditions by examining
 whether they define the same phi arg in the join node. But as the phi arg
 is replaced by a constant, the check fails.

 This patch deals with this situation by considering the existence of
 == or != comparisons, which is attached below (a text file is also
 attached with proper tabs). Bootstrap and make check both get passed.

 Any comment?


 +   bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR
 +                                 || gimple_cond_code (stmt) == EQ_EXPR)
 +     && TREE_CODE (phi_arg) == INTEGER_CST;
 +
 +   if (is_eq_expr)
 + {
 +   lhs = gimple_cond_lhs (stmt);
 +   rhs = gimple_cond_rhs (stmt);
 +
 +   if (operand_equal_p (lhs, phi_arg, 0))
 + {
 +   tree t = lhs;
 +   lhs = rhs;
 +   rhs = t;
 + }
 +   if (operand_equal_p (rhs, phi_arg, 0)
 +       && operand_equal_p (lhs, phi_arg2, 0))
 + continue;
 + }
 +
 +   gimple stmt2 = last_stmt (test_bb);
 +   bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND
 +     && (gimple_cond_code (stmt2) == NE_EXPR
 +         || gimple_cond_code (stmt2) == EQ_EXPR)
 +     && TREE_CODE (phi_arg2) == INTEGER_CST;
 +
 +   if (is_eq_expr2)
 + {
 +   lhs2 = gimple_cond_lhs (stmt2);
 +   rhs2 = gimple_cond_rhs (stmt2);
 +
 +   if (operand_equal_p (lhs2, phi_arg2, 0))
 + {
 +   tree t = lhs2;
 +   lhs2 = rhs2;
 +   rhs2 = t;
 + }
 +   if (operand_equal_p (rhs2, phi_arg2, 0)
 +       && operand_equal_p (lhs2, phi_arg, 0))
 + continue;
 + }

 Can you factor those two hunks of nearly identical code into a single
 function and call it twice?  I'm also curious if you really need the code to
 swap lhs/rhs.  When can the LHS of a cond be an integer constant?  Don't we
 canonicalize it as ssa_name COND constant?


I was not aware that the comparison between a variable and a constant
will always be canonicalized as ssa_name COND constant. Then I
will remove the swap, and as the code is much smaller, I think it may
not be necessary to create a function for them.



 I'd probably write the ChangeLog as:

 * tree-ssa-reassoc.c (suitable_cond_bb): Handle constant PHI
 operands resulting from propagation of edge equivalences.



OK, much better than mine ;)


 I'm also curious -- did this code show up in a particular benchmark, if so,
 which one?

I didn't find this problem from any benchmark, but from another
concern about loop upper bound estimation. Look at the following code:

int foo(unsigned int n, int r)
{
  int i;
  if (n > 0)
    if (n < 4)
    {
      do {
         --n;
         r *= 2;
      } while (n > 0);
    }
  return r + n;
}


In order to get the upper bound of the loop in this function, GCC
traverses the conditions n < 4 and n > 0 separately and tries to get any
useful information. But as those two conditions cannot be combined
into one due to this issue (note that n > 0 will be transformed into
n != 0), when GCC sees n < 4, it must consider the possibility that n may
be equal to 0, in which case the upper bound is UINT_MAX. If those two
conditions could be combined into one, which is n - 1 <= 2, then we could
get the correct upper bound of the loop.


thanks,
Cong


 jeff


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-04 Thread Cong Hou
On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
james.greenha...@arm.com wrote:
 On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote:
 diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
 index 2a5a2e1..8f5d39a 100644
 --- a/gcc/doc/md.texi
 +++ b/gcc/doc/md.texi
 @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
 Operand 3 is of a mode equal or
  wider than the mode of the product. The result is placed in operand 0, which
  is of the same mode as operand 3.

 +@cindex @code{ssad@var{m}} instruction pattern
 +@item @samp{ssad@var{m}}
 +@cindex @code{usad@var{m}} instruction pattern
 +@item @samp{usad@var{m}}
 +Compute the sum of absolute differences of two signed/unsigned elements.
 +Operand 1 and operand 2 are of the same mode. Their absolute difference, 
 which
 +is of a wider mode, is computed and added to operand 3. Operand 3 is of a 
 mode
 +equal or wider than the mode of the absolute difference. The result is 
 placed
 +in operand 0, which is of the same mode as operand 3.
 +
  @cindex @code{ssum_widen@var{m3}} instruction pattern
  @item @samp{ssum_widen@var{m3}}
  @cindex @code{usum_widen@var{m3}} instruction pattern
 diff --git a/gcc/expr.c b/gcc/expr.c
 index 4975a64..1db8a49 100644

 I'm not sure I follow, and if I do - I don't think it matches what
 you have implemented for i386.

 From your text description I would guess the series of operations to be:

   v1 = widen (operands[1])
   v2 = widen (operands[2])
   v3 = abs (v1 - v2)
   operands[0] = v3 + operands[3]

 But if I understand the behaviour of PSADBW correctly, what you have
 actually implemented is:

   v1 = widen (operands[1])
   v2 = widen (operands[2])
   v3 = abs (v1 - v2)
   v4 = reduce_plus (v3)
   operands[0] = v4 + operands[3]

 To my mind, synthesizing the reduce_plus step will be wasteful for targets
 who do not get this for free with their Absolute Difference step. Imagine a
 simple loop where we have synthesized the reduce_plus, we compute partial
 sums each loop iteration, though we would be better to leave the reduce_plus
 step until after the loop. REDUC_PLUS_EXPR would be the appropriate
 Tree code for this.

What do you mean when you use synthesizing here? For each pattern,
the only synthesized operation is the one being returned from the
pattern recognizer. In this case, it is USAD_EXPR. The recognition of
reduce sum is necessary as we need corresponding prolog and epilog for
reductions, which is already done before pattern recognition. Note
that reduction is not a pattern but is a type of vector definition. A
vectorization pattern can still be a reduction operation as long as
STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
can check the other two reduction patterns: widen_sum_pattern and
dot_prod_pattern for reference.

Thank you for your comment!


Cong


 I would prefer to see this Tree code not imply the reduce_plus.

 Thanks,
 James



[PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c

2013-11-01 Thread Cong Hou
It seems that on some platforms the loops in
testsuite/gcc.dg/vect/pr58508.c may not be vectorizable. This
small patch adds { dg-require-effective-target vect_int } so the
test only runs on targets where all its loops can be vectorized.


thanks,
Cong


diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 9d0f4a5..3d9916d 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29  Cong Hou  co...@google.com
+
+   * gcc.dg/vect/pr58508.c: Update.
+
 2013-10-15  Cong Hou  co...@google.com

* gcc.dg/vect/pr58508.c: New test.
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
b/gcc/testsuite/gcc.dg/vect/pr58508.c
index 6484a65..fff7a04 100644
--- a/gcc/testsuite/gcc.dg/vect/pr58508.c
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -1,3 +1,4 @@
+/* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
 /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-31 Thread Cong Hou
This update makes it safer. You showed me how to write better
expand code. Thank you for the improvement!



thanks,
Cong


On Thu, Oct 31, 2013 at 11:43 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Wed, Oct 30, 2013 at 9:02 PM, Cong Hou co...@google.com wrote:
 I have run check_GNU_style.sh on my patch.

 The patch is submitted. Thank you for your comments and help on this patch!

 I have committed a couple of fixes/improvements to your expander in
 i386.c. There is no need to check for the result of
 expand_simple_binop. Also, there is no guarantee that
 expand_simple_binop will expand to the target. It can return different
 RTX. Also, unhandled modes are now marked with gcc_unreachable.

 2013-10-31  Uros Bizjak  ubiz...@gmail.com

 * config/i386/i386.c (ix86_expand_sse2_abs): Rename function arguments.
 Use gcc_unreachable for unhandled modes.  Do not check results of
 expand_simple_binop.  If not expanded to target, move the result.

 Tested on x86_64-pc-linux-gnu and committed.

 Uros.


[PATCH] Handling == or != comparisons that may affect range test optimization.

2013-10-31 Thread Cong Hou
(This patch is for the bug 58728:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728)

As in the bug report, consider the following loop:

int foo(unsigned int n)
{
  if (n != 0)
  if (n != 1)
  if (n != 2)
  if (n != 3)
  if (n != 4)
return ++n;
  return n;
}

The range test optimization should be able to merge all those five
conditions into one in the reassoc pass, but it fails to do so. The reason
is that the phi arg of n is replaced by the constant it compares to in
case of == or != comparisons (in the vrp pass). GCC checks there is no
side effect on n between any two neighboring conditions by examining
whether they define the same phi arg in the join node. But as the phi arg
is replaced by a constant, the check fails.

This patch deals with this situation by considering the existence of
== or != comparisons, which is attached below (a text file is also
attached with proper tabs). Bootstrap and make check both get passed.

Any comment?


thanks,
Cong




diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..9247222 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,11 @@
+2013-10-31  Cong Hou  co...@google.com
+
+ PR tree-optimization/58728
 * tree-ssa-reassoc.c (suitable_cond_bb): Consider the situation
 that ==/!= comparisons between a variable and a constant may
 cause the phi arg of the variable to be substituted with the
 constant by prior passes, during range test optimization.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..44a5e70 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-31  Cong Hou  co...@google.com
+
+ PR tree-optimization/58728
+ * gcc.dg/tree-ssa/pr58728: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c
b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c
new file mode 100644
index 000..312aebc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc1-details" } */
+
+int foo (unsigned int n)
+{
+  if (n != 0)
+if (n != 1)
+  return ++n;
+  return n;
+}
+
+int bar (unsigned int n)
+{
+  if (n == 0)
+;
+  else if (n == 1)
+;
+  else
+return ++n;
+  return n;
+}
+
+
+/* { dg-final { scan-tree-dump-times "Optimizing range tests" 2 "reassoc1" } } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 6859518..bccf99f 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -2426,11 +2426,70 @@ suitable_cond_bb (basic_block bb, basic_block
test_bb, basic_block *other_bb,
   for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (gsi))
 {
   gimple phi = gsi_stmt (gsi);
+  tree phi_arg = gimple_phi_arg_def (phi, e->dest_idx);
+  tree phi_arg2 = gimple_phi_arg_def (phi, e2->dest_idx);
+
   /* If both BB and TEST_BB end with GIMPLE_COND, all PHI arguments
  corresponding to BB and TEST_BB predecessor must be the same.  */
-  if (!operand_equal_p (gimple_phi_arg_def (phi, e->dest_idx),
-                        gimple_phi_arg_def (phi, e2->dest_idx), 0))
- {
+  if (!operand_equal_p (phi_arg, phi_arg2, 0))
+  {
+ /* If the condition in BB or TEST_BB is an NE or EQ comparison like
+   if (n != N) or if (n == N), it is possible that the corresponding
+   def of n in the phi function is replaced by N.  We should still allow
+   range test optimization in this case.  */
+
+ tree lhs = NULL, rhs = NULL,
+ lhs2 = NULL, rhs2 = NULL;
+ bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR
+                               || gimple_cond_code (stmt) == EQ_EXPR)
+   && TREE_CODE (phi_arg) == INTEGER_CST;
+
+ if (is_eq_expr)
+  {
+lhs = gimple_cond_lhs (stmt);
+rhs = gimple_cond_rhs (stmt);
+
+if (operand_equal_p (lhs, phi_arg, 0))
+  {
+ tree t = lhs;
+ lhs = rhs;
+ rhs = t;
+  }
+if (operand_equal_p (rhs, phi_arg, 0)
+    && operand_equal_p (lhs, phi_arg2, 0))
+  continue;
+  }
+
+ gimple stmt2 = last_stmt (test_bb);
+ bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND
+   && (gimple_cond_code (stmt2) == NE_EXPR
+       || gimple_cond_code (stmt2) == EQ_EXPR)
+   && TREE_CODE (phi_arg2) == INTEGER_CST;
+
+ if (is_eq_expr2)
+  {
+lhs2 = gimple_cond_lhs (stmt2);
+rhs2 = gimple_cond_rhs (stmt2);
+
+if (operand_equal_p (lhs2, phi_arg2, 0))
+  {
+ tree t = lhs2;
+ lhs2 = rhs2;
+ rhs2 = t;
+  }
+if (operand_equal_p (rhs2, phi_arg2, 0)
+    && operand_equal_p (lhs2, phi_arg, 0))
+  continue;
+  }
+
+ if (is_eq_expr  is_eq_expr2)
+  {
+if (operand_equal_p (rhs, phi_arg, 0)
+    && operand_equal_p (rhs2, phi_arg2, 0)
+    && operand_equal_p (lhs, lhs2, 0))
+  continue;
+  }
+
   /* Otherwise, if one of the blocks doesn't end with GIMPLE_COND,
  one of the PHIs should have the lhs of the last stmt in
  that block as PHI arg

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
I found my problem: I put DONE outside of the if, not inside. You are
right. I have updated my patch.

I appreciate your comment and test on it!


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  co...@google.com
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+     value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0  tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..46e1df4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
(set_attr "mode" "DI")])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
        (abs:VI124_AVX2_48_AVX512F
          (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))]
@@ -8733,6 +8733,19 @@
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "<sseinsnmode>")])

+(define_expand "abs<mode>2"
+  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
+	(abs:VI124_AVX2_48_AVX512F
+	  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
+  "TARGET_SSE2"
+{
+  if (!TARGET_SSSE3)
+{
+  ix86_expand_sse2_abs (operands[0], operands[1]);
+  DONE;
+}
+})
+
 (define_insn "abs<mode>2"
   [(set (match_operand:MMXMODEI 0 "register_operand" "=y")
	(abs:MMXMODEI
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..cf5b942 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-22  Cong Hou  co...@google.com
+
+ PR target/58762
+ * gcc.dg/vect/pr58762.c: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58762.c
b/gcc/testsuite/gcc.dg/vect/pr58762.c
new file mode 100644
index 000..6468d0a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58762.c
@@ -0,0 +1,28 @@
+/* { dg-require-effective-target vect_int } */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize" } */
+
+void test1 (char* a, char* b)
+{
+  int i;
+  for (i = 0; i < 1; ++i)
+    a[i] = abs (b[i]);
+}
+
+void test2 (short* a, short* b)
+{
+  int i;
+  for (i = 0; i < 1; ++i)
+    a[i] = abs (b[i]);
+}
+
+void test3 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 1; ++i)
+    a[i] = abs (b[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 3 "vect" } } */

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
Forget to attach the patch file.



thanks,
Cong


On Wed, Oct 30, 2013 at 10:01 AM, Cong Hou co...@google.com wrote:
 I found my problem: I put DONE outside of the if, not inside. You are
 right. I have updated my patch.

 I appreciate your comment and test on it!


 thanks,
 Cong



 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 8a38316..84c7ab5 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,10 @@
 +2013-10-22  Cong Hou  co...@google.com
 +
 + PR target/58762
 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
 + * config/i386/i386.c (ix86_expand_sse2_abs): New function.
 + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
 +
  2013-10-14  David Malcolm  dmalc...@redhat.com

   * dumpfile.h (gcc::dump_manager): New class, to hold state
 diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
 index 3ab2f3a..ca31224 100644
 --- a/gcc/config/i386/i386-protos.h
 +++ b/gcc/config/i386/i386-protos.h
 @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
 rtx, rtx, bool, bool);
  extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
  extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
  extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
 +extern void ix86_expand_sse2_abs (rtx, rtx);

  /* In i386-c.c  */
  extern void ix86_target_macros (void);
 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
 index 02cbbbd..71905fc 100644
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
 gen_rtx_MULT (mode, op1, op2));
  }

 +void
 +ix86_expand_sse2_abs (rtx op0, rtx op1)
 +{
 +  enum machine_mode mode = GET_MODE (op0);
 +  rtx tmp0, tmp1;
 +
 +  switch (mode)
 +{
 +  /* For 32-bit signed integer X, the best way to calculate the absolute
 + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
 +  case V4SImode:
 + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
 +GEN_INT (GET_MODE_BITSIZE
 + (GET_MODE_INNER (mode)) - 1),
 +NULL, 0, OPTAB_DIRECT);
 + if (tmp0)
 +  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
 +  NULL, 0, OPTAB_DIRECT);
 + if (tmp0  tmp1)
 +  expand_simple_binop (mode, MINUS, tmp1, tmp0,
 +   op0, 0, OPTAB_DIRECT);
 + break;
 +
 +  /* For 16-bit signed integer X, the best way to calculate the absolute
 + value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
 +  case V8HImode:
 + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
 + if (tmp0)
 +  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
 +   OPTAB_DIRECT);
 + break;
 +
 +  /* For 8-bit signed integer X, the best way to calculate the absolute
 + value of X is min ((unsigned char) X, (unsigned char) (-X)),
 + as SSE2 provides the PMINUB insn.  */
 +  case V16QImode:
 + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
 + if (tmp0)
 +  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
 +   OPTAB_DIRECT);
 + break;
 +
 +  default:
 + break;
 +}
 +}
 +
  /* Expand an insert into a vector register through pinsr insn.
 Return true if successful.  */

 diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
 index c3f6c94..46e1df4 100644
 --- a/gcc/config/i386/sse.md
 +++ b/gcc/config/i386/sse.md
 @@ -8721,7 +8721,7 @@
 (set (attr prefix_rex) (symbol_ref x86_extended_reg_mentioned_p 
 (insn)))
 (set_attr mode DI)])

 -(define_insn "abs<mode>2"
 +(define_insn "*abs<mode>2"
[(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand =v)
   (abs:VI124_AVX2_48_AVX512F
(match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand vm)))]
 @@ -8733,6 +8733,19 @@
 (set_attr prefix maybe_vex)
 (set_attr mode sseinsnmode)])

 +(define_expand "abs<mode>2"
 +  [(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand)
 + (abs:VI124_AVX2_48_AVX512F
 +  (match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand)))]
 +  TARGET_SSE2
 +{
 +  if (!TARGET_SSSE3)
 +{
 +  ix86_expand_sse2_abs (operands[0], operands[1]);
 +  DONE;
 +}
 +})
 +
  (define_insn "abs<mode>2"
[(set (match_operand:MMXMODEI 0 register_operand =y)
   (abs:MMXMODEI
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 075d071..cf5b942 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,8 @@
 +2013-10-22  Cong Hou  co...@google.com
 +
 + PR target/58762
 + * gcc.dg/vect/pr58762.c: New test.
 +
  2013-10-14  Tobias Burnus  bur...@net-b.de

   PR fortran/58658
 diff --git a/gcc/testsuite/gcc.dg/vect/pr58762.c
 b/gcc/testsuite/gcc.dg/vect/pr58762.c
 new file mode 100644
 index 000..6468d0a
 --- /dev/null
 +++ b/gcc/testsuite/gcc.dg/vect/pr58762.c
 @@ -0,0 +1,28 @@
 +/* { dg-require-effective-target vect_int } */
 +/* { dg-do compile } */
 +/* { dg-options "-O2 -ftree-vectorize" } */
 +
 +void test1 (char* a, char* b)
 +{
 +  int i;
 +  for (i = 0; i < 1; ++i)
 +a[i] = abs (b[i]);
 +}
 +
 +void test2

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou co...@google.com wrote:
 I found my problem: I put DONE outside of the if, not inside. You are
 right. I have updated my patch.

 OK, great that we put things in order ;)

 Does this patch need some extra middle-end functionality? I was not
 able to vectorize char and short part of your patch.


In the original patch, I converted abs() on short and char values to
their own types by removing type casts. That is, originally char_val1
= abs(char_val2) will be converted to char_val1 = (char) abs((int)
char_val2) in the frontend, and I would like to convert it back to
char_val1 = abs(char_val2). But after several discussions, it seems
this conversion has some problems such as overflow concerns, and I
therefore removed that part.

Now you should still be able to vectorize abs(char) and abs(short), but
with packing and unpacking. Later I will consider writing pattern
recognizers for abs(char) and abs(short), and then the expanders for
abs(char)/abs(short) in this patch will be used during vectorization.



 Regarding the testcase - please put it to gcc.target/i386/ directory.
 There is nothing generic in the test, as confirmed by target-dependent
 scan test. You will find plenty of examples in the mentioned
 directory. I'd suggest to split the testcase in three files, and to
 simplify it to something like the testcase with global variables I
 used earlier.


I have done it. The test case is split into three for s8/s16/s32 in
gcc.target/i386.


Thank you!

Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  co...@google.com
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0  tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..46e1df4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr prefix_rex) (symbol_ref x86_extended_reg_mentioned_p (insn)))
(set_attr mode DI)])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand =v)
  (abs:VI124_AVX2_48_AVX512F
   (match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand vm)))]
@@ -8733,6 +8733,19 @@
(set_attr prefix maybe_vex)
(set_attr mode sseinsnmode)])

+(define_expand "abs<mode>2"

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
Also, as the current expanders for abs() on 8/16-bit integers are not
used at all, should I comment them out temporarily for now? Later I can
uncomment them once I finish the pattern recognizer.



thanks,
Cong


On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou co...@google.com wrote:
 I found my problem: I put DONE outside of if not inside. You are
 right. I have updated my patch.

 OK, great that we put things in order ;)

 Does this patch need some extra middle-end functionality? I was not
 able to vectorize char and short part of your patch.

 Regarding the testcase - please put it to gcc.target/i386/ directory.
 There is nothing generic in the test, as confirmed by target-dependent
 scan test. You will find plenty of examples in the mentioned
 directory. I'd suggest to split the testcase in three files, and to
 simplify it to something like the testcase with global variables I
 used earlier.

 Modulo testcase, the patch is OK otherwise, but middle-end parts
 should be committed first.

 Thanks,
 Uros.


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-30 Thread Cong Hou
I have run check_GNU_style.sh on my patch.

The patch is submitted. Thank you for your comments and help on this patch!



thanks,
Cong


On Wed, Oct 30, 2013 at 11:13 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Wed, Oct 30, 2013 at 7:01 PM, Cong Hou co...@google.com wrote:

 I found my problem: I put DONE outside of if not inside. You are
 right. I have updated my patch.

 OK, great that we put things in order ;)

 Does this patch need some extra middle-end functionality? I was not
 able to vectorize char and short part of your patch.


 In the original patch, I converted abs() on short and char values to
 their own types by removing type casts. That is, originally char_val1
 = abs(char_val2) will be converted to char_val1 = (char) abs((int)
 char_val2) in the frontend, and I would like to convert it back to
 char_val1 = abs(char_val2). But after several discussions, it seems
 this conversion has some problems such as overflow concerns, and I
 thereby removed that part.

 Now you should still be able to vectorize abs(char) and abs(short) but
 with packing and unpacking. Later I will consider to write pattern
 recognizer for abs(char) and abs(short) and then the expand on
 abs(char)/abs(short) in this patch will be used during vectorization.

 OK, this seems reasonable. We already have unused SSSE3 8/16 bit abs
 pattern, so I think we can commit SSE2 expanders, even if they will be
 unused for now. The proposed recognizer will benefit SSE2 as well as
 existing SSSE3 patterns.

 Regarding the testcase - please put it to gcc.target/i386/ directory.
 There is nothing generic in the test, as confirmed by target-dependent
 scan test. You will find plenty of examples in the mentioned
 directory. I'd suggest to split the testcase in three files, and to
 simplify it to something like the testcase with global variables I
 used earlier.


 I have done it. The test case is split into three for s8/s16/s32 in
 gcc.target/i386.

 OK.

 The patch is OK for mainline, but please check formatting and
 whitespace before the patch is committed.

 Thanks,
 Uros.


Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-10-30 Thread Cong Hou
On Wed, Oct 30, 2013 at 4:27 AM, Richard Biener rguent...@suse.de wrote:
 On Tue, 29 Oct 2013, Cong Hou wrote:

 Hi

 SAD (Sum of Absolute Differences) is a common and important algorithm
 in image processing and other areas. SSE2 even introduced a new
 instruction PSADBW for it. A SAD loop can be greatly accelerated by
 this instruction after being vectorized. This patch introduced a new
 operation SAD_EXPR and a SAD pattern recognizer in vectorizer.

 The pattern of SAD is shown below:

  unsigned type x_t, y_t;
  signed TYPE1 diff, abs_diff;
  TYPE2 sum = init;
loop:
  sum_0 = phi <init, sum_1>
  S1  x_t = ...
  S2  y_t = ...
  S3  x_T = (TYPE1) x_t;
  S4  y_T = (TYPE1) y_t;
  S5  diff = x_T - y_T;
  S6  abs_diff = ABS_EXPR <diff>;
  [S7  abs_diff = (TYPE2) abs_diff;  #optional]
  S8  sum_1 = abs_diff + sum_0;

where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
same size as 'TYPE1' or bigger. This is a special case of a reduction
computation.

 For SSE2, type is char, and TYPE1 and TYPE2 are int.


 In order to express this new operation, a new expression SAD_EXPR is
 introduced in tree.def, and the corresponding entry in optabs is
 added. The patch also added the define_expand for SSE2 and AVX2
 platforms for i386.

 The patch is pasted below and also attached as a text file (in which
 you can see tabs). Bootstrap and make check got passed on x86. Please
 give me your comments.

 Apart from the testcase comment made earlier

 +++ b/gcc/tree-cfg.c
 @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
return false;

  case DOT_PROD_EXPR:
 +case SAD_EXPR:
  case REALIGN_LOAD_EXPR:
/* FIXME.  */
return false;

 please add proper verification of the operand types.

OK.


 +/* Widening sad (sum of absolute differences).
 +   The first two arguments are of type t1, which should be an unsigned
 +   integer type.
 +   The third argument and the result are of type t2, such that t2 is at
 +   least twice the size of t1.  SAD_EXPR (arg1, arg2, arg3) is equivalent to:
 +   tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
 +   tmp2 = ABS_EXPR (tmp1);
 +   arg3 = PLUS_EXPR (tmp2, arg3);   */
 +DEFTREECODE (SAD_EXPR, sad_expr, tcc_expression, 3)

 WIDEN_MINUS_EXPR doesn't exist so you have to explain its
 operation (it returns a signed wide difference?).  Why should
 the first two arguments be unsigned?  I cannot see a good reason
 to require that (other than that maybe the x86 target only has
 support for widened unsigned difference?).  So if you want to
 make that restriction maybe change the name to SADU_EXPR
 (sum of absolute differences of unsigned)?

 I suppose you tried introducing WIDEN_MINUS_EXPR instead and
 letting combine do it's work, avoiding the very special optab?

I may use the wrong representation here. I think the behavior of
WIDEN_MINUS_EXPR in SAD is different from the general one. SAD
usually works on unsigned integers (see
http://en.wikipedia.org/wiki/Sum_of_absolute_differences), and before
getting the difference between two unsigned integers, they are
promoted to bigger signed integers. And the result of (int)(char)(1) -
(int)(char)(-1) is different from (int)(unsigned char)(1) -
(int)(unsigned char)(-1). So we cannot implement SAD using
WIDEN_MINUS_EXPR.

Also, the SSE2 instruction PSADBW also requires the operands to be
unsigned 8-bit integers.

I will remove the improper description as you pointed out.



thanks,
Cong



 Thanks,
 Richard.



 thanks,
 Cong



 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 8a38316..d528307 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,23 @@
 +2013-10-29  Cong Hou  co...@google.com
 +
 + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
 + pattern recognition.
 + (type_conversion_p): PROMOTION is true if it's a type promotion
 + conversion, and false otherwise.  Return true if the given expression
 + is a type conversion one.
 + * tree-vectorizer.h: Adjust the number of patterns.
 + * tree.def: Add SAD_EXPR.
 + * optabs.def: Add sad_optab.
 + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
 + * expr.c (expand_expr_real_2): Likewise.
 + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
 + * gimple.c (get_gimple_rhs_num_ops): Likewise.
 + * optabs.c (optab_for_tree_code): Likewise.
 + * tree-cfg.c (estimate_operator_cost): Likewise.
 + * tree-ssa-operands.c (get_expr_operands): Likewise.
 + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
 + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
 +
  2013-10-14  David Malcolm  dmalc...@redhat.com

   * dumpfile.h (gcc::dump_manager): New class, to hold state
 diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
 index 7ed29f5..9ec761a 100644
 --- a/gcc/cfgexpand.c
 +++ b/gcc/cfgexpand.c
 @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
   {
   case COND_EXPR:
   case DOT_PROD_EXPR:
 + case SAD_EXPR:
   case

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-10-30 Thread Cong Hou
On Tue, Oct 29, 2013 at 4:49 PM, Ramana Radhakrishnan
ramana@googlemail.com wrote:
 Cong,

 Please don't do the following.

+++ b/gcc/testsuite/gcc.dg/vect/
 vect-reduc-sad.c
 @@ -0,0 +1,54 @@
 +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */

 you are adding a test to gcc.dg/vect - It's a common directory
 containing tests that need to run on multiple architectures and such
 tests should be keyed by the feature they enable which can be turned
 on for ports that have such an instruction.

 The correct way of doing this is to key this on the feature something
 like dg-require-effective-target vect_sad_char . And define the
 equivalent routine in testsuite/lib/target-supports.exp and enable it
 for sse2 for the x86 port. If in doubt look at
 check_effective_target_vect_int and a whole family of such functions
 in testsuite/lib/target-supports.exp

 This makes life easy for other port maintainers who want to turn on
 this support. And for bonus points please update the testcase writing
 wiki page with this information if it isn't already there.


OK, I will likely move the test case to gcc.target/i386 as currently
only SSE2 provides SAD instruction. But your suggestion also helps!


 You are also missing documentation updates for SAD_EXPR, md.texi for
 the new standard pattern name. Shouldn't it be called sad<mode>4
 really ?



I will add the documentation for the new operation SAD_EXPR.

I use sad<mode> by just following udot_prod<mode> as those two
operations are quite similar:

 OPTAB_D (udot_prod_optab, "udot_prod$I$a")


thanks,
Cong



 regards
 Ramana





 On Tue, Oct 29, 2013 at 10:23 PM, Cong Hou co...@google.com wrote:
 Hi

 SAD (Sum of Absolute Differences) is a common and important algorithm
 in image processing and other areas. SSE2 even introduced a new
 instruction PSADBW for it. A SAD loop can be greatly accelerated by
 this instruction after being vectorized. This patch introduced a new
 operation SAD_EXPR and a SAD pattern recognizer in vectorizer.

 The pattern of SAD is shown below:

  unsigned type x_t, y_t;
  signed TYPE1 diff, abs_diff;
  TYPE2 sum = init;
loop:
  sum_0 = phi <init, sum_1>
  S1  x_t = ...
  S2  y_t = ...
  S3  x_T = (TYPE1) x_t;
  S4  y_T = (TYPE1) y_t;
  S5  diff = x_T - y_T;
  S6  abs_diff = ABS_EXPR <diff>;
  [S7  abs_diff = (TYPE2) abs_diff;  #optional]
  S8  sum_1 = abs_diff + sum_0;

where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
same size as 'TYPE1' or bigger. This is a special case of a reduction
computation.

 For SSE2, type is char, and TYPE1 and TYPE2 are int.


 In order to express this new operation, a new expression SAD_EXPR is
 introduced in tree.def, and the corresponding entry in optabs is
 added. The patch also added the define_expand for SSE2 and AVX2
 platforms for i386.

 The patch is pasted below and also attached as a text file (in which
 you can see tabs). Bootstrap and make check got passed on x86. Please
 give me your comments.



 thanks,
 Cong



 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 8a38316..d528307 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,23 @@
 +2013-10-29  Cong Hou  co...@google.com
 +
 + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
 + pattern recognition.
 + (type_conversion_p): PROMOTION is true if it's a type promotion
 + conversion, and false otherwise.  Return true if the given expression
 + is a type conversion one.
 + * tree-vectorizer.h: Adjust the number of patterns.
 + * tree.def: Add SAD_EXPR.
 + * optabs.def: Add sad_optab.
 + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
 + * expr.c (expand_expr_real_2): Likewise.
 + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
 + * gimple.c (get_gimple_rhs_num_ops): Likewise.
 + * optabs.c (optab_for_tree_code): Likewise.
 + * tree-cfg.c (estimate_operator_cost): Likewise.
 + * tree-ssa-operands.c (get_expr_operands): Likewise.
 + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
 + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
 +
  2013-10-14  David Malcolm  dmalc...@redhat.com

   * dumpfile.h (gcc::dump_manager): New class, to hold state
 diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
 index 7ed29f5..9ec761a 100644
 --- a/gcc/cfgexpand.c
 +++ b/gcc/cfgexpand.c
 @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
   {
   case COND_EXPR:
   case DOT_PROD_EXPR:
 + case SAD_EXPR:
   case WIDEN_MULT_PLUS_EXPR:
   case WIDEN_MULT_MINUS_EXPR:
   case FMA_EXPR:
 diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
 index c3f6c94..ca1ab70 100644
 --- a/gcc/config/i386/sse.md
 +++ b/gcc/config/i386/sse.md
 @@ -6052,6 +6052,40 @@
DONE;
  })

 +(define_expand "sadv16qi"
 +  [(match_operand:V4SI 0 "register_operand")
 +   (match_operand:V16QI 1 "register_operand")
 +   (match_operand:V16QI 2 "register_operand")
 +   (match_operand:V4SI 3 "register_operand")]
 +  "TARGET_SSE2"
 +{
 +  rtx

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-29 Thread Cong Hou
On Tue, Oct 29, 2013 at 1:38 AM, Uros Bizjak ubiz...@gmail.com wrote:
 Hello!

 For the define_expand I added as below, the else body is there to
 avoid fall-through transformations to ABS operation in optabs.c.
 Otherwise ABS will be converted to other operations even though we have
 corresponding instructions from SSSE3.

 No, it wont be.

 Fallthrough will generate the pattern that will be matched by the insn
 pattern above, just like you are doing by hand below.


I think the case is special for abs(). In optabs.c, there is a
function expand_abs() in which the function expand_abs_nojump() is
called. This function first tries the expand function defined for the
target and if it fails it will try max(v, -v) then shift-xor-sub
method. If I don't generate any instruction for SSSE3, the
fall-through will be max(v, -v). I have tested it on my machine.



 (define_expand "abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
 (abs:VI124_AVX2_48_AVX512F
  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
   "TARGET_SSE2"
 {
   if (!TARGET_SSSE3)
 ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1]));

 Do you really need force_reg here? You are using generic expanders in
 ix86_expand_sse2_abs that can handle non-registers operands just as
 well.

You are right. I have removed force_reg.



   else
 emit_insn (gen_rtx_SET (VOIDmode, operands[0],
gen_rtx_ABS (<MODE>mode, operands[1])));
   DONE;
 })

 Please note that your mailer mangles indents. Please indent your code 
 correctly.

Right.. I also attached a text file in which all tabs are there.


The updated patch is pasted below (and also in the attached file).
Thank you very much for your comment!


Cong




diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  co...@google.com
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..0d9cefe 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr prefix_rex) (symbol_ref x86_extended_reg_mentioned_p (insn)))
(set_attr mode DI)])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
  (abs:VI124_AVX2_48_AVX512F

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-29 Thread Cong Hou
On Tue, Oct 29, 2013 at 10:34 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Oct 29, 2013 at 6:18 PM, Cong Hou co...@google.com wrote:

 For the define_expand I added as below, the else body is there to
 avoid fall-through transformations to ABS operation in optabs.c.
  Otherwise ABS will be converted to other operations even though we have
 corresponding instructions from SSSE3.

 No, it wont be.

 Fallthrough will generate the pattern that will be matched by the insn
 pattern above, just like you are doing by hand below.


 I think the case is special for abs(). In optabs.c, there is a
 function expand_abs() in which the function expand_abs_nojump() is
 called. This function first tries the expand function defined for the
 target and if it fails it will try max(v, -v) then shift-xor-sub
 method. If I don't generate any instruction for SSSE3, the
 fall-through will be max(v, -v). I have tested it on my machine.

 Huh, strange.

 Then you can rename previous pattern to abs<mode>2_1 and call it from
 the new expander instead of expanding it manually. Please also add a
 small comment, describing the situation to prevent future
 optimizations in this place.

Could you tell me how to do that? The renamed pattern abs<mode>2_1 is
also a define_expand? How to call this expander?

Thank you!


Cong




 Thanks,
 Uros.


[PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-10-29 Thread Cong Hou
Hi

SAD (Sum of Absolute Differences) is a common and important algorithm
in image processing and other areas. SSE2 even introduced a new
instruction PSADBW for it. A SAD loop can be greatly accelerated by
this instruction after being vectorized. This patch introduced a new
operation SAD_EXPR and a SAD pattern recognizer in vectorizer.

The pattern of SAD is shown below:

 unsigned type x_t, y_t;
 signed TYPE1 diff, abs_diff;
 TYPE2 sum = init;
   loop:
  sum_0 = phi <init, sum_1>
 S1  x_t = ...
 S2  y_t = ...
 S3  x_T = (TYPE1) x_t;
 S4  y_T = (TYPE1) y_t;
 S5  diff = x_T - y_T;
 S6  abs_diff = ABS_EXPR <diff>;
 [S7  abs_diff = (TYPE2) abs_diff;  #optional]
 S8  sum_1 = abs_diff + sum_0;

   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
   same size as 'TYPE1' or bigger. This is a special case of a reduction
   computation.

For SSE2, type is char, and TYPE1 and TYPE2 are int.


In order to express this new operation, a new expression SAD_EXPR is
introduced in tree.def, and the corresponding entry in optabs is
added. The patch also added the define_expand for SSE2 and AVX2
platforms for i386.

The patch is pasted below and also attached as a text file (in which
you can see tabs). Bootstrap and make check got passed on x86. Please
give me your comments.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..d528307 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,23 @@
+2013-10-29  Cong Hou  co...@google.com
+
+ * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+ pattern recognition.
+ (type_conversion_p): PROMOTION is true if it's a type promotion
+ conversion, and false otherwise.  Return true if the given expression
+ is a type conversion one.
+ * tree-vectorizer.h: Adjust the number of patterns.
+ * tree.def: Add SAD_EXPR.
+ * optabs.def: Add sad_optab.
+ * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+ * expr.c (expand_expr_real_2): Likewise.
+ * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+ * gimple.c (get_gimple_rhs_num_ops): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (estimate_operator_cost): Likewise.
+ * tree-ssa-operands.c (get_expr_operands): Likewise.
+ * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+ * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
  {
  case COND_EXPR:
  case DOT_PROD_EXPR:
+ case SAD_EXPR:
  case WIDEN_MULT_PLUS_EXPR:
  case WIDEN_MULT_MINUS_EXPR:
  case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..ca1ab70 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,40 @@
   DONE;
 })

+(define_expand "sadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "register_operand")
+   (match_operand:V4SI 3 "register_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+  gen_rtx_PLUS (V4SImode,
+ operands[3], t2)));
+  DONE;
+})
+
+(define_expand "sadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "register_operand")
+   (match_operand:V8SI 3 "register_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+  gen_rtx_PLUS (V8SImode,
+ operands[3], t2)));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
  (ashiftrt:VI24_AVX2
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target,
enum machine_mode tmode,
  return target;
   }

+  case SAD_EXPR:
+  {
+ tree oprnd0 = treeop0;
+ tree oprnd1 = treeop1;
+ tree oprnd2 = treeop2;
+ rtx op2;
+
 + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+ op2 = expand_normal (oprnd2);
+ target = expand_widen_pattern_expr (ops, op0, op1, op2,
+target, unsignedp);
+ return target;
+  }
+
 case REALIGN_LOAD_EXPR:
   {
 tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple
gs, int spc, int flags

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-28 Thread Cong Hou
As there are some issues with abs() type conversions, I removed the
related content from the patch but only kept the SSE2 support for
abs(int).

For the define_expand I added as below, the else body is there to
avoid fall-through transformations to ABS operation in optabs.c.
Otherwise ABS will be converted to other operations even though we have
corresponding instructions from SSSE3.


(define_expand "abs<mode>2"
  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
(abs:VI124_AVX2_48_AVX512F
 (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
  "TARGET_SSE2"
{
  if (!TARGET_SSSE3)
    ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1]));
  else
    emit_insn (gen_rtx_SET (VOIDmode, operands[0],
   gen_rtx_ABS (<MODE>mode, operands[1])));
  DONE;
})


The patch is attached here. Please give me your comments.


thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..84c7ab5 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2013-10-22  Cong Hou  co...@google.com
+
+ PR target/58762
+ * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
+ * config/i386/i386.c (ix86_expand_sse2_abs): New function.
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 3ab2f3a..ca31224 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
+extern void ix86_expand_sse2_abs (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 02cbbbd..71905fc 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
gen_rtx_MULT (mode, op1, op2));
 }

+void
+ix86_expand_sse2_abs (rtx op0, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  rtx tmp0, tmp1;
+
+  switch (mode)
+{
+  /* For 32-bit signed integer X, the best way to calculate the absolute
+ value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
+  case V4SImode:
+ tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
+GEN_INT (GET_MODE_BITSIZE
+ (GET_MODE_INNER (mode)) - 1),
+NULL, 0, OPTAB_DIRECT);
+ if (tmp0)
+  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
+  NULL, 0, OPTAB_DIRECT);
+ if (tmp0 && tmp1)
+  expand_simple_binop (mode, MINUS, tmp1, tmp0,
+   op0, 0, OPTAB_DIRECT);
+ break;
+
+  /* For 16-bit signed integer X, the best way to calculate the absolute
+ value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
+  case V8HImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  /* For 8-bit signed integer X, the best way to calculate the absolute
+ value of X is min ((unsigned char) X, (unsigned char) (-X)),
+ as SSE2 provides the PMINUB insn.  */
+  case V16QImode:
+ tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
+ if (tmp0)
+  expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0,
+   OPTAB_DIRECT);
+ break;
+
+  default:
+ break;
+}
+}
+
 /* Expand an insert into a vector register through pinsr insn.
Return true if successful.  */

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..b85ded4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -8721,7 +8721,7 @@
(set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)"))
(set_attr "mode" "DI")])

-(define_insn "abs<mode>2"
+(define_insn "*abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
  (abs:VI124_AVX2_48_AVX512F
   (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))]
@@ -8733,6 +8733,20 @@
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "<sseinsnmode>")])

+(define_expand "abs<mode>2"
+  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
+ (abs:VI124_AVX2_48_AVX512F
+  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
+  "TARGET_SSE2"
+{
+  if (!TARGET_SSSE3)
+    ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1]));
+  else
+    emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+    gen_rtx_ABS (<MODE>mode, operands[1])));
+  DONE;
+})
+
 (define_insn "abs<mode>2"
   [(set (match_operand:MMXMODEI 0 "register_operand" "=y")
  (abs:MMXMODEI
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..cf5b942 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-22  Cong Hou  co...@google.com
+
+ PR

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-24 Thread Cong Hou
I have updated the patch according to your suggestion, and have
committed the patch as the bootstrapping and make check both get
passed.

Thank you for your patient help on this patch! I learned a lot from it.


thanks,
Cong


On Wed, Oct 23, 2013 at 1:13 PM, Joseph S. Myers
jos...@codesourcery.com wrote:
 On Mon, 7 Oct 2013, Cong Hou wrote:

 +  if (type != newtype)
 +break;

 That comparison would wrongly treat as different cases where the types
 differ only in one being a typedef, having qualifiers, etc. - or if in
 future GCC implemented proposed TS 18661-3, cases where they differ in
 e.g. one being float and the other _Float32 (defined as distinct types
 that are not compatible although they have the same representation and
 alignment).  I think the right test here, bearing in mind the _Float32
 case where types may not be compatible, is TYPE_MODE (type) != TYPE_MODE
 (newtype) - if the types have the same mode, they have the same set of
 values and so are not different in any way that matters for this
 optimization.  OK with that change.

 --
 Joseph S. Myers
 jos...@codesourcery.com


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-24 Thread Cong Hou
On Wed, Oct 23, 2013 at 11:18 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Wed, Oct 23, 2013 at 09:40:21PM -0700, Cong Hou wrote:
 On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers
 jos...@codesourcery.com wrote:
  On Tue, 22 Oct 2013, Cong Hou wrote:
 
  For abs(char/short), type conversions are needed as the current abs()
  function/operation does not accept argument of char/short type.
  Therefore when we want to get the absolute value of a char_val using
  abs (char_val), it will be converted into abs ((int) char_val). It
  then can be vectorized, but the generated code is not efficient as
  lots of packings and unpackings are envolved. But if we convert
  (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
  able to generate better code. Same for short.
 
  ABS_EXPR has undefined overflow behavior.  Thus, abs ((int) -128) is
  defined (and we also define the subsequent conversion of +128 to signed
  char, which ISO C makes implementation-defined not undefined), and
  converting to an ABS_EXPR on char would wrongly make it undefined.  For
  such a transformation to be valid (in the absence of VRP saying that -128
  isn't a possible value) you'd need a GIMPLE representation for
  ABS_EXPR<overflow:wrap>, as distinct from ABS_EXPR<overflow:undefined>.
  You don't have the option there is for some arithmetic operations of
  converting to a corresponding operation on unsigned types.
 

 Yes, you are right. The method I use can guarantee wrapping on
 overflow (either shift-xor-sub or max(x, -x)). Can I just add the
 condition if (flag_wrapv) before the conversion I made to prevent the
 undefined behavior on overflow?

 What HW insns you expand to is one thing, but if some GCC pass assumes that
 ABS_EXPR always returns non-negative value (many do, look e.g. at
 tree_unary_nonnegative_warnv_p, extract_range_from_unary_expr_1,
 simplify_const_relational_operation, etc., you'd need to grep for all
 ABS_EXPR/ABS occurrences) and optimizes code based on that fact, you get
 wrong code because (char) abs((char) -128) is well defined.
 If we change the ABS_EXPR/ABS definition so that it is well defined on the most
 negative value of the type (resp. mode), then we lose all those
 optimizations, if we do that only for the char/short types, it would be
 quite weird, though we could keep the benefits, but at the RTL level we'd
 need to treat that way all the modes equal to short's mode and smaller (so,
 for sizeof(short) == sizeof(int) target even int's mode).


I checked those functions and they all consider the possibility of
overflow. For example, tree_unary_nonnegative_warnv_p only returns
true for ABS_EXPR on integers if overflow is undefined. If the
consequence of overflow is wrapping, I think converting (char)
abs((int)-128) to abs(-128) (-128 has char type) is safe. Can we do it
by checking flag_wrapv?

I could also first remove the abs conversion content from this patch
but only keep the content of expanding abs() for i386. I will submit
it later.



 The other possibility is not to create the ABS_EXPRs of char/short anywhere,
 solve the vectorization issues either through tree-vect-patterns.c or
 as part of the vectorization type demotion/promotions, see the recent
 discussions for that, you'd represent the short/char abs for the vectorized
 loop say using the shift-xor-sub or builtin etc. and if you want to do the
 same thing for scalar code, you'd just have combiner try to match some
 sequence.


Yes, I could do it through tree-vect-patterns.c, if the abs conversion
is prohibited. Currently the only reason I need the abs conversion is
for vectorization.

Vectorization type demotion/promotions is interesting, but I am afraid
we will face the same problem there.


Thank you for your comment!


Cong



 Jakub


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
On Tue, Oct 22, 2013 at 8:11 PM,  pins...@gmail.com wrote:


 Sent from my iPad

 On Oct 22, 2013, at 7:23 PM, Cong Hou co...@google.com wrote:

 This patch aims at PR58762.

 Currently GCC could not vectorize abs() operation for integers on x86
 with only SSE2 support. For int type, the reason is that the expand on
 abs() is not defined for vector type. This patch defines such an
 expand so that abs(int) will be vectorized with only SSE2.

 For abs(char/short), type conversions are needed as the current abs()
 function/operation does not accept argument of char/short type.
 Therefore when we want to get the absolute value of a char_val using
 abs (char_val), it will be converted into abs ((int) char_val). It
 then can be vectorized, but the generated code is not efficient as
 lots of packings and unpackings are involved. But if we convert
 (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
 able to generate better code. Same for short.

 This conversion also enables vectorizing abs(char/short) operation
 with PABSB and PABSW instructions in SSE3.

 With only SSE2 support, I developed three methods to expand
 abs(char/short/int) separately:

 1. For 32 bit int value x, we can get abs (x) from (((signed) x >>
 (W-1)) ^ x) - ((signed) x >> (W-1)). This is better than max (x, -x),
 which needs bit masking.

 2. For 16 bit int value x, we can get abs (x) from max (x, -x), as
 SSE2 provides PMAXSW instruction.

 3. For 8 bit int value x, we can get abs (x) from min ((unsigned char)
 x, (unsigned char) (-x)), as SSE2 provides PMINUB instruction.


 The patch is pasted below. Please point out any problem in my patch
 and analysis.


 thanks,
 Cong




 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 8a38316..e0f33ee 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,13 @@
 +2013-10-22  Cong Hou  co...@google.com
 +
 + PR target/58762
 + * convert.c (convert_to_integer): Convert (char) abs ((int) char_val)
 + into abs (char_val).  Also convert (short) abs ((int) short_val)
 + into abs (short_val).

 I don't like this optimization in convert.  I think it should be submitted 
 separately and should be done in tree-ssa-forwprop.


Yes. This patch can be split into two: one for vectorization and one
for abs conversion.

The reason why I put abs conversion to convert.c is because fabs
conversion is also done there.



 Also I think you should have a generic (non x86) test case for the above 
 optimization.


For vectorization I need to do it on x86 since the define_expand is
only for it. But for abs conversion, yes, I should make a generic test
case.


Thank you for your comments!


Cong



 Thanks,
 Andrew


 + * config/i386/i386-protos.h (ix86_expand_sse2_absvxsi2): New function.
 + * config/i386/i386.c (ix86_expand_sse2_absvxsi2): New function.
 + * config/i386/sse.md: Add SSE2 support to abs (char/int/short).



 +
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
 diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
 index 3ab2f3a..e85f663 100644
 --- a/gcc/config/i386/i386-protos.h
 +++ b/gcc/config/i386/i386-protos.h
 @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx,
 rtx, rtx, bool, bool);
 extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
 +extern void ix86_expand_sse2_absvxsi2 (rtx, rtx);

 /* In i386-c.c  */
 extern void ix86_target_macros (void);
 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
 index 02cbbbd..8050e02 100644
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx 
 op2)
gen_rtx_MULT (mode, op1, op2));
 }

 +void
 +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1)
 +{
 +  enum machine_mode mode = GET_MODE (op0);
 +  rtx tmp0, tmp1;
 +
 +  switch (mode)
 +{
 +  /* For 32-bit signed integer X, the best way to calculate the absolute
 + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)).  */
 +  case V4SImode:
 + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1,
 +GEN_INT (GET_MODE_BITSIZE
 + (GET_MODE_INNER (mode)) - 1),
 +NULL, 0, OPTAB_DIRECT);
 + if (tmp0)
 +  tmp1 = expand_simple_binop (mode, XOR, op1, tmp0,
 +  NULL, 0, OPTAB_DIRECT);
 + if (tmp0 && tmp1)
 +  expand_simple_binop (mode, MINUS, tmp1, tmp0,
 +   op0, 0, OPTAB_DIRECT);
 + break;
 +
 +  /* For 16-bit signed integer X, the best way to calculate the absolute
 + value of X is max (X, -X), as SSE2 provides the PMAXSW insn.  */
 +  case V8HImode:
 + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0);
 + if (tmp0)
 +  expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0,
 +   OPTAB_DIRECT);
 + break;
 +
 +  /* For 8-bit signed integer X, the best way to calculate the absolute
 + value of X is min ((unsigned char) X, (unsigned char

Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
On Wed, Oct 23, 2013 at 12:20 AM, Uros Bizjak ubiz...@gmail.com wrote:
 Hello!

 Currently GCC could not vectorize abs() operation for integers on x86
 with only SSE2 support. For int type, the reason is that the expand on
 abs() is not defined for vector type. This patch defines such an
 expand so that abs(int) will be vectorized with only SSE2.

 +(define_expand "abs<mode>2"
 +  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
 + (abs:VI124_AVX2_48_AVX512F
 +  (match_operand:VI124_AVX2_48_AVX512F 1 "register_operand")))]
 +  "TARGET_SSE2"
 +{
 +  if (TARGET_SSE2 && !TARGET_SSSE3)
 +ix86_expand_sse2_absvxsi2 (operands[0], operands[1]);
 +  else if (TARGET_SSSE3)
 +emit_insn (gen_rtx_SET (VOIDmode, operands[0],
 +gen_rtx_ABS (<MODE>mode, operands[1])));
 +  DONE;
 +})

 This should be written as:

 (define_expand "abs<mode>2"
   [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
(abs:VI124_AVX2_48_AVX512F
  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
   "TARGET_SSE2"
 {
   if (!TARGET_SSSE3)
 {
   ix86_expand_sse2_absvxsi2 (operands[0], operands[1]);
   DONE;
 }
 })

OK.


 Please note that operands[1] can be a memory operand, so your expander
 should either handle it (this is preferred) or load the operand to the
 register at the beginning of the expansion.


OK. I think I don't have to make any change to
ix86_expand_sse2_absvxsi2(), as operands[1] is always read-only.
Right?



 +void
 +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1)

 This function name implies SImode operands ... please just name it
 ix86_expand_sse2_abs.


Yes, my bad. At first I only considered V4SI but later forgot to
rename the function.


Thank you very much!


Cong


 Uros.


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers
jos...@codesourcery.com wrote:
 On Tue, 22 Oct 2013, Cong Hou wrote:

 For abs(char/short), type conversions are needed as the current abs()
 function/operation does not accept argument of char/short type.
 Therefore when we want to get the absolute value of a char_val using
 abs (char_val), it will be converted into abs ((int) char_val). It
 then can be vectorized, but the generated code is not efficient as
 lots of packings and unpackings are involved. But if we convert
 (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
 able to generate better code. Same for short.

 ABS_EXPR has undefined overflow behavior.  Thus, abs ((int) -128) is
 defined (and we also define the subsequent conversion of +128 to signed
 char, which ISO C makes implementation-defined not undefined), and
 converting to an ABS_EXPR on char would wrongly make it undefined.  For
 such a transformation to be valid (in the absence of VRP saying that -128
 isn't a possible value) you'd need a GIMPLE representation for
 ABS_EXPR<overflow:wrap>, as distinct from ABS_EXPR<overflow:undefined>.
 You don't have the option there is for some arithmetic operations of
 converting to a corresponding operation on unsigned types.


Yes, you are right. The method I use can guarantee wrapping on
overflow (either shift-xor-sub or max(x, -x)). Can I just add the
condition if (flag_wrapv) before the conversion I made to prevent the
undefined behavior on overflow?

Thank you!

Cong


 --
 Joseph S. Myers
 jos...@codesourcery.com


Re: [PATCH] Vectorizing abs(char/short/int) on x86.

2013-10-23 Thread Cong Hou
I think I did not make it clear. If GCC defines that converting 128 to a
char value yields the wrapping result -128, then the conversion from
(char) abs ((int) char_val) to abs (char_val) is safe if we can
guarantee abs ((char) -128) == -128 also. Then the subsequent methods
used to get abs() should also guarantee wrapping on overflow.
Shift-xor-sub is OK, but max(x, -x) is OK only if the result of the
negation operation on -128 is also -128 (wrapping). I think that is
exactly the behavior of the SSE2 operation PSUBB ([0,...,0], [x,...,x]), as
PSUBB can operate on both signed and unsigned operands.


thanks,
Cong


On Wed, Oct 23, 2013 at 9:40 PM, Cong Hou co...@google.com wrote:
 On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers
 jos...@codesourcery.com wrote:
 On Tue, 22 Oct 2013, Cong Hou wrote:

 For abs(char/short), type conversions are needed as the current abs()
 function/operation does not accept argument of char/short type.
 Therefore when we want to get the absolute value of a char_val using
 abs (char_val), it will be converted into abs ((int) char_val). It
 then can be vectorized, but the generated code is not efficient as
 lots of packings and unpackings are involved. But if we convert
 (char) abs ((int) char_val) to abs (char_val), the vectorizer will be
 able to generate better code. Same for short.

 ABS_EXPR has undefined overflow behavior.  Thus, abs ((int) -128) is
 defined (and we also define the subsequent conversion of +128 to signed
 char, which ISO C makes implementation-defined not undefined), and
 converting to an ABS_EXPR on char would wrongly make it undefined.  For
 such a transformation to be valid (in the absence of VRP saying that -128
 isn't a possible value) you'd need a GIMPLE representation for
 ABS_EXPR<overflow:wrap>, as distinct from ABS_EXPR<overflow:undefined>.
 You don't have the option there is for some arithmetic operations of
 converting to a corresponding operation on unsigned types.


 Yes, you are right. The method I use can guarantee wrapping on
 overflow (either shift-xor-sub or max(x, -x)). Can I just add the
 condition if (flag_wrapv) before the conversion I made to prevent the
 undefined behavior on overflow?

 Thank you!

 Cong


 --
 Joseph S. Myers
 jos...@codesourcery.com


Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-21 Thread Cong Hou
Jeff, thank you for installing this patch. Actually I already have the
write privileges. I just came back from a trip.

Thank you again!



thanks,
Cong


On Fri, Oct 18, 2013 at 10:22 PM, Jeff Law l...@redhat.com wrote:
 On 10/18/13 03:56, Richard Biener wrote:

 On Thu, 17 Oct 2013, Cong Hou wrote:

 I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it
 seems that GCC could hoist j+1 outside of the i loop:

 t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype)
 j_25;
 t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1;
 t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4;
 t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12;
 t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14;
 t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1;


 But your suggestion is still nice as it can remove a branch and make
 the code more brief. I have updated the patch and also included the
 nested loop example into the test case.


 Ok if it passes bootstrap & regtest.

 Bootstrapped & regression tested on x86_64-unknown-linux-gnu.  Installed on
 Cong's behalf.

 Cong -- if you plan on contributing regularly to GCC, please start the
 process for write privileges.  This form should have everything you need:

 https://sourceware.org/cgi-bin/pdw/ps_form.cgi

 Jeff


Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-21 Thread Cong Hou
OK. Have done that. And this is also a patch, right? ;)


thanks,
Cong



diff --git a/MAINTAINERS b/MAINTAINERS
index 15b6cc7..a6954da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
 Fergus Henderson   f...@cs.mu.oz.au
 Stuart Henderson   shend...@gcc.gnu.org
 Matthew Hiller hil...@redhat.com
 Manfred Hollstein  m...@suse.com
+Cong Hou   co...@google.com
 Falk Hueffner  f...@debian.org
 Andrew John Hughes gnu_and...@member.fsf.org
 Andy Hutchinsonhutchinsona...@aim.com





On Mon, Oct 21, 2013 at 9:46 AM, Jeff Law l...@redhat.com wrote:
 On 10/21/13 10:45, Cong Hou wrote:

 Jeff, thank you for installing this patch. Actually I already have the
 write privileges. I just came back from a trip.

 Ah.  I didn't see you in the MAINTAINERS file.  Can you update that file
 please.

 Thanks,
 jeff



Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-17 Thread Cong Hou
I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it
seems that GCC could hoist j+1 outside of the i loop:

t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype) j_25;
t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1;
t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4;
t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12;
t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14;
t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1;


But your suggestion is still nice as it can remove a branch and make
the code more brief. I have updated the patch and also included the
nested loop example into the test case.

Thank you!


Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  co...@google.com
+
+ * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+ statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  co...@google.com
+
+ * gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c
b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..6484a65
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void test1 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = *b + 1;
+}
+
+/* A test case with nested loops.  The load of b[j+1] in the inner
+   loop should be hoisted.  */
+
+void test2 (int* a, int* b)
+{
+  int i, j;
+  for (j = 0; j < 10; ++j)
+for (i = 0; i < 10; ++i)
+  a[i] = b[j+1] + 1;
+}
+
+/* A test case with ifcvt transformation.  */
+
+void test3 (int* a, int* b)
+{
+  int i, t;
+  for (i = 0; i < 1; ++i)
+{
+  if (*b > 0)
+ t = *b * 2;
+  else
+ t = *b / 2;
+  a[i] = t;
+}
+}
+
+/* A test case in which the store in the loop can be moved outside
+   in the versioned loop with alias checks.  Note this loop won't
+   be vectorized.  */
+
+void test4 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+*a += b[i];
+}
+
+/* A test case in which the load and store in the loop to b
+   can be moved outside in the versioned loop with alias checks.
+   Note this loop won't be vectorized.  */
+
+void test5 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+{
+  *b += a[i];
+  a[i] = *b;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "hoist" 8 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..1cc563c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,73 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 }

+
+  /* Extract load statements on memrefs with zero-stride accesses.  */
+
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+{
+  /* In the loop body, we iterate each statement to check if it is a load.
+ Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
+ then we will hoist the load statement to the loop preheader.  */
+
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  int nbbs = loop-num_nodes;
+
+  for (int i = 0; i  nbbs; ++i)
+ {
+  for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]);
+   !gsi_end_p (si);)
+{
+  gimple stmt = gsi_stmt (si);
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+  if (is_gimple_assign (stmt)
+      && (!dr
+          || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr)))))
+ {
+  bool hoist = true;
+  ssa_op_iter iter;
+  tree var;
+
+  /* We hoist a statement if all SSA uses in it are defined
+ outside of the loop.  */
+  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
+{
+  gimple def = SSA_NAME_DEF_STMT (var);
+  if (!gimple_nop_p (def)
+      && flow_bb_inside_loop_p (loop, gimple_bb (def)))
+ {
+  hoist = false;
+  break;
+ }
+}
+
+  if (hoist)
+{
+  if (dr)
+ gimple_set_vuse (stmt, NULL);
+
+  gsi_remove (si, false

Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-17 Thread Cong Hou
Ping?


thanks,
Cong


On Mon, Oct 7, 2013 at 10:15 AM, Cong Hou co...@google.com wrote:
 You are right. I am not an expert on numerical analysis, but I tested
 your case and it proves the number 4 conversion is not safe.

 Now we have four conversions which are safe once the precision
 requirement is satisfied. I added a condition if (type != newtype) to
 remove the unsafe one, as in this case one more conversion is added,
 which leads to the unsafe issue. If you think this condition does not
 make sense please let me know.

 The new patch is shown below (the attached file has tabs).

 Thank you very much!



 thanks,
 Cong



 Index: gcc/convert.c
 ===
 --- gcc/convert.c (revision 203250)
 +++ gcc/convert.c (working copy)
 @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
CASE_MATHFN (COS)
CASE_MATHFN (ERF)
CASE_MATHFN (ERFC)
 -  CASE_MATHFN (FABS)
CASE_MATHFN (LOG)
CASE_MATHFN (LOG10)
CASE_MATHFN (LOG2)
CASE_MATHFN (LOG1P)
 -  CASE_MATHFN (LOGB)
CASE_MATHFN (SIN)
 -  CASE_MATHFN (SQRT)
CASE_MATHFN (TAN)
CASE_MATHFN (TANH)
 +/* The above functions are not safe to do this conversion.  */
 +if (!flag_unsafe_math_optimizations)
 +  break;
 +  CASE_MATHFN (SQRT)
 +  CASE_MATHFN (FABS)
 +  CASE_MATHFN (LOGB)
  #undef CASE_MATHFN
  {
tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
 @@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr)
 if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
   newtype = TREE_TYPE (arg0);

 +  /* We consider to convert
 +
 + (T1) sqrtT2 ((T2) exprT3)
 + to
 + (T1) sqrtT4 ((T4) exprT3)
 +
 +  , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
 + and T4 is NEWTYPE. All those types are of floating point types.
 + T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion
 + is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of
 + T2 and T4. See the following URL for a reference:
 + 
 http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
 + */
 +  if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
 +   && !flag_unsafe_math_optimizations)
 + {
 +  /* The following conversion is unsafe even when the precision condition
 + below is satisfied:
 +
 + (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
 +*/
 +  if (type != newtype)
 +break;
 +
 +  int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
 +  int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
 +  if (p1 < p2 * 2 + 2)
 +break;
 + }
 +
/* Be careful about integer to fp conversions.
   These may overflow still.  */
 if (FLOAT_TYPE_P (TREE_TYPE (arg0))
     && TYPE_PRECISION (newtype) < TYPE_PRECISION (itype)
     && (TYPE_MODE (newtype) == TYPE_MODE (double_type_node)
 || TYPE_MODE (newtype) == TYPE_MODE (float_type_node)))
 -{
 + {
tree fn = mathfn_built_in (newtype, fcode);

if (fn)
 Index: gcc/ChangeLog
 ===
 --- gcc/ChangeLog (revision 203250)
 +++ gcc/ChangeLog (working copy)
 @@ -1,3 +1,9 @@
 +2013-10-07  Cong Hou  co...@google.com
 +
 + * convert.c (convert_to_real): Forbid unsafe math function
 + conversions including sin/cos/log etc. Add precision check
 + for sqrt.
 +
  2013-10-07  Bill Schmidt  wschm...@linux.vnet.ibm.com

   * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New.
 Index: gcc/testsuite/ChangeLog
 ===
 --- gcc/testsuite/ChangeLog (revision 203250)
 +++ gcc/testsuite/ChangeLog (working copy)
 @@ -1,3 +1,7 @@
 +2013-10-07  Cong Hou  co...@google.com
 +
 + * gcc.c-torture/execute/20030125-1.c: Update.
 +
  2013-10-07  Bill Schmidt  wschm...@linux.vnet.ibm.com

   * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian.
 Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
 ===
 --- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250)
 +++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
 @@ -44,11 +44,11 @@ __attribute__ ((noinline))
  double
  sin(double a)
  {
 - abort ();
 + return a;
  }
  __attribute__ ((noinline))
  float
  sinf(float a)
  {
 - return a;
 + abort ();
  }




 On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers jos...@codesourcery.com 
 wrote:
 On Fri, 6 Sep 2013, Cong Hou wrote:

 4: (float) sqrtl ((long double) double_val)  -  (float) sqrt (double_val)

 I don't believe this case is in fact safe even if precision (long double)
 >= precision (double) * 2 + 2 (when your patch would allow it).

 The result that precision (double) * 2 + 2 is sufficient for the result of
 rounding the long double value to double to be the same as the result of
 rounding once from infinite precision to double would I think also mean
 the same when rounding of the infinite-precision

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-16 Thread Cong Hou
On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener rguent...@suse.de wrote:
 On Tue, 15 Oct 2013, Cong Hou wrote:

 Thank you for your reminder, Jeff! I just noticed Richard's comment. I
 have modified the patch according to that.

 The new patch is attached.

 (posting patches inline is easier for review, now you have to deal
 with no quoting markers ;))

 Comments inline.

 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 8a38316..2637309 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,8 @@
 +2013-10-15  Cong Hou  co...@google.com
 +
 +   * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
 +   statement that contains data refs with zero-step.
 +
  2013-10-14  David Malcolm  dmalc...@redhat.com

 * dumpfile.h (gcc::dump_manager): New class, to hold state
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 075d071..9d0f4a5 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,7 @@
 +2013-10-15  Cong Hou  co...@google.com
 +
 +   * gcc.dg/vect/pr58508.c: New test.
 +
  2013-10-14  Tobias Burnus  bur...@net-b.de

 PR fortran/58658
 diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c 
 b/gcc/testsuite/gcc.dg/vect/pr58508.c
 new file mode 100644
 index 000..cb22b50
 --- /dev/null
 +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
 @@ -0,0 +1,20 @@
 +/* { dg-do compile } */
  +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
 +
 +
 +/* The GCC vectorizer generates loop versioning for the following loop
 +   since there may exist aliasing between A and B.  The predicate checks
 +   if A may alias with B across all iterations.  Then for the loop in
 +   the true body, we can assert that *B is a loop invariant so that
 +   we can hoist the load of *B before the loop body.  */
 +
 +void foo (int* a, int* b)
 +{
 +  int i;
  +  for (i = 0; i < 10; ++i)
 +a[i] = *b + 1;
 +}
 +
 +
  +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
  +/* { dg-final { cleanup-tree-dump "vect" } } */
 diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
 index 574446a..f4fdec2 100644
 --- a/gcc/tree-vect-loop-manip.c
 +++ b/gcc/tree-vect-loop-manip.c
 @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
  }


 Note that applying this kind of transform at this point invalidates
 some of the earlier analysis the vectorizer performed (namely the
 def-kind which now effectively gets vect_external_def from
 vect_internal_def).  In this case it doesn't seem to cause any
 issues (we re-compute the def-kind every time we need it (how wasteful)).

 +  /* Extract load and store statements on pointers with zero-stride
 + accesses.  */
 +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
 +{
 +  /* In the loop body, we iterate each statement to check if it is a load
 +or store.  Then we check the DR_STEP of the data reference.  If
 +DR_STEP is zero, then we will hoist the load statement to the loop
 +preheader, and move the store statement to the loop exit.  */

 We don't move the store yet.  Micha has a patch pending that enables
 vectorization of zero-step stores.

 +  for (gimple_stmt_iterator si = gsi_start_bb (loop-header);
 +  !gsi_end_p (si);)

 While technically ok now (vectorized loops contain a single basic block)
 please use LOOP_VINFO_BBS () to get at the vector of basic-blocks
 and iterate over them like other code does.


Have done it.



 +   {
 + gimple stmt = gsi_stmt (si);
 + stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
 +
 + if (dr && integer_zerop (DR_STEP (dr)))
 +   {
 + if (DR_IS_READ (dr))
 +   {
 + if (dump_enabled_p ())
 +   {
 + dump_printf_loc
 + (MSG_NOTE, vect_location,
 +  "hoist the statement to outside of the loop ");

 "hoisting out of the vectorized loop: "

 + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
 + dump_printf (MSG_NOTE, "\n");
 +   }
 +
 + gsi_remove (si, false);
 + gsi_insert_on_edge_immediate (loop_preheader_edge (loop), 
 stmt);

 Note that this will result in a bogus VUSE on the stmt at this point which
 will be only fixed because of implementation details of loop versioning.
 Either get the correct VUSE from the loop header virtual PHI node
 preheader edge (if there is none then the current VUSE is the correct one
 to use) or clear it.


I just cleared the VUSE since I noticed that after the vectorization
pass the correct VUSE is reassigned to the load.



 +   }
 + /* TODO: We also consider vectorizing loops containing zero-step
 +data refs as writes.  For example

Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.

2013-10-15 Thread Cong Hou
I have corrected the ChangeLog format, and committed this patch.

Thank you!


Cong


On Tue, Oct 15, 2013 at 6:38 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sat, Sep 28, 2013 at 3:28 AM, Cong Hou co...@google.com wrote:
 The current GCC vectorizer requires the following pattern as a simple
 reduction computation:

loop_header:
   a1 = phi <a0, a2>
  a3 = ...
  a2 = operation (a3, a1)

 But a3 can also be defined outside of the loop. For example, the
 following loop can benefit from vectorization but the GCC vectorizer
 fails to vectorize it:


 int foo(int v)
 {
   int s = 1;
   ++v;
   for (int i = 0; i < 10; ++i)
 s *= v;
   return s;
 }


 This patch relaxes the original requirement by also considering the
 following pattern:


a3 = ...
loop_header:
   a1 = phi <a0, a2>
  a2 = operation (a3, a1)


 A test case is also added. The patch is tested on x86-64.


 thanks,
 Cong

 

 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 39c786e..45c1667 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,9 @@
 +2013-09-27  Cong Hou  co...@google.com
 +
 + * tree-vect-loop.c: Relax the requirement of the reduction

 ChangeLog format is

 tab* tree-vect-loop.c (vect_is_simple_reduction_1): Relax the
 tabrequirement of the reduction.

 Ok with that change.

 Thanks,
 Richard.

 + pattern so that one operand of the reduction operation can
 + come from outside of the loop.
 +
  2013-09-25  Tom Tromey  tro...@redhat.com

   * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 09644d2..90496a2 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,7 @@
 +2013-09-27  Cong Hou  co...@google.com
 +
 + * gcc.dg/vect/vect-reduc-pattern-3.c: New test.
 +
  2013-09-25  Marek Polacek  pola...@redhat.com

   PR sanitizer/58413
 diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
 index 2871ba1..3c51c3b 100644
 --- a/gcc/tree-vect-loop.c
 +++ b/gcc/tree-vect-loop.c
 @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info,
 gimple phi, gimple first_stmt)
   a3 = ...
   a2 = operation (a3, a1)

 +   or
 +
 +   a3 = ...
 +   loop_header:
 + a1 = phi <a0, a2>
 + a2 = operation (a3, a1)
 +
 such that:
 1. operation is commutative and associative and it is safe to
change the order of the computation (if CHECK_REDUCTION is true)
 @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info
 loop_info, gimple phi,
   if (def2 && def2 == phi
       && (code == COND_EXPR
           || !def1 || gimple_nop_p (def1)
 +         || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
           || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
               && (is_gimple_assign (def1)
                   || is_gimple_call (def1)
 @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info
 loop_info, gimple phi,
   if (def1 && def1 == phi
       && (code == COND_EXPR
           || !def2 || gimple_nop_p (def2)
 +         || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
           || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
               && (is_gimple_assign (def2)
                   || is_gimple_call (def2)
 diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
 new file mode 100644
 index 000..06a9416
 --- /dev/null
 +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
 @@ -0,0 +1,41 @@
 +/* { dg-require-effective-target vect_int } */
 +
 +#include <stdarg.h>
 +#include "tree-vect.h"
 +
 +#define N 10
 +#define RES 1024
 +
 +/* A reduction pattern in which there is no data ref in
 +   the loop and one operand is defined outside of the loop.  */
 +
 +__attribute__ ((noinline)) int
 +foo (int v)
 +{
 +  int i;
 +  int result = 1;
 +
 +  ++v;
 +  for (i = 0; i < N; i++)
 +result *= v;
 +
 +  return result;
 +}
 +
 +int
 +main (void)
 +{
 +  int res;
 +
 +  check_vect ();
 +
 +  res = foo (1);
 +  if (res != RES)
 +abort ();
 +
 +  return 0;
 +}
 +
 +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 +/* { dg-final { cleanup-tree-dump "vect" } } */
 +


Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-15 Thread Cong Hou
Thank you for your reminder, Jeff! I just noticed Richard's comment. I
have modified the patch according to that.

The new patch is attached.


thanks,
Cong


On Tue, Oct 15, 2013 at 12:33 PM, Jeff Law l...@redhat.com wrote:
 On 10/14/13 17:31, Cong Hou wrote:

 Any comment on this patch?

 Richi replied in the BZ you opened.

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

 Essentially he said emit the load on the edge rather than in the block
 itself.
 jeff

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  co...@google.com
+
+   * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+   statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com
 
* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  co...@google.com
+
+   * gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de
 
PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c 
b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..f4fdec2 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 }
 
+
+  /* Extract load and store statements on pointers with zero-stride
+ accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+{
+  /* In the loop body, we iterate each statement to check if it is a load
+or store.  Then we check the DR_STEP of the data reference.  If
+DR_STEP is zero, then we will hoist the load statement to the loop
+preheader, and move the store statement to the loop exit.  */
+
+  for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
+  !gsi_end_p (si);)
+   {
+ gimple stmt = gsi_stmt (si);
+ stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+ struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+ if (dr && integer_zerop (DR_STEP (dr)))
+   {
+ if (DR_IS_READ (dr))
+   {
+ if (dump_enabled_p ())
+   {
+ dump_printf_loc
+ (MSG_NOTE, vect_location,
+  "hoist the statement to outside of the loop ");
+ dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
+ dump_printf (MSG_NOTE, "\n");
+   }
+
+ gsi_remove (si, false);
+ gsi_insert_on_edge_immediate (loop_preheader_edge (loop), 
stmt);
+   }
+ /* TODO: We also consider vectorizing loops containing zero-step
+data refs as writes.  For example:
+
+int a[N], *s;
+for (i = 0; i < N; i++)
+  *s += a[i];
+
+In this case the write to *s can be also moved after the
+loop.  */
+
+ continue;
+   }
+ else if (!dr)
+ {
+   bool hoist = true;
+   for (size_t i = 0; i < gimple_num_ops (stmt); i++)
+ {
+   tree op = gimple_op (stmt, i);
+   if (TREE_CODE (op) == INTEGER_CST
+   || TREE_CODE (op) == REAL_CST)
+ continue;
+   if (TREE_CODE (op) == SSA_NAME)
+ {
+   gimple def = SSA_NAME_DEF_STMT (op);
+   if (def == stmt
+   || gimple_nop_p (def)
+   || !flow_bb_inside_loop_p (loop, gimple_bb (def)))
+ continue;
+ }
+   hoist = false;
+   break;
+ }
+
+   if (hoist)
+ {
+   gsi_remove (si, false);
+   gsi_insert_on_edge_immediate (loop_preheader_edge

Fwd: [PATCH] Reducing number of alias checks in vectorization.

2013-10-14 Thread Cong Hou
Sorry for forgetting using plain-text mode. Resend it.


-- Forwarded message --
From: Cong Hou co...@google.com
Date: Mon, Oct 14, 2013 at 3:29 PM
Subject: Re: [PATCH] Reducing number of alias checks in vectorization.
To: Richard Biener rguent...@suse.de, GCC Patches gcc-patches@gcc.gnu.org
Cc: Jakub Jelinek ja...@redhat.com


I have made a new patch for this issue according to your comments.

There are several modifications to my previous patch:


1. Remove the use of STL features such as vector and sort. Use GCC's
vec and qsort instead.

2. Comparisons between tree nodes are no longer based on their
addresses. The compare_tree() function is used instead.

3. The function vect_create_cond_for_alias_checks() now returns the
number of alias checks. If its second parameter cond_expr is NULL,
then this function only calculates the number of alias checks after the
merging and won't generate comparison expressions.

4. The function vect_prune_runtime_alias_test_list() now uses
vect_create_cond_for_alias_checks() to get the number of alias checks.


The patch is attached as a text file.

Please give me your comment on this patch. Thank you!


Cong


On Thu, Oct 3, 2013 at 2:35 PM, Cong Hou co...@google.com wrote:

 Forget about this aux idea as the segment length for one data ref
 can be different in different dr pairs.

 In my patch I created a struct as shown below:

 struct dr_addr_with_seg_len
 {
   data_reference *dr;
   tree basic_addr;
   tree offset;
   tree seg_len;
 };


  Note that basic_addr and offset can always be obtained from dr, but we
 need to store two segment lengths for each dr pair. It is improper to
 add a field to data_dependence_relation as it is defined outside of
 vectorizer. We can change the type (a new one combining
 data_dependence_relation and segment length) of may_alias_ddrs in
 loop_vec_info to include such information, but we have to add a new
 type to tree-vectorizer.h which is only used in two places - still too
 much.

 One possible solution is that we create a local struct as shown above
 and a new function which returns the merged alias check information.
 This function will be called twice: once during analysis phase and
 once in transformation phase. Then we don't have to store the merged
 alias check information during those two phases. The additional time
 cost is minimal as there will not be too many data dependent dr pairs
 in a loop.

 Any comment?


 thanks,
 Cong


 On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou co...@google.com wrote:
  I noticed that there is a struct dataref_aux defined in
  tree-vectorizer.h which is specific to the vectorizer pass and is
  stored in (void*)aux in struct data_reference. Can we add one more
  field segment_length to dataref_aux so that we can pass this
  information for merging alias checks? Then we can avoid modifying or
  creating other structures.
 
 
  thanks,
  Cong
 
 
  On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou co...@google.com wrote:
  On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener rguent...@suse.de wrote:
  On Tue, 1 Oct 2013, Cong Hou wrote:
 
  When alias exists between data refs in a loop, to vectorize it GCC
  does loop versioning and adds runtime alias checks. Basically for each
  pair of data refs with possible data dependence, there will be two
  comparisons generated to make sure there is no aliasing between them
  in each iteration of the vectorized loop. If there are many such data
  refs pairs, the number of comparisons can be very large, which is a
  big overhead.
 
  However, in some cases it is possible to reduce the number of those
  comparisons. For example, for the following loop, we can detect that
  b[0] and b[1] are two consecutive member accesses so that we can
  combine the alias check between a[0:100]b[0] and a[0:100]b[1] into
  checking a[0:100]b[0:2]:
 
  void foo(int*a, int* b)
  {
 for (int i = 0; i < 100; ++i)
  a[i] = b[0] + b[1];
  }
 
  Actually, the requirement of consecutive memory accesses is too
  strict. For the following loop, we can still combine the alias checks
  between a[0:100] & b[0] and a[0:100] & b[100]:
 
  void foo(int*a, int* b)
  {
 for (int i = 0; i < 100; ++i)
  a[i] = b[0] + b[100];
  }
 
  This is because if b[0] is not in a[0:100] and b[100] is not in
  a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
  to check a[0:100] and b[0:101] don't overlap.
 
  More generally, consider two pairs of data refs (a, b1) and (a, b2).
  Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
  offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
  b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
  segment length of a, b1, and b2. Then we can combine the two
  comparisons into one if the following condition is satisfied:
 
  offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
 
 
  This patch detects those combination opportunities to reduce the
  number of alias checks. It is tested

Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-14 Thread Cong Hou
Any comment on this patch?


thanks,
Cong


On Thu, Oct 3, 2013 at 3:59 PM, Cong Hou co...@google.com wrote:
 During loop versioning in vectorization, the alias check guarantees
 that any load of a data reference with zero-step is a loop invariant,
 which can be hoisted outside of the loop. After hoisting the load
 statement, there may exist more loop invariant statements. This patch
 tries to find all those statements and hoists them before the loop.

 An example is shown below:


 for (i = 0; i < N; ++i)
   a[i] = *b + 1;


 After loop versioning the loop to be vectorized is guarded by

 if (b + 1 <= a || a + N <= b)

 which means there is no aliasing between *b and a[i]. The GIMPLE code
 of the loop body is:

   <bb 5>:
   # i_18 = PHI <0(4), i_29(6)>
   # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
   _23 = (long unsigned int) i_18;
   _24 = _23 * 4;
   _25 = a_6(D) + _24;
   _26 = *b_8(D);   <-- loop invariant
   _27 = _26 + 1;   <-- loop invariant
   *_25 = _27;
   i_29 = i_18 + 1;
   ivtmp_30 = ivtmp_22 - 1;
   if (ivtmp_30 != 0)
 goto <bb 6>;
   else
 goto <bb 21>;


 After hoisting loop invariant statements:


   _26 = *b_8(D);
   _27 = _26 + 1;

   <bb 5>:
   # i_18 = PHI <0(4), i_29(6)>
   # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
   _23 = (long unsigned int) i_18;
   _24 = _23 * 4;
   _25 = a_6(D) + _24;
   *_25 = _27;
   i_29 = i_18 + 1;
   ivtmp_30 = ivtmp_22 - 1;
   if (ivtmp_30 != 0)
 goto <bb 6>;
   else
 goto <bb 21>;


 This patch is related to the bug report
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508


 thanks,
 Cong


Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.

2013-10-14 Thread Cong Hou
Ping...


thanks,
Cong


On Wed, Oct 2, 2013 at 11:18 AM, Cong Hou co...@google.com wrote:
 Ping..  Any comment on this patch?


 thanks,
 Cong


 On Sat, Sep 28, 2013 at 9:34 AM, Xinliang David Li davi...@google.com wrote:
 You can also add a test case of this form:

 int foo( int t, int n, int *dst)
 {
int j = 0;
int s = 1;
t++;
for (j = 0; j < n; j++)
  {
  dst[j] = t;
  s *= t;
  }

return s;
 }

 where without the fix the loop vectorization is missed.

 David

 On Fri, Sep 27, 2013 at 6:28 PM, Cong Hou co...@google.com wrote:
 The current GCC vectorizer requires the following pattern as a simple
 reduction computation:

loop_header:
  a1 = phi <a0, a2>
  a3 = ...
  a2 = operation (a3, a1)

 But a3 can also be defined outside of the loop. For example, the
 following loop can benefit from vectorization but the GCC vectorizer
 fails to vectorize it:


 int foo(int v)
 {
   int s = 1;
   ++v;
   for (int i = 0; i < 10; ++i)
 s *= v;
   return s;
 }


 This patch relaxes the original requirement by also considering the
 following pattern:


a3 = ...
loop_header:
  a1 = phi <a0, a2>
  a2 = operation (a3, a1)


 A test case is also added. The patch is tested on x86-64.


 thanks,
 Cong

 

 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 39c786e..45c1667 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,9 @@
 +2013-09-27  Cong Hou  co...@google.com
 +
 + * tree-vect-loop.c: Relax the requirement of the reduction
 + pattern so that one operand of the reduction operation can
 + come from outside of the loop.
 +
  2013-09-25  Tom Tromey  tro...@redhat.com

   * Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 09644d2..90496a2 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,7 @@
 +2013-09-27  Cong Hou  co...@google.com
 +
 + * gcc.dg/vect/vect-reduc-pattern-3.c: New test.
 +
  2013-09-25  Marek Polacek  pola...@redhat.com

   PR sanitizer/58413
 diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
 index 2871ba1..3c51c3b 100644
 --- a/gcc/tree-vect-loop.c
 +++ b/gcc/tree-vect-loop.c
 @@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info,
 gimple phi, gimple first_stmt)
   a3 = ...
   a2 = operation (a3, a1)

 +   or
 +
 +   a3 = ...
 +   loop_header:
 + a1 = phi <a0, a2>
 + a2 = operation (a3, a1)
 +
 such that:
 1. operation is commutative and associative and it is safe to
change the order of the computation (if CHECK_REDUCTION is true)
 @@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info
 loop_info, gimple phi,
   if (def2 && def2 == phi
       && (code == COND_EXPR
           || !def1 || gimple_nop_p (def1)
 +         || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
           || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
               && (is_gimple_assign (def1)
                   || is_gimple_call (def1)
 @@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info
 loop_info, gimple phi,
   if (def1 && def1 == phi
       && (code == COND_EXPR
           || !def2 || gimple_nop_p (def2)
 +         || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
           || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
               && (is_gimple_assign (def2)
                   || is_gimple_call (def2)
 diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
 new file mode 100644
 index 000..06a9416
 --- /dev/null
 +++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
 @@ -0,0 +1,41 @@
 +/* { dg-require-effective-target vect_int } */
 +
 +#include <stdarg.h>
 +#include "tree-vect.h"
 +
 +#define N 10
 +#define RES 1024
 +
 +/* A reduction pattern in which there is no data ref in
 +   the loop and one operand is defined outside of the loop.  */
 +
 +__attribute__ ((noinline)) int
 +foo (int v)
 +{
 +  int i;
 +  int result = 1;
 +
 +  ++v;
 +  for (i = 0; i < N; i++)
 +result *= v;
 +
 +  return result;
 +}
 +
 +int
 +main (void)
 +{
 +  int res;
 +
 +  check_vect ();
 +
 +  res = foo (1);
 +  if (res != RES)
 +abort ();
 +
 +  return 0;
 +}
 +
 +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 +/* { dg-final { cleanup-tree-dump "vect" } } */
 +


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-07 Thread Cong Hou
You are right. I am not an expert on numerical analysis, but I tested
your case and it proves that conversion number 4 is not safe.

Now we have four conversions which are safe once the precision
requirement is satisfied. I added the condition if (type != newtype) to
remove the unsafe case, since there one more conversion is added, which
leads to the unsafe result. If you think this condition does not
make sense, please let me know.

The new patch is shown below (the attached file has tabs).

Thank you very much!



thanks,
Cong



Index: gcc/convert.c
===
--- gcc/convert.c (revision 203250)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
   CASE_MATHFN (COS)
   CASE_MATHFN (ERF)
   CASE_MATHFN (ERFC)
-  CASE_MATHFN (FABS)
   CASE_MATHFN (LOG)
   CASE_MATHFN (LOG10)
   CASE_MATHFN (LOG2)
   CASE_MATHFN (LOG1P)
-  CASE_MATHFN (LOGB)
   CASE_MATHFN (SIN)
-  CASE_MATHFN (SQRT)
   CASE_MATHFN (TAN)
   CASE_MATHFN (TANH)
+/* The conversion is not safe for the above functions.  */
+if (!flag_unsafe_math_optimizations)
+  break;
+  CASE_MATHFN (SQRT)
+  CASE_MATHFN (FABS)
+  CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 {
   tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
@@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr)
   if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
  newtype = TREE_TYPE (arg0);

+  /* We consider converting
+
+ (T1) sqrtT2 ((T2) exprT3)
+ to
+ (T1) sqrtT4 ((T4) exprT3)
+
+  , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
+ and T4 is NEWTYPE. All these types are floating point types.
+ T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion
+ is safe only if P1 >= P2*2+2, where P1 and P2 are precisions of
+ T2 and T4. See the following URL for a reference:
+ 
http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
+ */
+  if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
+      && !flag_unsafe_math_optimizations)
+ {
+  /* The following conversion is unsafe even if the precision condition
+ below is satisfied:
+
+ (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
+*/
+  if (type != newtype)
+break;
+
+  int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
+  int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
+  if (p1 < p2 * 2 + 2)
+break;
+ }
+
   /* Be careful about integer to fp conversions.
  These may overflow still.  */
   if (FLOAT_TYPE_P (TREE_TYPE (arg0))
       && TYPE_PRECISION (newtype) < TYPE_PRECISION (itype)
       && (TYPE_MODE (newtype) == TYPE_MODE (double_type_node)
           || TYPE_MODE (newtype) == TYPE_MODE (float_type_node)))
-{
+ {
   tree fn = mathfn_built_in (newtype, fcode);

   if (fn)
Index: gcc/ChangeLog
===
--- gcc/ChangeLog (revision 203250)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,9 @@
+2013-10-07  Cong Hou  co...@google.com
+
+ * convert.c (convert_to_real): Forbid unsafe math function
+ conversions including sin/cos/log etc. Add precision check
+ for sqrt.
+
 2013-10-07  Bill Schmidt  wschm...@linux.vnet.ibm.com

  * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New.
Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 203250)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,7 @@
+2013-10-07  Cong Hou  co...@google.com
+
+ * gcc.c-torture/execute/20030125-1.c: Update.
+
 2013-10-07  Bill Schmidt  wschm...@linux.vnet.ibm.com

  * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian.
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@ __attribute__ ((noinline))
 double
 sin(double a)
 {
- abort ();
+ return a;
 }
 __attribute__ ((noinline))
 float
 sinf(float a)
 {
- return a;
+ abort ();
 }




On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers jos...@codesourcery.com wrote:
 On Fri, 6 Sep 2013, Cong Hou wrote:

 4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)

 I don't believe this case is in fact safe even if precision (long double)
 >= precision (double) * 2 + 2 (when your patch would allow it).

 The result that precision (double) * 2 + 2 is sufficient for the result of
 rounding the long double value to double to be the same as the result of
 rounding once from infinite precision to double would I think also mean
 the same when rounding of the infinite-precision result to float happens
 once - that is, if instead of (float) sqrt (double_val) you have fsqrt
 (double_val) (fsqrt being the proposed function in draft TS 18661-1 for
 computing a square root of a double value

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-03 Thread Cong Hou
I noticed that there is a struct dataref_aux defined in
tree-vectorizer.h which is specific to the vectorizer pass and is
stored in (void*)aux in struct data_reference. Can we add one more
field segment_length to dataref_aux so that we can pass this
information for merging alias checks? Then we can avoid modifying or
creating other structures.


thanks,
Cong


On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou co...@google.com wrote:
 On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener rguent...@suse.de wrote:
 On Tue, 1 Oct 2013, Cong Hou wrote:

 When alias exists between data refs in a loop, to vectorize it GCC
 does loop versioning and adds runtime alias checks. Basically for each
 pair of data refs with possible data dependence, there will be two
 comparisons generated to make sure there is no aliasing between them
 in each iteration of the vectorized loop. If there are many such data
 refs pairs, the number of comparisons can be very large, which is a
 big overhead.

 However, in some cases it is possible to reduce the number of those
 comparisons. For example, for the following loop, we can detect that
 b[0] and b[1] are two consecutive member accesses so that we can
 combine the alias check between a[0:100] & b[0] and a[0:100] & b[1] into
 checking a[0:100] & b[0:2]:

 void foo(int*a, int* b)
 {
for (int i = 0; i < 100; ++i)
 a[i] = b[0] + b[1];
 }

 Actually, the requirement of consecutive memory accesses is too
 strict. For the following loop, we can still combine the alias checks
 between a[0:100] & b[0] and a[0:100] & b[100]:

 void foo(int*a, int* b)
 {
for (int i = 0; i < 100; ++i)
 a[i] = b[0] + b[100];
 }

 This is because if b[0] is not in a[0:100] and b[100] is not in
 a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
 to check a[0:100] and b[0:101] don't overlap.

 More generally, consider two pairs of data refs (a, b1) and (a, b2).
 Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
 offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
 b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
 segment length of a, b1, and b2. Then we can combine the two
 comparisons into one if the following condition is satisfied:

 offset_b2 - offset_b1 - segment_length_b1 < segment_length_a


 This patch detects those combination opportunities to reduce the
 number of alias checks. It is tested on an x86-64 machine.

 Apart from the other comments you got (to which I agree) the patch
 seems to do two things, namely also:

 +  /* Extract load and store statements on pointers with zero-stride
 + accesses.  */
 +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
 +{

 which I'd rather see in a separate patch (and done also when
 the loop doesn't require versioning for alias).



 My mistake.. I am working on those two patches at the same time and
 pasted that one also here by mistake. I will send another patch about
 the hoist topic.


 Also combining the alias checks in vect_create_cond_for_alias_checks
 is nice but doesn't properly fix the use of the
 vect-max-version-for-alias-checks param which currently inhibits
 vectorization of the HIMENO benchmark by default (and make us look bad
 compared to LLVM).

 So I believe this merging should be done incrementally when
 we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.



 I agree that vect-max-version-for-alias-checks param should count the
 number of checks after the merge. However, the struct
 data_dependence_relation could not record the new information produced
 by the merge. The new information I mentioned contains the new segment
 length for comparisons. This length is calculated right in
 vect_create_cond_for_alias_checks() function. Since
 vect-max-version-for-alias-checks is used during analysis phase, shall
 we move all those (get segment length for each data ref and merge
 alias checks) from transformation to analysis phase? If we cannot
 store the result properly (data_dependence_relation is not enough),
 shall we do it twice in both phases?

 I also noticed a possible bug in the function vect_same_range_drs()
 called by vect_prune_runtime_alias_test_list(). For the following code
 I get two pairs of data refs after
 vect_prune_runtime_alias_test_list(), but in
 vect_create_cond_for_alias_checks() after detecting grouped accesses I
 got two identical pairs of data refs. The consequence is two identical
 alias checks are produced.


 void yuv2yuyv_ref (int *d, int *src, int n)
 {
   char *dest = (char *)d;
   int i;

   for (i = 0; i < n/2; i++) {
 dest[i*4 + 0] = (src[i*2 + 0]) >> 16;
 dest[i*4 + 1] = (src[i*2 + 1]) >> 8;
 dest[i*4 + 2] = (src[i*2 + 0]) >> 16;
 dest[i*4 + 3] = (src[i*2 + 0]) >> 0;
   }
 }


 I think the solution to this problem is changing

 GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_i))
 == GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_j))

 into

 STMT_VINFO_DATA_REF (vinfo_for_stmt (GROUP_FIRST_ELEMENT
 (vinfo_for_stmt (stmt_i

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-03 Thread Cong Hou
On Thu, Oct 3, 2013 at 2:06 PM, Joseph S. Myers jos...@codesourcery.com wrote:
 On Tue, 1 Oct 2013, Cong Hou wrote:

 +#include vector
 +#include utility
 +#include algorithm
 +
  #include config.h

 Whatever the other issues about including these headers at all, any system
 header (C or C++) must always be included *after* config.h, as config.h
 may define feature test macros that are only properly effective if defined
 before any system headers are included, and these macros (affecting such
 things as the size of off_t) need to be consistent throughout GCC.


OK. Actually I ran into some conflicts when I put those three C++
headers after all the other includes.

Thank you for the comments.


Cong


 --
 Joseph S. Myers
 jos...@codesourcery.com


Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-03 Thread Cong Hou
Forget about this aux idea as the segment length for one data ref
can be different in different dr pairs.

In my patch I created a struct as shown below:

struct dr_addr_with_seg_len
{
  data_reference *dr;
  tree basic_addr;
  tree offset;
  tree seg_len;
};


Note that basic_addr and offset can always be obtained from dr, but we
need to store two segment lengths for each dr pair. It is improper to
add a field to data_dependence_relation as it is defined outside of
vectorizer. We can change the type (a new one combining
data_dependence_relation and segment length) of may_alias_ddrs in
loop_vec_info to include such information, but we have to add a new
type to tree-vectorizer.h which is only used in two places - still too
much.

One possible solution is that we create a local struct as shown above
and a new function which returns the merged alias check information.
This function will be called twice: once during analysis phase and
once in transformation phase. Then we don't have to store the merged
alias check information during those two phases. The additional time
cost is minimal as there will not be too many data dependent dr pairs
in a loop.

Any comment?


thanks,
Cong


On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou co...@google.com wrote:
 I noticed that there is a struct dataref_aux defined in
 tree-vectorizer.h which is specific to the vectorizer pass and is
 stored in (void*)aux in struct data_reference. Can we add one more
 field segment_length to dataref_aux so that we can pass this
  information for merging alias checks? Then we can avoid modifying or
  creating other structures.


 thanks,
 Cong


 On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou co...@google.com wrote:
 On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener rguent...@suse.de wrote:
 On Tue, 1 Oct 2013, Cong Hou wrote:

 When alias exists between data refs in a loop, to vectorize it GCC
 does loop versioning and adds runtime alias checks. Basically for each
 pair of data refs with possible data dependence, there will be two
 comparisons generated to make sure there is no aliasing between them
 in each iteration of the vectorized loop. If there are many such data
 refs pairs, the number of comparisons can be very large, which is a
 big overhead.

 However, in some cases it is possible to reduce the number of those
 comparisons. For example, for the following loop, we can detect that
 b[0] and b[1] are two consecutive member accesses so that we can
  combine the alias check between a[0:100] & b[0] and a[0:100] & b[1] into
  checking a[0:100] & b[0:2]:

 void foo(int*a, int* b)
 {
 for (int i = 0; i < 100; ++i)
 a[i] = b[0] + b[1];
 }

 Actually, the requirement of consecutive memory accesses is too
 strict. For the following loop, we can still combine the alias checks
  between a[0:100] & b[0] and a[0:100] & b[100]:

 void foo(int*a, int* b)
 {
 for (int i = 0; i < 100; ++i)
 a[i] = b[0] + b[100];
 }

 This is because if b[0] is not in a[0:100] and b[100] is not in
 a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need
 to check a[0:100] and b[0:101] don't overlap.

 More generally, consider two pairs of data refs (a, b1) and (a, b2).
 Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2;
  offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and
 b2, and segment_length_a, segment_length_b1, and segment_length_b2 are
 segment length of a, b1, and b2. Then we can combine the two
 comparisons into one if the following condition is satisfied:

  offset_b2 - offset_b1 - segment_length_b1 < segment_length_a


 This patch detects those combination opportunities to reduce the
 number of alias checks. It is tested on an x86-64 machine.

 Apart from the other comments you got (to which I agree) the patch
 seems to do two things, namely also:

 +  /* Extract load and store statements on pointers with zero-stride
 + accesses.  */
 +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
 +{

 which I'd rather see in a separate patch (and done also when
 the loop doesn't require versioning for alias).



 My mistake. I am working on those two patches at the same time and
 pasted that one here by mistake. I will send another patch about
 the hoisting topic.


 Also combining the alias checks in vect_create_cond_for_alias_checks
 is nice but doesn't properly fix the use of the
 vect-max-version-for-alias-checks param, which currently inhibits
 vectorization of the HIMENO benchmark by default (and makes us look bad
 compared to LLVM).

 So I believe this merging should be done incrementally when
 we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.



 I agree that the vect-max-version-for-alias-checks param should count
 the number of checks after the merge. However, the struct
 data_dependence_relation cannot record the new information produced
 by the merge. The new information I mentioned contains the new segment
 length for comparisons. This length is calculated right

[PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.

2013-10-03 Thread Cong Hou
During loop versioning in vectorization, the alias check guarantees
that any load of a data reference with zero-step is a loop invariant,
which can be hoisted outside of the loop. After hoisting the load
statement, there may exist more loop invariant statements. This patch
tries to find all those statements and hoists them before the loop.

An example is shown below:


for (i = 0; i  N; ++i)
  a[i] = *b + 1;


After loop versioning the loop to be vectorized is guarded by

if (b + 1 <= a || a + N <= b)

which means there is no aliasing between *b and a[i]. The GIMPLE code
of the loop body is:

  <bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  _26 = *b_8(D);        <= loop invariant
  _27 = _26 + 1;        <= loop invariant
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 21>;


After hoisting loop invariant statements:


  _26 = *b_8(D);
  _27 = _26 + 1;

  <bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 21>;


This patch is related to the bug report
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508


thanks,
Cong
diff --git gcc/testsuite/gcc.dg/vect/pr58508.c 
gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */


Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.

2013-10-03 Thread Cong Hou
Ping...


thanks,
Cong


On Fri, Sep 20, 2013 at 9:49 AM, Cong Hou co...@google.com wrote:
 Any comment or more suggestions on this patch?


 thanks,
 Cong

 On Mon, Sep 9, 2013 at 7:28 PM, Cong Hou co...@google.com wrote:
 On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li davi...@google.com wrote:
 On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou co...@google.com wrote:
 First, thank you for your detailed comments again! I deeply
 apologize for not explaining my patch properly and for not responding
 properly to your previous comment. I didn't understand the problem
 thoroughly before submitting the patch.

 Previously I only considered the following three conversions for sqrt():


 1: (float) sqrt ((double) float_val)  ->  sqrtf (float_val)
 2: (float) sqrtl ((long double) float_val)  ->  sqrtf (float_val)
 3: (double) sqrtl ((long double) double_val)  ->  sqrt (double_val)


 We have four types here:

 TYPE is the type to which the result of the function call is converted.
 ITYPE is the type of the math call expression.
 TREE_TYPE(arg0) is the type of the function argument (before type 
 conversion).
 NEWTYPE is chosen from TYPE and TREE_TYPE(arg0) with higher precision.
 It will be the type of the new math call expression after conversion.

 For all three cases above, TYPE is always the same as NEWTYPE. That is
 why I only considered TYPE during the precision comparison. ITYPE can
 only be double_type_node or long_double_type_node depending on the
 type of the math function. That is why I explicitly used those two
 types instead of ITYPE (no correctness issue). But you are right,
 ITYPE is more elegant and better here.

 After further analysis, I found I missed two more cases. Note that we
 have the following conditions according to the code in convert.c:

 TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TYPE)
 TYPE_PRECISION(NEWTYPE) >= TYPE_PRECISION(TREE_TYPE(arg0))
 TYPE_PRECISION(NEWTYPE) < TYPE_PRECISION(ITYPE)

 the last condition comes from the fact that we only consider
 converting a math function call into another one with narrower type.
 Therefore we have

 TYPE_PRECISION(TYPE) < TYPE_PRECISION (ITYPE)
 TYPE_PRECISION(TREE_TYPE(arg0)) < TYPE_PRECISION (ITYPE)

 So for sqrt(), TYPE and TREE_TYPE(arg0) can only be float, and for
 sqrtl(), TYPE and TREE_TYPE(arg0) can be either float or double with
 four possible combinations. Therefore we have two more conversions to
 consider besides the three ones I mentioned above:


 4: (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)
 5: (double) sqrtl ((long double) float_val)  ->  sqrt ((double) float_val)


 For the first of these (conversion 4), TYPE (float) is different from
 NEWTYPE (double), and my previous patch doesn't handle this case. The
 correct way is to compare the precisions of ITYPE and NEWTYPE now.

 To sum up, we are converting the expression

 (TYPE) sqrtITYPE ((ITYPE) expr)

 to

 (TYPE) sqrtNEWTYPE ((NEWTYPE) expr)

 and we require

 PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2

 to make it a safe conversion.


 The new patch is pasted below.

 I appreciate your detailed comments and analysis, and next time when I
 submit a patch I will be more careful about the reviewer's comments.


 Thank you!

 Cong



 Index: gcc/convert.c
 ===
 --- gcc/convert.c (revision 201891)
 +++ gcc/convert.c (working copy)
 @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
CASE_MATHFN (COS)
CASE_MATHFN (ERF)
CASE_MATHFN (ERFC)
 -  CASE_MATHFN (FABS)
CASE_MATHFN (LOG)
CASE_MATHFN (LOG10)
CASE_MATHFN (LOG2)
CASE_MATHFN (LOG1P)
 -  CASE_MATHFN (LOGB)
CASE_MATHFN (SIN)
 -  CASE_MATHFN (SQRT)
CASE_MATHFN (TAN)
CASE_MATHFN (TANH)
 +/* The above functions are not safe to do this conversion. */
 +if (!flag_unsafe_math_optimizations)
 +  break;
 +  CASE_MATHFN (SQRT)
 +  CASE_MATHFN (FABS)
 +  CASE_MATHFN (LOGB)
  #undef CASE_MATHFN
  {
tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
 @@ -155,6 +158,27 @@ convert_to_real (tree type, tree expr)
if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
   newtype = TREE_TYPE (arg0);

 +  /* We consider to convert
 +
 + (T1) sqrtT2 ((T2) exprT3)
 + to
 + (T1) sqrtT4 ((T4) exprT3)

 Should this be

   (T4) sqrtT4 ((T4) exprT3)

 T4 is not necessarily the same as T1. For the conversion:

 (float) sqrtl ((long double) double_val)  ->  (float) sqrt (double_val)

 T4 is double and T1 is float.


 +
 +  , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
 + and T4 is NEWTYPE.

 NEWTYPE is also the wider one between T1 and T3.

 Right. Actually this is easy to catch from the context just before
 this comment.

 tree newtype = type;
 if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
 newtype = TREE_TYPE (arg0);



 thanks,
 Cong



 David

 All those types are of floating point types.
 + T4 (NEWTYPE) should be narrower than T2 (ITYPE

Re: [PATCH] Reducing number of alias checks in vectorization.

2013-10-02 Thread Cong Hou
On Tue, Oct 1, 2013 at 11:35 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Tue, Oct 01, 2013 at 07:12:54PM -0700, Cong Hou wrote:
 --- gcc/tree-vect-loop-manip.c (revision 202662)
 +++ gcc/tree-vect-loop-manip.c (working copy)

 Your mailer ate all the tabs, so the formatting of the whole patch
 can't be checked.



I'll pay attention to this problem in my later patch submission.


 @@ -19,6 +19,10 @@ You should have received a copy of the G
  along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */

 +#include <vector>
 +#include <utility>
 +#include <algorithm>

 Why?  GCC has its vec.h vectors, why don't you use those?
 There is even a qsort method for you in there.  And for pairs, you can
 easily just use structs with two members as structure elements in the
 vector.



GCC is now restructured using C++, and the STL is one of the most
important parts of C++. I am new to the GCC community and more familiar
with the STL (and I think allowing the STL in GCC could attract more
new developers for GCC). I agree that using GCC's vec maintains a
uniform style, but the STL is just so powerful and easy to use...

I just did a search in the GCC source tree and found that std::vector
is not used yet. I will change std::vector to GCC's vec for now (and
also use its qsort), but I am still wondering if one day GCC will
accept the STL.


 +struct dr_addr_with_seg_len
 +{
 +  dr_addr_with_seg_len (data_reference* d, tree addr, tree off, tree len)
 +: dr (d), basic_addr (addr), offset (off), seg_len (len) {}
 +
 +  data_reference* dr;

 Space should be before *, not after it.

 +  if (TREE_CODE (p11.offset) != INTEGER_CST
 +  || TREE_CODE (p21.offset) != INTEGER_CST)
 +return p11.offset < p21.offset;

 If offset isn't INTEGER_CST, you are comparing the pointer values?
 That is never a good idea; compilation would then depend on how, say,
 address space randomization randomizes the virtual address space.  GCC
 needs to have reproducible compilations.


In this scenario comparing pointers is safe. The sort is used to put
together any two pairs of data refs that can be merged. For example,
if we have (a, b), (a, c), (a, b+1), then after sorting them we should
have either (a, b), (a, b+1), (a, c) or (a, c), (a, b), (a, b+1). We
don't care about the relative order of non-mergeable dr pairs here. So
although the sorting result may vary, the final result we get should
not change.



 +  if (int_cst_value (p11.offset) != int_cst_value (p21.offset))
 +return int_cst_value (p11.offset) < int_cst_value (p21.offset);

 This is going to ICE whenever the offsets don't fit into a
 HOST_WIDE_INT.

 I'd say you just shouldn't put into the vector entries where offset isn't
 host_integerp, those would never be merged with other checks, or something
 similar.

Do you mean I should use widest_int_cst_value()? Then I will replace
all uses of int_cst_value() here with it. I have also changed the type
of the diff variable to HOST_WIDEST_INT.



Thank you very much for your comments!

Cong




 Jakub

