Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Ping? thanks, Cong

On Tue, Jul 8, 2014 at 8:23 PM, Xinliang David Li <davi...@google.com> wrote:
Cong, can you ping this patch again? There do not seem to be any pending comments left. David

On Tue, Dec 17, 2013 at 10:05 AM, Cong Hou <co...@google.com> wrote:
Ping? thanks, Cong

On Mon, Dec 2, 2013 at 5:02 PM, Cong Hou <co...@google.com> wrote:
Any comment on this patch? thanks, Cong

On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou <co...@google.com> wrote:
On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse <marc.gli...@inria.fr> wrote:
On Thu, 21 Nov 2013, Cong Hou wrote:
On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse <marc.gli...@inria.fr> wrote:
On Thu, 21 Nov 2013, Cong Hou wrote:

While I added the new define_insn_and_split for vec_merge, a bug was exposed: in config/i386/sse.md, [ define_expand xop_vmfrcz<mode>2 ] only takes one input, but the corresponding builtin functions have two inputs, as shown in i386.c:

{ OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, __builtin_ia32_vfrczss, IX86_BUILTIN_VFRCZSS, UNKNOWN, (int)MULTI_ARG_2_SF },
{ OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, __builtin_ia32_vfrczsd, IX86_BUILTIN_VFRCZSD, UNKNOWN, (int)MULTI_ARG_2_DF },

As a consequence, the ix86_expand_multi_arg_builtin() function tries to check two args, but based on the define_expand of xop_vmfrcz<mode>2, the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be incorrect (because it only needs one input). The patch below fixes this issue. Bootstrapped and tested on an x86-64 machine. Note that this patch should be applied before the one I sent earlier (sorry for sending them in the wrong order).

This is PR 56788. Your patch seems strange to me and I don't think it fixes the real issue, but I'll let more knowledgeable people answer.

Thank you for pointing out the bug report. This patch is not intended to fix PR56788.
IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the associated builtin, which would solve your issue as well.

I agree. Then I will wait until your patch is merged to the trunk; otherwise my patch could not pass the tests. For your function:

#include <x86intrin.h>
__m128d f(__m128d x, __m128d y){
  return _mm_frcz_sd(x,y);
}

Note that the second parameter is ignored intentionally, but the prototype of this function contains two parameters. My fix explicitly tells GCC that the optab xop_vmfrczv4sf3 should have three operands instead of two, so that it has the correct information in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2], which is used to match the type of the second parameter of the builtin function in ix86_expand_multi_arg_builtin().

I disagree that this is intentional; it is a bug. AFAIK there is no AMD documentation that could be used as a reference for what _mm_frcz_sd is supposed to do. The only existing documentation is by Microsoft (which does *not* ignore the second argument) and by LLVM (which has a single argument). Whatever we choose for _mm_frcz_sd, the builtin should take a single argument, and if necessary we'll use 2 builtins to implement _mm_frcz_sd.

I also only found the one by Microsoft. If the second argument is ignored, we could just remove it, as long as there is no standard that requires two arguments. Hopefully it won't break current projects using _mm_frcz_sd. Thank you for your comments!

Cong

--
Marc Glisse
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener <richard.guent...@gmail.com> wrote:
On Sat, May 3, 2014 at 2:39 AM, Cong Hou <co...@google.com> wrote:
On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
On Thu, 24 Apr 2014, Cong Hou wrote:

Given the following loop:

int a[N];
short b[N*2];
for (int i = 0; i < N; ++i)
  a[i] = b[i*2];

After being vectorized, the access to b[i*2] will be compiled into several packing statements, while the type promotion from short to int will be compiled into several unpacking statements. With this patch, each pair of pack/unpack statements will be replaced by less expensive statements (with shift or bit-and operations).

On x86_64, the loop above will be compiled into the following assembly (with -O2 -ftree-vectorize):

movdqu 0x10(%rcx),%xmm3
movdqu -0x20(%rcx),%xmm0
movdqa %xmm0,%xmm2
punpcklwd %xmm3,%xmm0
punpckhwd %xmm3,%xmm2
movdqa %xmm0,%xmm3
punpcklwd %xmm2,%xmm0
punpckhwd %xmm2,%xmm3
movdqa %xmm1,%xmm2
punpcklwd %xmm3,%xmm0
pcmpgtw %xmm0,%xmm2
movdqa %xmm0,%xmm3
punpckhwd %xmm2,%xmm0
punpcklwd %xmm2,%xmm3
movups %xmm0,-0x10(%rdx)
movups %xmm3,-0x20(%rdx)

With this patch, the generated assembly is shown below:

movdqu 0x10(%rcx),%xmm0
movdqu -0x20(%rcx),%xmm1
pslld $0x10,%xmm0
psrad $0x10,%xmm0
pslld $0x10,%xmm1
movups %xmm0,-0x10(%rdx)
psrad $0x10,%xmm1
movups %xmm1,-0x20(%rdx)

Bootstrapped and tested on x86-64. OK for trunk?

This is an odd place to implement such a transform. Also whether it is faster or not depends on the exact ISA you target - for example ppc has constraints on the maximum number of shifts carried out in parallel, and the above has 4 in very short succession. Esp. for the sign-extend path.

Thank you for the information about ppc. If this is an issue, I think we can do it in a target-dependent way.

So this looks more like an opportunity for a post-vectorizer transform on RTL or for the vectorizer special-casing widening loads with a vectorizer pattern.
I am not sure if the RTL transform is more difficult to implement. I prefer the widening-loads method, which can be detected in a pattern recognizer. The target-related issue will be resolved by only expanding the widening load on those targets where this pattern is beneficial. But this requires new tree operations to be defined. What is your suggestion? I apologize for the delayed reply.

Likewise ;) I suggest to implement this optimization in vector lowering in tree-vect-generic.c. This sees for your example

vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

where you can apply the pattern matching and transform (after checking with the target, of course).

This sounds good to me! I'll try to make a patch following your suggestion. Thank you!

Cong

Richard.

thanks, Cong

Richard.
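[Editorial note: the pslld/psrad replacement discussed in this thread can be modeled in scalar C. This is an illustrative sketch, not code from the patch; the function name is invented, and it assumes a little-endian lane layout and arithmetic right shift of signed integers, as on the x86 targets discussed above.]

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the pslld $0x10 / psrad $0x10 pair: LANE holds
   b[2*i] in its low 16 bits and b[2*i+1] in its high 16 bits.
   Shifting left then arithmetic-right by 16 discards the odd
   element and sign-extends the even short into a full int, which
   is exactly what the pack+unpack sequence computed.  */
int even_short_sext (uint32_t lane)
{
  return (int32_t) (lane << 16) >> 16;
}
```

For example, a lane holding the shorts (-5, 7) as 0x0007FFFB yields -5, matching a sign-extending load of the even element.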
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
OK. Thank you very much for your review, Richard!

thanks, Cong

On Tue, Jun 24, 2014 at 4:19 AM, Richard Biener <richard.guent...@gmail.com> wrote:
On Tue, Dec 3, 2013 at 2:06 AM, Cong Hou <co...@google.com> wrote:
Hi Richard

Could you please take a look at this patch and see if it is ready for the trunk? The patch is pasted as a text file here again.

(found it) The patch is ok for trunk. (please consider re-testing before you commit)

Thanks, Richard.

Thank you very much! Cong

On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <co...@google.com> wrote:
Hi James

Sorry for the late reply.

On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh <james.greenha...@arm.com> wrote:
On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <co...@google.com> wrote:

Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is:

VABAL ops[3], ops[1], ops[2]

It is OK if you accumulate the elements into one in the vector inside the loop (if one instruction can do this), but you have to make sure the other elements in the vector remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and made some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper, please let me know.

Hi Cong,

Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR.
Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either:

tmp = WIDEN_MINUS_EXPR (arg1, arg2)
tmp2 = ABS_EXPR (tmp)
arg3 = PLUS_EXPR (tmp2, arg3)

or:

tmp = WIDEN_MINUS_EXPR (arg1, arg2)
tmp2 = ABS_EXPR (tmp)
arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3.

I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def).

Also, while looking for the history of DOT_PROD_EXPR I spotted this patch:

[autovect] [patch] detect mult-hi and sad patterns
http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

I wonder what the reason was for that patch to be dropped?

It has been 8 years... I have no idea why that patch was never accepted; there is not even a reply in that thread. But I believe the SAD pattern is very important to recognize. ARM also provides instructions for it.

Thank you for your comment again!

thanks, Cong

Thanks, James
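[Editorial note: for readers following the thread, this is the scalar shape of the reduction that the SAD_EXPR pattern targets. A sketch with invented names, not code from the patch; the operand/accumulator types follow the usad description in md.texi quoted above.]

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of absolute differences: narrow (unsigned char) operands,
   widened subtraction and absolute value, accumulated into a
   wider (int) result -- the inner loop of motion estimation in
   multimedia code such as ffmpeg.  */
int sad (const unsigned char *a, const unsigned char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs (a[i] - b[i]);
  return sum;
}
```

With the SAD pattern recognized, the vectorizer can map this whole loop body onto one accumulate instruction per vector (e.g. VABAL on ARM, PSADBW on x86).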
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
It has been 8 months since this patch was posted, and I have addressed all comments on it. The SAD pattern is very useful for multimedia applications such as ffmpeg, and this patch will greatly improve the performance of such code. Could you please have a look again and check if it is OK for the trunk? If necessary I can re-post this patch in a new thread. Thank you!

Cong

On Tue, Dec 17, 2013 at 10:04 AM, Cong Hou <co...@google.com> wrote:
Ping? thanks, Cong

On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou <co...@google.com> wrote:
Hi Richard

Could you please take a look at this patch and see if it is ready for the trunk? The patch is pasted as a text file here again. Thank you very much!

Cong

On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <co...@google.com> wrote:
Hi James

Sorry for the late reply.

On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh <james.greenha...@arm.com> wrote:
On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <co...@google.com> wrote:

Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is:

VABAL ops[3], ops[1], ops[2]

It is OK if you accumulate the elements into one in the vector inside the loop (if one instruction can do this), but you have to make sure the other elements in the vector remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and made some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper, please let me know.
Hi Cong,

Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either:

tmp = WIDEN_MINUS_EXPR (arg1, arg2)
tmp2 = ABS_EXPR (tmp)
arg3 = PLUS_EXPR (tmp2, arg3)

or:

tmp = WIDEN_MINUS_EXPR (arg1, arg2)
tmp2 = ABS_EXPR (tmp)
arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3.

I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def).

Also, while looking for the history of DOT_PROD_EXPR I spotted this patch:

[autovect] [patch] detect mult-hi and sad patterns
http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

I wonder what the reason was for that patch to be dropped?

It has been 8 years... I have no idea why that patch was never accepted; there is not even a reply in that thread. But I believe the SAD pattern is very important to recognize. ARM also provides instructions for it.

Thank you for your comment again!

thanks, Cong

Thanks, James
Re: [PATCH] A new reload-rewrite pattern recognizer for GCC vectorizer.
Ping? thanks, Cong

On Wed, Apr 30, 2014 at 1:28 PM, Cong Hou <co...@google.com> wrote:
Thank you for reminding me of the omp possibility. Yes, in this case my pattern is incorrect, because I assume all aliases will be resolved by alias checks, which may not be true with omp. LOOP_VINFO_NO_DATA_DEPENDENCIES or LOOP_REQUIRES_VERSIONING_FOR_ALIAS may not help here because vect_pattern_recog() is called prior to vect_analyze_data_ref_dependences() in vect_analyze_loop_2(). So can we detect in the pattern recognizer whether omp is used, e.g. by checking loop->force_vectorize? Is there any other case in which my assumption does not hold?

thanks, Cong

On Sat, Apr 26, 2014 at 12:54 AM, Jakub Jelinek <ja...@redhat.com> wrote:
On Thu, Apr 24, 2014 at 05:32:54PM -0700, Cong Hou wrote:

In this patch a new reload-rewrite pattern detector is composed to handle the following pattern in the loop being vectorized:

x = *p;
...
y = *p;

or

*p = x;
...
y = *p;

In both cases, *p is reloaded because there may exist other defs to another memref that may alias with p. However, aliasing is eliminated with alias checks. Then we can safely replace the last statement in the above cases by y = x.

Not safely, at least not for #pragma omp simd/#pragma simd/#pragma ivdep loops if alias analysis hasn't proven there is no aliasing. So, IMNSHO you need to guard this with LOOP_VINFO_NO_DATA_DEPENDENCIES, assuming it has been computed at that point already (otherwise you need to do it elsewhere). Consider:

void foo (int *p, int *q)
{
  int i;
  #pragma omp simd safelen(16)
  for (i = 0; i < 128; i++)
    {
      int x = *p;
      *q += 8;
      *p = *p + x;
      p++;
      q++;
    }
}

It is valid to call the above with completely unrelated p and q, but also e.g. p == q, or q == p + 16 or p == q + 16. Your patch would certainly break it e.g. for p == q.

Jakub
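[Editorial note: Jakub's objection reduces to a two-load scalar sketch, shown here for illustration with an invented name. When p and q may alias, the intervening store makes the second load of *p observably different from the first, so rewriting it to the cached value changes behavior.]

```c
#include <assert.h>

/* Why the reload of *p cannot be replaced by the earlier value
   unless non-aliasing is proven: with p == q the store through q
   modifies *p between the two reads.  Returns the difference the
   reload observes.  */
int reload_difference (int *p, int *q)
{
  int x = *p;   /* first load */
  *q += 8;      /* possibly-aliasing store */
  int y = *p;   /* reload: y != x when p aliases q */
  return y - x;
}
```

With distinct objects the function returns 0 and the rewrite y = x is safe; with p == q it returns 8, which is exactly the case the omp simd example permits.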
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
On Thu, 24 Apr 2014, Cong Hou wrote:

Given the following loop:

int a[N];
short b[N*2];
for (int i = 0; i < N; ++i)
  a[i] = b[i*2];

After being vectorized, the access to b[i*2] will be compiled into several packing statements, while the type promotion from short to int will be compiled into several unpacking statements. With this patch, each pair of pack/unpack statements will be replaced by less expensive statements (with shift or bit-and operations).

On x86_64, the loop above will be compiled into the following assembly (with -O2 -ftree-vectorize):

movdqu 0x10(%rcx),%xmm3
movdqu -0x20(%rcx),%xmm0
movdqa %xmm0,%xmm2
punpcklwd %xmm3,%xmm0
punpckhwd %xmm3,%xmm2
movdqa %xmm0,%xmm3
punpcklwd %xmm2,%xmm0
punpckhwd %xmm2,%xmm3
movdqa %xmm1,%xmm2
punpcklwd %xmm3,%xmm0
pcmpgtw %xmm0,%xmm2
movdqa %xmm0,%xmm3
punpckhwd %xmm2,%xmm0
punpcklwd %xmm2,%xmm3
movups %xmm0,-0x10(%rdx)
movups %xmm3,-0x20(%rdx)

With this patch, the generated assembly is shown below:

movdqu 0x10(%rcx),%xmm0
movdqu -0x20(%rcx),%xmm1
pslld $0x10,%xmm0
psrad $0x10,%xmm0
pslld $0x10,%xmm1
movups %xmm0,-0x10(%rdx)
psrad $0x10,%xmm1
movups %xmm1,-0x20(%rdx)

Bootstrapped and tested on x86-64. OK for trunk?

This is an odd place to implement such a transform. Also whether it is faster or not depends on the exact ISA you target - for example ppc has constraints on the maximum number of shifts carried out in parallel, and the above has 4 in very short succession. Esp. for the sign-extend path.

Thank you for the information about ppc. If this is an issue, I think we can do it in a target-dependent way.

So this looks more like an opportunity for a post-vectorizer transform on RTL or for the vectorizer special-casing widening loads with a vectorizer pattern.

I am not sure if the RTL transform is more difficult to implement. I prefer the widening-loads method, which can be detected in a pattern recognizer.
The target-related issue will be resolved by only expanding the widening load on those targets where this pattern is beneficial. But this requires new tree operations to be defined. What is your suggestion? I apologize for the delayed reply.

thanks, Cong

Richard.
Re: [PATCH] A new reload-rewrite pattern recognizer for GCC vectorizer.
Thank you for reminding me of the omp possibility. Yes, in this case my pattern is incorrect, because I assume all aliases will be resolved by alias checks, which may not be true with omp. LOOP_VINFO_NO_DATA_DEPENDENCIES or LOOP_REQUIRES_VERSIONING_FOR_ALIAS may not help here because vect_pattern_recog() is called prior to vect_analyze_data_ref_dependences() in vect_analyze_loop_2(). So can we detect in the pattern recognizer whether omp is used, e.g. by checking loop->force_vectorize? Is there any other case in which my assumption does not hold?

thanks, Cong

On Sat, Apr 26, 2014 at 12:54 AM, Jakub Jelinek <ja...@redhat.com> wrote:
On Thu, Apr 24, 2014 at 05:32:54PM -0700, Cong Hou wrote:

In this patch a new reload-rewrite pattern detector is composed to handle the following pattern in the loop being vectorized:

x = *p;
...
y = *p;

or

*p = x;
...
y = *p;

In both cases, *p is reloaded because there may exist other defs to another memref that may alias with p. However, aliasing is eliminated with alias checks. Then we can safely replace the last statement in the above cases by y = x.

Not safely, at least not for #pragma omp simd/#pragma simd/#pragma ivdep loops if alias analysis hasn't proven there is no aliasing. So, IMNSHO you need to guard this with LOOP_VINFO_NO_DATA_DEPENDENCIES, assuming it has been computed at that point already (otherwise you need to do it elsewhere). Consider:

void foo (int *p, int *q)
{
  int i;
  #pragma omp simd safelen(16)
  for (i = 0; i < 128; i++)
    {
      int x = *p;
      *q += 8;
      *p = *p + x;
      p++;
      q++;
    }
}

It is valid to call the above with completely unrelated p and q, but also e.g. p == q, or q == p + 16 or p == q + 16. Your patch would certainly break it e.g. for p == q.

Jakub
[PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
Given the following loop:

int a[N];
short b[N*2];
for (int i = 0; i < N; ++i)
  a[i] = b[i*2];

After being vectorized, the access to b[i*2] will be compiled into several packing statements, while the type promotion from short to int will be compiled into several unpacking statements. With this patch, each pair of pack/unpack statements will be replaced by less expensive statements (with shift or bit-and operations).

On x86_64, the loop above will be compiled into the following assembly (with -O2 -ftree-vectorize):

movdqu 0x10(%rcx),%xmm3
movdqu -0x20(%rcx),%xmm0
movdqa %xmm0,%xmm2
punpcklwd %xmm3,%xmm0
punpckhwd %xmm3,%xmm2
movdqa %xmm0,%xmm3
punpcklwd %xmm2,%xmm0
punpckhwd %xmm2,%xmm3
movdqa %xmm1,%xmm2
punpcklwd %xmm3,%xmm0
pcmpgtw %xmm0,%xmm2
movdqa %xmm0,%xmm3
punpckhwd %xmm2,%xmm0
punpcklwd %xmm2,%xmm3
movups %xmm0,-0x10(%rdx)
movups %xmm3,-0x20(%rdx)

With this patch, the generated assembly is shown below:

movdqu 0x10(%rcx),%xmm0
movdqu -0x20(%rcx),%xmm1
pslld $0x10,%xmm0
psrad $0x10,%xmm0
pslld $0x10,%xmm1
movups %xmm0,-0x10(%rdx)
psrad $0x10,%xmm1
movups %xmm1,-0x20(%rdx)

Bootstrapped and tested on x86-64. OK for trunk?

thanks, Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..e7143f1 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	* tree-vect-stmts.c (detect_pack_unpack_pattern): New function.
+	(vect_gen_widened_results_half): Call detect_pack_unpack_pattern.
+
 2014-04-23  David Malcolm  <dmalc...@redhat.com>
 
 	* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..a8755b3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	* gcc.dg/vect/vect-125.c: New test.
+
 2014-04-23  Jeff Law  <l...@redhat.com>
 
 	PR tree-optimization/60902
diff --git a/gcc/testsuite/gcc.dg/vect/vect-125.c b/gcc/testsuite/gcc.dg/vect/vect-125.c
new file mode 100644
index 000..988dea6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-125.c
@@ -0,0 +1,122 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <limits.h>
+#include "tree-vect.h"
+
+#define N 64
+
+char b[N];
+unsigned char c[N];
+short d[N];
+unsigned short e[N];
+
+__attribute__((noinline)) void
+test1 ()
+{
+  int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = b[i*2];
+      a[i+N/2] = b[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = c[i*2];
+      a[i+N/2] = c[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = d[i*2];
+      a[i+N/2] = d[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = e[i*2];
+      a[i+N/2] = e[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+      abort ();
+}
+
+__attribute__((noinline)) void
+test2 ()
+{
+  unsigned int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = b[i*2];
+      a[i+N/2] = b[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = c[i*2];
+      a[i+N/2] = c[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = d[i*2];
+      a[i+N/2] = d[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = e[i*2];
+      a[i+N/2] = e[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+      abort ();
+}
+
+int
+main ()
+{
+  b[0] = CHAR_MIN;
+  c[0] = UCHAR_MAX;
+  d[0] = SHRT_MIN;
+  e[0] = USHRT_MAX;
+
+  int i;
+  for (i = 1; i < N; i++)
+    {
+      b[i] = b[i-1] + 1;
+      c[i] = c[i-1] - 1;
+      d[i] = d[i-1] + 1;
+      e[i] = e[i-1] - 1;
+    }
+
+  test1 ();
+  test2 ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "A pack-unpack pattern is recognized" 32 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..d0cf1f4 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -3191,6 +3191,174 @@ vectorizable_simd_clone_call (gimple stmt, gimple_stmt_iterator *gsi,
 }
 
+/* Function detect_pack_unpack_pattern
+
+   Detect the following pattern:
+
+   S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 0, 2, 4, ... }>;
+   or
+   S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 1, 3, 5, ... }>;
+
+   S2  vect4 = [vec_unpack_lo_expr
[PATCH] A new reload-rewrite pattern recognizer for GCC vectorizer.
In this patch a new reload-rewrite pattern detector is composed to handle the following pattern in the loop being vectorized:

x = *p;
...
y = *p;

or

*p = x;
...
y = *p;

In both cases, *p is reloaded because there may exist other defs to another memref that may alias with p. However, aliasing is eliminated with alias checks. Then we can safely replace the last statement in the above cases by y = x.

The following rewrite pattern is also detected:

*p = x;
...
*p = y;

The first write is redundant due to the fact that there is no aliasing between p and other pointers. In this case we don't need to vectorize this write. Here we replace it with a dummy statement z = x.

Bootstrapped and tested on x86-64. OK for trunk?

thanks, Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..59a4388 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	* tree-vect-patterns.c (vect_recog_reload_rewrite_pattern):
+	New function.
+	(vect_vect_recog_func_ptrs): Add new pattern.
+	* tree-vectorizer.h (NUM_PATTERNS): Update the pattern count.
+
 2014-04-23  David Malcolm  <dmalc...@redhat.com>
 
 	* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..2116cd3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	* gcc.dg/vect/vect-reload-rewrite-pattern.c: New test.
+
 2014-04-23  Jeff Law  <l...@redhat.com>
 
 	PR tree-optimization/60902
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reload-rewrite-pattern.c b/gcc/testsuite/gcc.dg/vect/vect-reload-rewrite-pattern.c
new file mode 100644
index 000..e75f969
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reload-rewrite-pattern.c
@@ -0,0 +1,61 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+#define N 1000
+int a[N];
+
+void test1 (int *b, int *c)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+    {
+      a[i] = c[i];
+      /* Reload of c[i]. */
+      b[i] = c[i];
+    }
+}
+
+void test2 (int *b, int *c)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+    {
+      c[i] = a[i] + 10;
+      /* Reload of a[i]. */
+      a[i]++;
+      /* Reload of c[i]. */
+      b[i] = c[i];
+    }
+}
+
+void test3 (int *b, int *c)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+    {
+      c[i] = a[i] & 63;
+      /* Reload of a[i]. */
+      a[i]++;
+      /* Reload of c[i]. */
+      /* Rewrite to c[i]. */
+      c[i]--;
+    }
+}
+
+void test4 (_Complex int *b, _Complex int *c, _Complex int *d)
+{
+  int i;
+  for (i = 0; i < N; ++i)
+    {
+      b[i] = c[i] + d[i];
+      /* Reload of REALPART_EXPR (c[i]). */
+      /* Reload of IMAGPART_EXPR (c[i]). */
+      /* Reload of REALPART_EXPR (d[i]). */
+      /* Reload of IMAGPART_EXPR (d[i]). */
+      c[i] = c[i] - d[i];
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_reload_rewrite_pattern: detected" 10 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 5daaf24..38a0fec 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ssa-iterators.h"
 #include "stringpool.h"
 #include "tree-ssanames.h"
+#include "tree-ssa-sccvn.h"
 #include "cfgloop.h"
 #include "expr.h"
 #include "optabs.h"
@@ -70,6 +71,7 @@ static gimple vect_recog_divmod_pattern (vec<gimple> *,
 static gimple vect_recog_mixed_size_cond_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_bool_pattern (vec<gimple> *, tree *, tree *);
+static gimple vect_recog_reload_rewrite_pattern (vec<gimple> *, tree *, tree *);
 
 static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_widen_mult_pattern,
 	vect_recog_widen_sum_pattern,
@@ -81,6 +83,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_vector_vector_shift_pattern,
 	vect_recog_divmod_pattern,
 	vect_recog_mixed_size_cond_pattern,
+	vect_recog_reload_rewrite_pattern,
 	vect_recog_bool_pattern};
 
 static inline void
@@ -3019,6 +3022,160 @@ vect_recog_bool_pattern (vec<gimple> *stmts, tree *type_in,
   return NULL;
 }
 
+/* Function vect_recog_reload_rewrite_pattern
+
+   Try to find the following reload pattern:
+
+   x = *p;
+   ...
+   y = *p;
+
+   or
+
+   *p = x;
+   ...
+   y = *p;
+
+   In both cases, *p is reloaded because there may exist other defs to another
+   memref that may alias with p.  However, aliasing is eliminated with alias
+   checks.  Then we can safely replace the last statement in above cases by
+   y = x.
+
+   Also try to detect rewrite pattern:
+
+   *p = x;
+   ...
+   *p = y
[PATCH] Fix PR60896
See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60896 for the bug report. The cause of PR60896 is that the statements in the PATTERN_DEF_SEQ of a pre-recognized widen-mult pattern are not forwarded to the later recognized dot-product pattern. Another issue is that the def types of statements in PATTERN_DEF_SEQ are assigned the def type of the pattern statement. This is incorrect for a reduction pattern statement, in which case all statements in PATTERN_DEF_SEQ would become vect_reduction_def and none of them would be vectorized later. The def type of a statement in PATTERN_DEF_SEQ should always be vect_internal_def.

The patch is attached. Bootstrapped and tested on an x86_64 machine. OK for trunk?

thanks, Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..0af5e16 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,11 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	PR tree-optimization/60896
+	* tree-vect-patterns.c (vect_recog_dot_prod_pattern): Pick up
+	all statements in PATTERN_DEF_SEQ in recognized widen-mult pattern.
+	(vect_mark_pattern_stmts): Set the def type of all statements in
+	PATTERN_DEF_SEQ as vect_internal_def.
+
 2014-04-23  David Malcolm  <dmalc...@redhat.com>
 
 	* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..55bc842 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	PR tree-optimization/60896
+	* g++.dg/vect/pr60896.cc: New test.
+
 2014-04-23  Jeff Law  <l...@redhat.com>
 
 	PR tree-optimization/60902
diff --git a/gcc/testsuite/g++.dg/vect/pr60896.cc b/gcc/testsuite/g++.dg/vect/pr60896.cc
new file mode 100644
index 000..c6ce68b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr60896.cc
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+struct A
+{
+  int m_fn1 ();
+  short *m_fn2 ();
+};
+
+struct B
+{
+  void *fC;
+};
+
+int a, b;
+unsigned char i;
+void fn1 (unsigned char *p1, A p2)
+{
+  int c = p2.m_fn1 ();
+  for (int d = 0; c; d++)
+    {
+      short *e = p2.m_fn2 ();
+      unsigned char *f = &p1[0];
+      for (int g = 0; g < a; g++)
+	{
+	  int h = e[0];
+	  b += h * f[g];
+	}
+    }
+}
+
+void fn2 (A p1, A p2, B p3)
+{
+  int j = p2.m_fn1 ();
+  for (int k = 0; j; k++)
+    if (0)
+      ;
+    else
+      fn1 (&i, p1);
+  if (p3.fC)
+    ;
+  else
+    ;
+}
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 5daaf24..365cf01 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -392,6 +392,8 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
       gcc_assert (STMT_VINFO_DEF_TYPE (stmt_vinfo) == vect_internal_def);
       oprnd00 = gimple_assign_rhs1 (stmt);
       oprnd01 = gimple_assign_rhs2 (stmt);
+      STMT_VINFO_PATTERN_DEF_SEQ (vinfo_for_stmt (last_stmt))
+	= STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
     }
   else
     {
@@ -3065,8 +3067,7 @@ vect_mark_pattern_stmts (gimple orig_stmt, gimple pattern_stmt,
     }
   gimple_set_bb (def_stmt, gimple_bb (orig_stmt));
   STMT_VINFO_RELATED_STMT (def_stmt_info) = orig_stmt;
-  STMT_VINFO_DEF_TYPE (def_stmt_info)
-    = STMT_VINFO_DEF_TYPE (orig_stmt_info);
+  STMT_VINFO_DEF_TYPE (def_stmt_info) = vect_internal_def;
   if (STMT_VINFO_VECTYPE (def_stmt_info) == NULL_TREE)
     STMT_VINFO_VECTYPE (def_stmt_info) = pattern_vectype;
 }
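[Editorial note: the inner loop of fn1 in the PR60896 testcase is the shape matched by the widen-mult pattern feeding DOT_PROD_EXPR. A standalone scalar sketch of that multiply-accumulate reduction, with invented names and the general elementwise form rather than the PR's reduced e[0] access:]

```c
#include <assert.h>

/* Scalar form of the widen-mult + dot-product chain: a short
   operand and an unsigned char operand are widened, multiplied,
   and accumulated into an int result.  */
int dot_prod (const short *e, const unsigned char *f, int n)
{
  int sum = 0;
  for (int g = 0; g < n; g++)
    sum += e[g] * f[g];
  return sum;
}
```

The fix above makes sure the widening statements produced while recognizing the multiply are carried along when the surrounding reduction is later recognized as a dot product, so loops of this shape still vectorize.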
Re: Fixing PR60773
Thanks for the comments; the attached file is the updated patch.

thanks, Cong

On Tue, Apr 8, 2014 at 12:28 AM, Rainer Orth <r...@cebitec.uni-bielefeld.de> wrote:
Cong Hou <co...@google.com> writes:

In the patch for PR60656 (http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01668.html), the test case requires GCC to vectorize the widen-mult pattern from si to di types. This may result in test failures on platforms that don't support this pattern. This patch adds a new target keyword, vect_widen_mult_si_to_di_pattern, to fix this issue.

Please document the new keyword in gcc/doc/sourcebuild.texi.

diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 414a745..ea860e7 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,11 @@
+2014-04-07  Cong Hou  <co...@google.com>
+
+	PR testsuite/60773
+	* testsuite/lib/target-supports.exp:
+	Add check_effective_target_vect_widen_mult_si_to_di_pattern.
+	* gcc.dg/vect/pr60656.c: Update the test by checking if the targets
+	vect_widen_mult_si_to_di_pattern and vect_long are supported.
+

Your mailer is broken: it swallows tabs and breaks long lines. If you can't fix it, please attach patches instead of sending them inline. Thanks.

Rainer

--
Rainer Orth, Center for Biotechnology, Bielefeld University

diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index 85ef819..9148608 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -1428,6 +1428,10 @@ Target supports a vector widening multiplication of @code{short} operands
 into @code{int} results, or can promote (unpack) from @code{short} to
 @code{int} and perform non-widening multiplication of @code{int}.
 
+@item vect_widen_mult_si_to_di_pattern
+Target supports a vector widening multiplication of @code{int} operands
+into @code{long} results.
+
 @item vect_sdot_qi
 Target supports a vector dot-product of @code{signed char}.
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 414a745..d426e29 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,11 @@ +2014-04-07 Cong Hou co...@google.com + + PR testsuite/60773 + * lib/target-supports.exp: + Add check_effective_target_vect_widen_mult_si_to_di_pattern. + * gcc.dg/vect/pr60656.c: Update the test by checking if the targets + vect_widen_mult_si_to_di_pattern and vect_long are supported. + 2014-03-28 Cong Hou co...@google.com PR tree-optimization/60656 diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c index ebaab62..4950275 100644 --- a/gcc/testsuite/gcc.dg/vect/pr60656.c +++ b/gcc/testsuite/gcc.dg/vect/pr60656.c @@ -1,4 +1,5 @@ /* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_long } */ #include "tree-vect.h" @@ -12,7 +13,7 @@ foo () for (i = 0; i < 4; ++i) { long P = v[i]; - s += P*P*P; + s += P * P * P; } return s; } @@ -27,7 +28,7 @@ bar () for (i = 0; i < 4; ++i) { long P = v[i]; - s += P*P*P; + s += P * P * P; __asm__ volatile (""); } return s; @@ -35,11 +36,12 @@ bar () int main() { + check_vect (); + if (foo () != bar ()) abort (); return 0; } -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_si_to_di_pattern } } } */ /* { dg-final { cleanup-tree-dump "vect" } } */ - diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index bee8471..6d9d689 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -3732,6 +3732,27 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } { } # Return 1 if the target plus current options supports a vector +# widening multiplication of *int* args into *long* result, 0 otherwise. +# +# This won't change for different subtargets so cache the result.
+ +proc check_effective_target_vect_widen_mult_si_to_di_pattern { } { +    global et_vect_widen_mult_si_to_di_pattern + +    if [info exists et_vect_widen_mult_si_to_di_pattern_saved] { +	verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: using cached result" 2 +    } else { +	set et_vect_widen_mult_si_to_di_pattern_saved 0 +	if { [istarget ia64-*-*] +	     || [istarget i?86-*-*] +	     || [istarget x86_64-*-*] } { +	    set et_vect_widen_mult_si_to_di_pattern_saved 1 +	} +    } +    verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: returning $et_vect_widen_mult_si_to_di_pattern_saved" 2 +    return $et_vect_widen_mult_si_to_di_pattern_saved +} + +# Return 1 if the target plus current options supports a vector # widening shift, 0 otherwise. # # This won't change for different subtargets so cache the result.
Re: Fixing PR60773
On Tue, Apr 8, 2014 at 12:07 AM, Jakub Jelinek ja...@redhat.com wrote: On Mon, Apr 07, 2014 at 12:16:12PM -0700, Cong Hou wrote: --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,11 @@ +2014-04-07 Cong Hou co...@google.com + + PR testsuite/60773 + * testsuite/lib/target-supports.exp: + Add check_effective_target_vect_widen_mult_si_to_di_pattern. No testsuite/ prefix here. Please write it as: * lib/target-supports.exp (check_effective_target_vect_widen_si_to_di_pattern): New. Thank you for pointing it out. Corrected. --- a/gcc/testsuite/gcc.dg/vect/pr60656.c +++ b/gcc/testsuite/gcc.dg/vect/pr60656.c @@ -1,5 +1,7 @@ /* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_long } */ +#include <stdarg.h> I fail to see why you need this include, neither your test nor tree-vect.h uses va_*. I have removed this include. thanks, Cong Otherwise looks good to me. Jakub
Fixing PR60773
In the patch of PR60656 (http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01668.html), the test case requires GCC to vectorize the widen-mult pattern from si to di types. This may result in test failures on some platforms that don't support this pattern. This patch adds a new target vect_widen_mult_si_to_di_pattern to fix this issue. Bootstrapped and tested on x86_64. OK for trunk? thanks, Cong diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 414a745..ea860e7 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,11 @@ +2014-04-07 Cong Hou co...@google.com + + PR testsuite/60773 + * testsuite/lib/target-supports.exp: + Add check_effective_target_vect_widen_mult_si_to_di_pattern. + * gcc.dg/vect/pr60656.c: Update the test by checking if the targets + vect_widen_mult_si_to_di_pattern and vect_long are supported. + 2014-03-28 Cong Hou co...@google.com PR tree-optimization/60656 diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c index ebaab62..b80e008 100644 --- a/gcc/testsuite/gcc.dg/vect/pr60656.c +++ b/gcc/testsuite/gcc.dg/vect/pr60656.c @@ -1,5 +1,7 @@ /* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_long } */ +#include <stdarg.h> #include "tree-vect.h" __attribute__ ((noinline)) long @@ -12,7 +14,7 @@ foo () for (i = 0; i < 4; ++i) { long P = v[i]; - s += P*P*P; + s += P * P * P; } return s; } @@ -27,7 +29,7 @@ bar () for (i = 0; i < 4; ++i) { long P = v[i]; - s += P*P*P; + s += P * P * P; __asm__ volatile (""); } return s; @@ -35,11 +37,12 @@ bar () int main() { + check_vect (); + if (foo () != bar ()) abort (); return 0; } -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_si_to_di_pattern } } } */ /* { dg-final { cleanup-tree-dump "vect" } } */ - diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index bee8471..6d9d689
100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -3732,6 +3732,27 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } { } # Return 1 if the target plus current options supports a vector +# widening multiplication of *int* args into *long* result, 0 otherwise. +# +# This won't change for different subtargets so cache the result. + +proc check_effective_target_vect_widen_mult_si_to_di_pattern { } { +    global et_vect_widen_mult_si_to_di_pattern + +    if [info exists et_vect_widen_mult_si_to_di_pattern_saved] { +	verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: using cached result" 2 +    } else { +	set et_vect_widen_mult_si_to_di_pattern_saved 0 +	if { [istarget ia64-*-*] +	     || [istarget i?86-*-*] +	     || [istarget x86_64-*-*] } { +	    set et_vect_widen_mult_si_to_di_pattern_saved 1 +	} +    } +    verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: returning $et_vect_widen_mult_si_to_di_pattern_saved" 2 +    return $et_vect_widen_mult_si_to_di_pattern_saved +} + +# Return 1 if the target plus current options supports a vector # widening shift, 0 otherwise. # # This won't change for different subtargets so cache the result.
[PATCH] Fixing PR60656
This patch is fixing PR60656. Elements in a vector with the vect_used_by_reduction property cannot be reordered if the use chain with this property does not have the same operation. Bootstrapped and tested on an x86-64 machine. OK for trunk? thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index e1d8666..d7d5b82 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,11 @@ +2014-03-28 Cong Hou co...@google.com + + PR tree-optimization/60656 + * tree-vect-stmts.c (supportable_widening_operation): + Fix a bug where elements in a vector with the vect_used_by_reduction + property are incorrectly reordered when the operation on them is not + consistent with the one in the reduction operation. + 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 41b6875..414a745 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-03-28 Cong Hou co...@google.com + + PR tree-optimization/60656 + * gcc.dg/vect/pr60656.c: New test.
+ 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c new file mode 100644 index 000..ebaab62 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr60656.c @@ -0,0 +1,45 @@ +/* { dg-require-effective-target vect_int } */ + +#include "tree-vect.h" + +__attribute__ ((noinline)) long +foo () +{ + int v[] = {5000, 5001, 5002, 5003}; + long s = 0; + int i; + + for (i = 0; i < 4; ++i) +{ + long P = v[i]; + s += P*P*P; +} + return s; +} + +long +bar () +{ + int v[] = {5000, 5001, 5002, 5003}; + long s = 0; + int i; + + for (i = 0; i < 4; ++i) +{ + long P = v[i]; + s += P*P*P; + __asm__ volatile (""); +} + return s; +} + +int main() +{ + if (foo () != bar ()) +abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ + diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index 70fb411..7442d0c 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -7827,7 +7827,16 @@ supportable_widening_operation (enum tree_code code, gimple stmt, stmt, vectype_out, vectype_in, code1, code2, multi_step_cvt, interm_types)) - return true; +{ + tree lhs = gimple_assign_lhs (stmt); + use_operand_p dummy; + gimple use_stmt; + stmt_vec_info use_stmt_info = NULL; + if (single_imm_use (lhs, &dummy, &use_stmt) + && (use_stmt_info = vinfo_for_stmt (use_stmt)) + && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def) +return true; +} c1 = VEC_WIDEN_MULT_LO_EXPR; c2 = VEC_WIDEN_MULT_HI_EXPR; break;
Re: [PATCH] Fix PR60505
Ping? thanks, Cong On Wed, Mar 19, 2014 at 11:39 AM, Cong Hou co...@google.com wrote: On Tue, Mar 18, 2014 at 4:43 AM, Richard Biener rguent...@suse.de wrote: On Mon, 17 Mar 2014, Cong Hou wrote: On Mon, Mar 17, 2014 at 6:44 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Cong Hou wrote: On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Jakub Jelinek wrote: On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote: Consider this fact and if there are alias checks, we can safely remove the epilogue if the maximum trip count of the loop is less than or equal to the calculated threshold. You have to consider n % vf != 0, so an argument on only maximum trip count or threshold cannot work. Well, if you only check if maximum trip count is = vf and you know that for n vf the vectorized loop + it's epilogue path will not be taken, then perhaps you could, but it is a very special case. Now, the question is when we are guaranteed we enter the scalar versioned loop instead for n vf, is that in case of versioning for alias or versioning for alignment? I think neither - I have plans to do the cost model check together with the versioning condition but didn't get around to implement that. That would allow stronger max bounds for the epilogue loop. In vect_transform_loop(), check_profitability will be set to true if th = VF-1 and the number of iteration is unknown (we only consider unknown trip count here), where th is calculated based on the parameter PARAM_MIN_VECT_LOOP_BOUND and cost model, with the minimum value VF-1. If the loop needs to be versioned, then check_profitability with true value will be passed to vect_loop_versioning(), in which an enhanced loop bound check (considering cost) will be built. So I think if the loop is versioned and n VF, then we must enter the scalar version, and in this case removing epilogue should be safe when the maximum trip count = th+1. 
You mean exactly in the case where the profitability check ensures that n % vf == 0? Thus effectively if n == maximum trip count? That's quite a special case, no? Yes, it is a special case. But it is in this special case that those warnings are thrown out. Also, I think declaring an array with VF*N as length is not unusual. Ok, but then for the patch compute the cost model threshold once in vect_analyze_loop_2 and store it in a new LOOP_VINFO_COST_MODEL_THRESHOLD. Done. Also you have to check the return value from max_stmt_executions_int as that may return -1 if the number cannot be computed (or isn't representable in a HOST_WIDE_INT). It will be converted to unsigned type so that -1 means infinity. You also should check for LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT which should have the same effect on the cost model check. Done. The existing condition is already complicated enough - adding new stuff warrants comments before the (sub-)checks. OK. Comments added. Below is the revised patch. Bootstrapped and tested on a x86-64 machine. Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index e1d8666..eceefb3 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,18 @@ +2014-03-11 Cong Hou co...@google.com + + PR tree-optimization/60505 + * tree-vectorizer.h (struct _stmt_vec_info): Add th field as the + threshold of number of iterations below which no vectorization will be + done. + * tree-vect-loop.c (new_loop_vec_info): + Initialize LOOP_VINFO_COST_MODEL_THRESHOLD. + * tree-vect-loop.c (vect_analyze_loop_operations): + Set LOOP_VINFO_COST_MODEL_THRESHOLD. + * tree-vect-loop.c (vect_transform_loop): + Use LOOP_VINFO_COST_MODEL_THRESHOLD. + * tree-vect-loop.c (vect_analyze_loop_2): Check the maximum number + of iterations of the loop and see if we should build the epilogue. 
+ 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 41b6875..09ec1c0 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-03-11 Cong Hou co...@google.com + + PR tree-optimization/60505 + * gcc.dg/vect/pr60505.c: New test. + 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/gcc.dg/vect/pr60505.c b/gcc/testsuite/gcc.dg/vect/pr60505.c new file mode 100644 index 000..6940513 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr60505.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-Wall -Werror" } */ + +void foo(char *in, char *out, int num) +{ + int i; + char ovec[16] = {0}; + + for (i = 0; i < num; ++i) +out[i] = (ovec[i] = in[i]); + out[num] = ovec[num/2]; +} diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index df6ab6f..1c78e11 100644
Re: [PATCH] Fix PR60505
On Tue, Mar 18, 2014 at 4:43 AM, Richard Biener rguent...@suse.de wrote: On Mon, 17 Mar 2014, Cong Hou wrote: On Mon, Mar 17, 2014 at 6:44 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Cong Hou wrote: On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Jakub Jelinek wrote: On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote: Consider this fact and if there are alias checks, we can safely remove the epilogue if the maximum trip count of the loop is less than or equal to the calculated threshold. You have to consider n % vf != 0, so an argument on only maximum trip count or threshold cannot work. Well, if you only check if maximum trip count is = vf and you know that for n vf the vectorized loop + it's epilogue path will not be taken, then perhaps you could, but it is a very special case. Now, the question is when we are guaranteed we enter the scalar versioned loop instead for n vf, is that in case of versioning for alias or versioning for alignment? I think neither - I have plans to do the cost model check together with the versioning condition but didn't get around to implement that. That would allow stronger max bounds for the epilogue loop. In vect_transform_loop(), check_profitability will be set to true if th = VF-1 and the number of iteration is unknown (we only consider unknown trip count here), where th is calculated based on the parameter PARAM_MIN_VECT_LOOP_BOUND and cost model, with the minimum value VF-1. If the loop needs to be versioned, then check_profitability with true value will be passed to vect_loop_versioning(), in which an enhanced loop bound check (considering cost) will be built. So I think if the loop is versioned and n VF, then we must enter the scalar version, and in this case removing epilogue should be safe when the maximum trip count = th+1. You mean exactly in the case where the profitability check ensures that n % vf == 0? 
Thus effectively if n == maximum trip count? That's quite a special case, no? Yes, it is a special case. But it is in this special case that those warnings are thrown out. Also, I think declaring an array with VF*N as length is not unusual. Ok, but then for the patch compute the cost model threshold once in vect_analyze_loop_2 and store it in a new LOOP_VINFO_COST_MODEL_THRESHOLD. Done. Also you have to check the return value from max_stmt_executions_int as that may return -1 if the number cannot be computed (or isn't representable in a HOST_WIDE_INT). It will be converted to unsigned type so that -1 means infinity. You also should check for LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT which should have the same effect on the cost model check. Done. The existing condition is already complicated enough - adding new stuff warrants comments before the (sub-)checks. OK. Comments added. Below is the revised patch. Bootstrapped and tested on a x86-64 machine. Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index e1d8666..eceefb3 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,18 @@ +2014-03-11 Cong Hou co...@google.com + + PR tree-optimization/60505 + * tree-vectorizer.h (struct _stmt_vec_info): Add th field as the + threshold of number of iterations below which no vectorization will be + done. + * tree-vect-loop.c (new_loop_vec_info): + Initialize LOOP_VINFO_COST_MODEL_THRESHOLD. + * tree-vect-loop.c (vect_analyze_loop_operations): + Set LOOP_VINFO_COST_MODEL_THRESHOLD. + * tree-vect-loop.c (vect_transform_loop): + Use LOOP_VINFO_COST_MODEL_THRESHOLD. + * tree-vect-loop.c (vect_analyze_loop_2): Check the maximum number + of iterations of the loop and see if we should build the epilogue. 
+ 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 41b6875..09ec1c0 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-03-11 Cong Hou co...@google.com + + PR tree-optimization/60505 + * gcc.dg/vect/pr60505.c: New test. + 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/gcc.dg/vect/pr60505.c b/gcc/testsuite/gcc.dg/vect/pr60505.c new file mode 100644 index 000..6940513 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr60505.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-Wall -Werror" } */ + +void foo(char *in, char *out, int num) +{ + int i; + char ovec[16] = {0}; + + for (i = 0; i < num; ++i) +out[i] = (ovec[i] = in[i]); + out[num] = ovec[num/2]; +} diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index df6ab6f..1c78e11 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -933,6 +933,7 @@ new_loop_vec_info (struct loop *loop) LOOP_VINFO_NITERS (res) = NULL
Re: [PATCH] Fix PR60505
On Mon, Mar 17, 2014 at 6:44 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Cong Hou wrote: On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Jakub Jelinek wrote: On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote: Consider this fact and if there are alias checks, we can safely remove the epilogue if the maximum trip count of the loop is less than or equal to the calculated threshold. You have to consider n % vf != 0, so an argument on only maximum trip count or threshold cannot work. Well, if you only check if the maximum trip count is <= vf and you know that for n < vf the vectorized loop + its epilogue path will not be taken, then perhaps you could, but it is a very special case. Now, the question is when we are guaranteed we enter the scalar versioned loop instead for n < vf, is that in the case of versioning for alias or versioning for alignment? I think neither - I have plans to do the cost model check together with the versioning condition but didn't get around to implementing that. That would allow stronger max bounds for the epilogue loop. In vect_transform_loop(), check_profitability will be set to true if th >= VF-1 and the number of iterations is unknown (we only consider unknown trip count here), where th is calculated based on the parameter PARAM_MIN_VECT_LOOP_BOUND and the cost model, with the minimum value VF-1. If the loop needs to be versioned, then check_profitability with true value will be passed to vect_loop_versioning(), in which an enhanced loop bound check (considering cost) will be built. So I think if the loop is versioned and n < VF, then we must enter the scalar version, and in this case removing the epilogue should be safe when the maximum trip count is <= th+1. You mean exactly in the case where the profitability check ensures that n % vf == 0?
But it is in this special case that those warnings are thrown out. Also, I think declaring an array with VF*N as length is not unusual. thanks, Cong Richard. -- Richard Biener rguent...@suse.de SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imendorffer
Re: [PATCH] Fix PR60505
On Fri, Mar 14, 2014 at 12:58 AM, Richard Biener rguent...@suse.de wrote: On Fri, 14 Mar 2014, Jakub Jelinek wrote: On Fri, Mar 14, 2014 at 08:52:07AM +0100, Richard Biener wrote: Consider this fact and if there are alias checks, we can safely remove the epilogue if the maximum trip count of the loop is less than or equal to the calculated threshold. You have to consider n % vf != 0, so an argument on only maximum trip count or threshold cannot work. Well, if you only check if the maximum trip count is <= vf and you know that for n < vf the vectorized loop + its epilogue path will not be taken, then perhaps you could, but it is a very special case. Now, the question is when we are guaranteed we enter the scalar versioned loop instead for n < vf, is that in the case of versioning for alias or versioning for alignment? I think neither - I have plans to do the cost model check together with the versioning condition but didn't get around to implementing that. That would allow stronger max bounds for the epilogue loop. In vect_transform_loop(), check_profitability will be set to true if th >= VF-1 and the number of iterations is unknown (we only consider unknown trip count here), where th is calculated based on the parameter PARAM_MIN_VECT_LOOP_BOUND and the cost model, with the minimum value VF-1. If the loop needs to be versioned, then check_profitability with true value will be passed to vect_loop_versioning(), in which an enhanced loop bound check (considering cost) will be built. So I think if the loop is versioned and n < VF, then we must enter the scalar version, and in this case removing the epilogue should be safe when the maximum trip count is <= th+1. thanks, Cong Richard.
Re: [PATCH] Fix PR60505
On Thu, Mar 13, 2014 at 2:27 AM, Richard Biener rguent...@suse.de wrote: On Wed, 12 Mar 2014, Cong Hou wrote: Thank you for pointing it out. I didn't realize that alias analysis has an influence on this issue. The current problem is that the epilogue may be unnecessary if the loop bound cannot be larger than the number of iterations of the vectorized loop multiplied by VF when the vectorized loop is supposed to be executed. My method is incorrect because I assume the vectorized loop will be executed, which is actually guaranteed by the loop bound check (and also the alias checks). So if the alias checks exist, my method is fine as both conditions are met. But there is still the loop bound check which, if it fails, uses the epilogue loop as fallback, not the scalar versioned loop. The loop bound check is already performed together with the alias checks (assuming we need alias checks). Actually, I did observe that the loop bound check in the true body of the alias checks may be unnecessary. For example, for the following loop for (i = 0; i < num; ++i) out[i] = (ovec[i] = in[i]); GCC now generates the following GIMPLE code after vectorization: bb 3: // loop bound check (with cost model) and alias checks _29 = (unsigned int) num_5(D); _28 = _29 > 15; _24 = in_9(D) + 16; _23 = out_7(D) >= _24; _2 = out_7(D) + 16; _1 = _2 <= in_9(D); _32 = _1 | _23; _31 = _28 & _32; if (_31 != 0) goto bb 4; else goto bb 12; bb 4: niters.3_44 = (unsigned int) num_5(D); _46 = niters.3_44 + 4294967280; _47 = _46 >> 4; bnd.4_45 = _47 + 1; ratio_mult_vf.5_48 = bnd.4_45 << 4; _59 = (unsigned int) num_5(D); _60 = _59 + 4294967295; if (_60 <= 14) is this necessary? goto bb 10; else goto bb 5; The check _60 <= 14 should be unnecessary because it is implied by the fact _29 > 15 in bb 3. Considering this fact, if there are alias checks, we can safely remove the epilogue if the maximum trip count of the loop is less than or equal to the calculated threshold.
Cong If there are no alias checks, I must consider the possibility that the vectorized loop may not be executed at runtime, and then the epilogue should not be eliminated. The warning appears on the epilogue, and with loop bound checks (and without alias checks) the warning will be gone. So I think the key is alias checks: my method only works if there is no alias check. How about adding one more condition that checks if alias checks are needed, as the code shown below? else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) || (tree_ctz (LOOP_VINFO_NITERS (loop_vinfo)) < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo)) && (!LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo) || (unsigned HOST_WIDE_INT)max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo)) > (unsigned)th))) LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true; thanks, Cong On Wed, Mar 12, 2014 at 1:24 AM, Jakub Jelinek ja...@redhat.com wrote: On Tue, Mar 11, 2014 at 04:16:13PM -0700, Cong Hou wrote: This patch is fixing PR60505 in which the vectorizer may produce unnecessary epilogues. Bootstrapped and tested on an x86_64 machine. OK for trunk? That looks wrong. Consider the case where the loop isn't versioned; if you disable generation of the epilogue loop, you end up only with a vector loop. Say: unsigned char ovec[16] __attribute__((aligned (16))) = { 0 }; void foo (char *__restrict in, char *__restrict out, int num) { int i; in = __builtin_assume_aligned (in, 16); out = __builtin_assume_aligned (out, 16); for (i = 0; i < num; ++i) out[i] = (ovec[i] = in[i]); out[num] = ovec[num / 2]; } -O2 -ftree-vectorize. Now, consider if this function is called with num != 16 (num > 16 is of course invalid, but num 0 to 15 is valid and your patch will cause wrong code in this case). Jakub -- Richard Biener rguent...@suse.de SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imendorffer
Re: [PATCH] Fix PR60505
Thank you for pointing it out. I didn't realize that alias analysis has an influence on this issue. The current problem is that the epilogue may be unnecessary if the loop bound cannot be larger than the number of iterations of the vectorized loop multiplied by VF when the vectorized loop is supposed to be executed. My method is incorrect because I assume the vectorized loop will be executed, which is actually guaranteed by the loop bound check (and also the alias checks). So if the alias checks exist, my method is fine as both conditions are met. If there are no alias checks, I must consider the possibility that the vectorized loop may not be executed at runtime, and then the epilogue should not be eliminated. The warning appears on the epilogue, and with loop bound checks (and without alias checks) the warning will be gone. So I think the key is alias checks: my method only works if there is no alias check. How about adding one more condition that checks if alias checks are needed, as the code shown below? else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) || (tree_ctz (LOOP_VINFO_NITERS (loop_vinfo)) < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo)) && (!LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo) || (unsigned HOST_WIDE_INT)max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo)) > (unsigned)th))) LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true; thanks, Cong On Wed, Mar 12, 2014 at 1:24 AM, Jakub Jelinek ja...@redhat.com wrote: On Tue, Mar 11, 2014 at 04:16:13PM -0700, Cong Hou wrote: This patch is fixing PR60505 in which the vectorizer may produce unnecessary epilogues. Bootstrapped and tested on an x86_64 machine. OK for trunk? That looks wrong. Consider the case where the loop isn't versioned; if you disable generation of the epilogue loop, you end up only with a vector loop.
Say: unsigned char ovec[16] __attribute__((aligned (16))) = { 0 }; void foo (char *__restrict in, char *__restrict out, int num) { int i; in = __builtin_assume_aligned (in, 16); out = __builtin_assume_aligned (out, 16); for (i = 0; i < num; ++i) out[i] = (ovec[i] = in[i]); out[num] = ovec[num / 2]; } -O2 -ftree-vectorize. Now, consider if this function is called with num != 16 (num > 16 is of course invalid, but num 0 to 15 is valid and your patch will cause wrong code in this case). Jakub
[PATCH] Fix PR60505
This patch is fixing PR60505 in which the vectorizer may produce unnecessary epilogues. Bootstrapped and tested on an x86_64 machine. OK for trunk? thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index e1d8666..f98e628 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,9 @@ +2014-03-11 Cong Hou co...@google.com + + PR tree-optimization/60505 + * tree-vect-loop.c (vect_analyze_loop_2): Check the maximum number + of iterations of the loop and see if we should build the epilogue. + 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 41b6875..09ec1c0 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-03-11 Cong Hou co...@google.com + + PR tree-optimization/60505 + * gcc.dg/vect/pr60505.c: New test. + 2014-03-10 Jakub Jelinek ja...@redhat.com PR ipa/60457 diff --git a/gcc/testsuite/gcc.dg/vect/pr60505.c b/gcc/testsuite/gcc.dg/vect/pr60505.c new file mode 100644 index 000..6940513 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr60505.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-Wall -Werror" } */ + +void foo(char *in, char *out, int num) +{ + int i; + char ovec[16] = {0}; + + for (i = 0; i < num; ++i) +out[i] = (ovec[i] = in[i]); + out[num] = ovec[num/2]; +} diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index df6ab6f..2156d5f 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -1625,6 +1625,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo) bool ok, slp = false; int max_vf = MAX_VECTORIZATION_FACTOR; int min_vf = 2; + int th; /* Find all data references in the loop (which correspond to vdefs/vuses) and analyze their evolution in the loop. Also adjust the minimal @@ -1769,6 +1770,12 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo) /* Decide whether we need to create an epilogue loop to handle remaining scalar iterations.
*/ + th = MAX (PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND), 1) + * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; + th = MAX (th, LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo)) + 1; + th = (th / LOOP_VINFO_VECT_FACTOR (loop_vinfo)) + * LOOP_VINFO_VECT_FACTOR (loop_vinfo); + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0) { @@ -1779,7 +1786,9 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo) } else if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) || (tree_ctz (LOOP_VINFO_NITERS (loop_vinfo)) - < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo)))) + < (unsigned)exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo)) + && (unsigned HOST_WIDE_INT)max_stmt_executions_int + (LOOP_VINFO_LOOP (loop_vinfo)) > (unsigned)th)) LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = true; /* If an epilogue loop is required make sure we can create one. */
[GOOGLE] Emit a single unaligned load/store instruction for i386 m_GENERIC
This small patch lets GCC emit a single unaligned load/store instruction for m_GENERIC i386 CPUs. Bootstrapped and passed regression test. OK for Google branch? thanks, Cong Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 207701) +++ gcc/config/i386/i386.c (working copy) @@ -1903,10 +1903,10 @@ static unsigned int initial_ix86_tune_fe m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_AMDFAM10 | m_BDVER | m_GENERIC, /* X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL */ - m_COREI7 | m_COREI7_AVX | m_AMDFAM10 | m_BDVER | m_BTVER, + m_COREI7 | m_COREI7_AVX | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC, /* X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL */ - m_COREI7 | m_COREI7_AVX | m_BDVER, + m_COREI7 | m_COREI7_AVX | m_BDVER | m_GENERIC, /* X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL */ m_BDVER ,
[GOOGLE] Prevent x_flag_complex_method from being set to 2 for C++.
With this patch, x_flag_complex_method won't be set to 2 for C++, so that multiply/divide between std::complex objects won't be replaced by expensive builtin function calls. Bootstrapped and passed regression test. OK for Google branch? thanks, Cong Index: gcc/c-family/c-opts.c === --- gcc/c-family/c-opts.c (revision 207701) +++ gcc/c-family/c-opts.c (working copy) @@ -204,8 +204,10 @@ c_common_init_options_struct (struct gcc_options *opts) opts->x_warn_write_strings = c_dialect_cxx (); opts->x_flag_warn_unused_result = true; - /* By default, C99-like requirements for complex multiply and divide. */ - opts->x_flag_complex_method = 2; + /* By default, C99-like requirements for complex multiply and divide. + But for C++ this should not be required. */ + if (c_language != clk_cxx) + opts->x_flag_complex_method = 2; } /* Common initialization before calling option handlers. */
Re: [PATCH] Fixing PR60000: A bug in the vectorizer.
On Fri, Jan 31, 2014 at 5:06 AM, Jakub Jelinek ja...@redhat.com wrote: On Fri, Jan 31, 2014 at 09:41:59AM +0100, Richard Biener wrote: Is that because si and pattern_def_si point to the same stmts? Then I'd prefer to do if (is_store) { ... pattern_def_seq = NULL; } else if (!transform_pattern_stmt && gsi_end_p (pattern_def_si)) { pattern_def_seq = NULL; gsi_next (si); } Yeah, I think stores can only appear at the end of patterns, so IMHO it should be safe to just clear pattern_def_seq always in that case. Right now the code has continue; separately for STMT_VINFO_GROUPED_ACCESS (stmt_info) and for !STMT_VINFO_GROUPED_ACCESS (stmt_info) stores, but I guess you should just move them at the end of if (is_store) and clear pattern_def_seq there before the continue. Add gcc_assert (!transform_pattern_stmt); too? I agree. I have updated the patch accordingly. Bootstrapped and tested on x86_64. OK for the trunk? thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 95a324c..cabcaf8 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2014-01-30 Cong Hou co...@google.com + + PR tree-optimization/60000 + * tree-vect-loop.c (vect_transform_loop): Set pattern_def_seq to NULL + if the vectorized statement is a store. A store statement can only + appear at the end of pattern statements. + 2014-01-27 Jakub Jelinek ja...@redhat.com PR bootstrap/59934 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index fa61d5c..f2ce70f 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-01-30 Cong Hou co...@google.com + + PR tree-optimization/60000 + * g++.dg/vect/pr60000.cc: New test. + 2014-01-27 Christian Bruel christian.br...@st.com * gcc.target/sh/torture/strncmp.c: New tests. 
diff --git a/gcc/testsuite/g++.dg/vect/pr60000.cc b/gcc/testsuite/g++.dg/vect/pr60000.cc new file mode 100644 index 000..fe39d6a --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/pr60000.cc @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-fno-tree-vrp" } */ + +void foo (bool* a, int* b) +{ + for (int i = 0; i < 1000; ++i) +{ + a[i] = i % 2; + b[i] = i % 3; +} +} + +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 69c8d21..0e162cb 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -6053,7 +6053,6 @@ vect_transform_loop (loop_vec_info loop_vinfo) the chain. */ gsi_next (si); vect_remove_stores (GROUP_FIRST_ELEMENT (stmt_info)); - continue; } else { @@ -6063,11 +6062,13 @@ vect_transform_loop (loop_vec_info loop_vinfo) unlink_stmt_vdef (store); gsi_remove (si, true); release_defs (store); - continue; } - } - if (!transform_pattern_stmt && gsi_end_p (pattern_def_si)) + /* Stores can only appear at the end of pattern statements. */ + gcc_assert (!transform_pattern_stmt); + pattern_def_seq = NULL; + } + else if (!transform_pattern_stmt && gsi_end_p (pattern_def_si)) { pattern_def_seq = NULL; gsi_next (si); Jakub
Re: [PATCH] Fixing PR60000: A bug in the vectorizer.
Wrong format. Sending it again. On Thu, Jan 30, 2014 at 4:57 PM, Cong Hou co...@google.com wrote: Hi PR60000 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60000) is caused by the GCC vectorizer. The bug appears when handling vectorization patterns. When a pattern statement has additional new statements stored in pattern_def_seq in vect_transform_loop(), those statements are vectorized before the pattern statement. Once all those statements are handled, pattern_def_seq is set to NULL. However, if the pattern statement is a store, pattern_def_seq will not be set to NULL. In consequence, the next pattern statement will not have the correct pattern_def_seq. This bug can be fixed by nullifying pattern_def_seq before checking if the vectorized statement is a store. The patch is pasted below. Bootstrapped and tested on x86_64. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 95a324c..9df0d34 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2014-01-30 Cong Hou co...@google.com + + PR tree-optimization/60000 + * tree-vect-loop.c (vect_transform_loop): Set pattern_def_seq to NULL + before checking if the vectorized statement is a store. A store + statement can be a pattern one. + 2014-01-27 Jakub Jelinek ja...@redhat.com PR bootstrap/59934 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index fa61d5c..f2ce70f 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-01-30 Cong Hou co...@google.com + + PR tree-optimization/60000 + * g++.dg/vect/pr60000.cc: New test. + 2014-01-27 Christian Bruel christian.br...@st.com * gcc.target/sh/torture/strncmp.c: New tests. 
diff --git a/gcc/testsuite/g++.dg/vect/pr60000.cc b/gcc/testsuite/g++.dg/vect/pr60000.cc new file mode 100644 index 000..8a8bd22 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/pr60000.cc @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-fno-tree-vrp" } */ + +void foo (bool* a, int* b) +{ + for (int i = 0; i < 1000; ++i) +{ + a[i] = i % 2; + b[i] = i % 3; +} +} + +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 69c8d21..8c8bece 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -6044,6 +6044,10 @@ vect_transform_loop (loop_vec_info loop_vinfo) grouped_store = false; is_store = vect_transform_stmt (stmt, si, grouped_store, NULL, NULL); + + if (!transform_pattern_stmt && gsi_end_p (pattern_def_si)) + pattern_def_seq = NULL; + if (is_store) { if (STMT_VINFO_GROUPED_ACCESS (stmt_info)) @@ -6068,10 +6072,7 @@ vect_transform_loop (loop_vec_info loop_vinfo) } if (!transform_pattern_stmt && gsi_end_p (pattern_def_si)) - { - pattern_def_seq = NULL; - gsi_next (si); - } + gsi_next (si); } /* stmts in BB */ } /* BBs in loop */
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
I noticed that LIM could not hoist vector invariants, and that is why my first implementation tries to hoist them all. In addition, there are two disadvantages of the hoisting-invariant-load + LIM method: First, for some instructions the scalar version is faster than the vector version, and in this case hoisting scalar instructions before vectorization is better. Those instructions include data packing/unpacking, integer multiplication with SSE2, etc. Second, it may use more SIMD registers. The following code shows a simple example: char *a, *b, *c; for (int i = 0; i < N; ++i) a[i] = b[0] * c[0] + a[i]; Vectorizing b[0]*c[0] is worse than loading the result of b[0]*c[0] into a vector. thanks, Cong On Mon, Jan 13, 2014 at 5:37 AM, Richard Biener rguent...@suse.de wrote: On Wed, 27 Nov 2013, Jakub Jelinek wrote: On Wed, Nov 27, 2013 at 10:53:56AM +0100, Richard Biener wrote: Hmm. I'm still thinking that we should handle this during the regular transform step. I wonder if it can't be done instead just in vectorizable_load, if LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo) and the load is invariant, just emit the (broadcasted) load not inside of the loop, but on the loop preheader edge. So this implements this suggestion, XFAILing the no longer handled cases. For example we get _94 = *b_8(D); vect_cst_.18_95 = {_94, _94, _94, _94}; _99 = prolog_loop_adjusted_niters.9_132 * 4; vectp_a.22_98 = a_6(D) + _99; ivtmp.43_77 = (unsigned long) vectp_a.22_98; <bb 13>: # ivtmp.41_67 = PHI <ivtmp.41_70(3), 0(12)> # ivtmp.43_71 = PHI <ivtmp.43_69(3), ivtmp.43_77(12)> vect__10.19_97 = vect_cst_.18_95 + { 1, 1, 1, 1 }; _76 = (void *) ivtmp.43_71; MEM[base: _76, offset: 0B] = vect__10.19_97; ... instead of having hoisted *b_8 + 1 as scalar computation. Not sure why LIM doesn't hoist the vector variant later. vect__10.19_97 = vect_cst_.18_95 + vect_cst_.20_96; invariant up to level 1, cost 1. ah, the cost thing. 
Should be improved to see that hoisting reduces the number of live SSA names in the loop. Eventually lower_vector_ssa could optimize vector to scalar code again ... (ick). Bootstrap / regtest running on x86_64. Comments? Thanks, Richard. 2014-01-13 Richard Biener rguent...@suse.de PR tree-optimization/58921 PR tree-optimization/59006 * tree-vect-loop-manip.c (vect_loop_versioning): Remove code hoisting invariant stmts. * tree-vect-stmts.c (vectorizable_load): Insert the splat of invariant loads on the preheader edge if possible. * gcc.dg/torture/pr58921.c: New testcase. * gcc.dg/torture/pr59006.c: Likewise. * gcc.dg/vect/pr58508.c: XFAIL no longer handled cases. Index: gcc/tree-vect-loop-manip.c === *** gcc/tree-vect-loop-manip.c (revision 206576) --- gcc/tree-vect-loop-manip.c (working copy) *** vect_loop_versioning (loop_vec_info loop *** 2435,2507 } } - - /* Extract load statements on memrefs with zero-stride accesses. */ - - if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) - { - /* In the loop body, we iterate each statement to check if it is a load. - Then we check the DR_STEP of the data reference. If DR_STEP is zero, - then we will hoist the load statement to the loop preheader. */ - - basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); - int nbbs = loop->num_nodes; - - for (int i = 0; i < nbbs; ++i) - { - for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]); - !gsi_end_p (si);) - { - gimple stmt = gsi_stmt (si); - stmt_vec_info stmt_info = vinfo_for_stmt (stmt); - struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); - - if (is_gimple_assign (stmt) - && (!dr - || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr))))) - { - bool hoist = true; - ssa_op_iter iter; - tree var; - - /* We hoist a statement if all SSA uses in it are defined - outside of the loop. 
*/ - FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE) - { - gimple def = SSA_NAME_DEF_STMT (var); - if (!gimple_nop_p (def) - && flow_bb_inside_loop_p (loop, gimple_bb (def))) - { - hoist = false; - break; - } - } - - if (hoist) - { - if (dr) - gimple_set_vuse (stmt, NULL); - - gsi_remove (si, false); - gsi_insert_on_edge_immediate (loop_preheader_edge
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Ping? thanks, Cong On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou co...@google.com wrote: Hi Richard Could you please take a look at this patch and see if it is ready for the trunk? The patch is pasted as a text file here again. Thank you very much! Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote: Hi James Sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh james.greenha...@arm.com wrote: On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure other elements in the vector should remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior as usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Hi Cong, Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. 
Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = PLUS_EXPR (tmp2, arg3) or: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = WIDEN_SUM_EXPR (tmp2, arg3) Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3. I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def). Also, while looking for the history of DOT_PROD_EXPR I spotted this patch: [autovect] [patch] detect mult-hi and sad patterns http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html I wonder what the reason was for that patch to be dropped? It has been 8 years. I have no idea why that patch was not accepted in the end; there is not even a reply in that thread. But I believe the SAD pattern is very important to recognize. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong Thanks, James
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Ping? thanks, Cong On Mon, Dec 2, 2013 at 5:02 PM, Cong Hou co...@google.com wrote: Any comment on this patch? thanks, Cong On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou co...@google.com wrote: On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: While I added the new define_insn_and_split for vec_merge, a bug is exposed: in config/i386/sse.md, [ define_expand "xop_vmfrcz<mode>2" ] only takes one input, but the corresponding builtin functions have two inputs, which are shown in i386.c: { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, "__builtin_ia32_vfrczss", IX86_BUILTIN_VFRCZSS, UNKNOWN, (int)MULTI_ARG_2_SF }, { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, "__builtin_ia32_vfrczsd", IX86_BUILTIN_VFRCZSD, UNKNOWN, (int)MULTI_ARG_2_DF }, In consequence, the ix86_expand_multi_arg_builtin() function tries to check two args, but based on the define_expand of xop_vmfrcz<mode>2, the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be incorrect (because it only needs one input). The patch below fixes this issue. Bootstrapped and tested on an x86-64 machine. Note that this patch should be applied before the one I sent earlier (sorry for sending them in the wrong order). This is PR 56788. Your patch seems strange to me and I don't think it fixes the real issue, but I'll let more knowledgeable people answer. Thank you for pointing out the bug report. This patch is not intended to fix PR56788. IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the associated builtin, which would solve your issue as well. I agree. Then I will wait until your patch is merged to the trunk, otherwise my patch could not pass the test. 
For your function: #include <x86intrin.h> __m128d f(__m128d x, __m128d y){ return _mm_frcz_sd(x,y); } Note that the second parameter is ignored intentionally, but the prototype of this function contains two parameters. My fix explicitly tells GCC that the optab xop_vmfrczv4sf3 should have three operands instead of two, to let it have the correct information in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2], which is used to match the type of the second parameter of the builtin function in ix86_expand_multi_arg_builtin(). I disagree that this is intentional, it is a bug. AFAIK there is no AMD documentation that could be used as a reference for what _mm_frcz_sd is supposed to do. The only existing documentations are by Microsoft (which does *not* ignore the second argument) and by LLVM (which has a single argument). Whatever we choose for _mm_frcz_sd, the builtin should take a single argument, and if necessary we'll use 2 builtins to implement _mm_frcz_sd. I also only found the one by Microsoft. If the second argument is ignored, we could just remove it, as long as there is no standard that requires two arguments. Hopefully it won't break current projects using _mm_frcz_sd. Thank you for your comments! Cong -- Marc Glisse
Re: [PATCH] Enhancing the widen-mult pattern in vectorization.
After further reviewing this patch, I found I don't have to change the code in tree-vect-stmts.c to allow further type conversion after the widen-mult operation. Instead, I detect the following pattern in vect_recog_widen_mult_pattern(): T1 a, b; ai = (T2) a; bi = (T2) b; c = ai * bi; where T2 is more than double the size of T1 (e.g. T1 is char and T2 is int). In this case I just create a new type T3 whose size is double the size of T1, then get an intermediate result of type T3 from widen-mult. Then I add a new statement to STMT_VINFO_PATTERN_DEF_SEQ converting the result into type T2. This strategy makes the patch cleaner. Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index f298c0b..12990b2 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-12-02 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance + the widen-mult pattern by handling two operands with different + sizes, and operands whose size is smaller than half of the result + type. + 2013-11-22 Jakub Jelinek ja...@redhat.com PR sanitizer/59061 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 12d2c90..611ae1c 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-12-02 Cong Hou co...@google.com + + * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test. + * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test. 
+ 2013-11-22 Jakub Jelinek ja...@redhat.com * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c new file mode 100644 index 000..9f9081b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c @@ -0,0 +1,48 @@ +/* { dg-require-effective-target vect_int } */ + +#include <stdarg.h> +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +int result[N]; + +/* unsigned char * short -> int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i<len; i++) { +result[i] = X[i] * Y[i]; + } +} + +int main (void) +{ + int i; + + check_vect (); + + for (i=0; i<N; i++) { +X[i] = i; +Y[i] = 64-i; +__asm__ volatile (""); + } + + foo1 (N); + + for (i=0; i<N; i++) { +if (result[i] != X[i] * Y[i]) + abort (); + } + + return 0; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */ +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ +/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ + diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c new file mode 100644 index 000..12c4692 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c @@ -0,0 +1,48 @@ +/* { dg-require-effective-target vect_int } */ + +#include <stdarg.h> +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned int result[N]; + +/* unsigned char ->
unsigned int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i<len; i++) { +result[i] = X[i] * Y[i]; + } +} + +int main (void) +{ + int i; + + check_vect (); + + for (i=0; i<N; i++) { +X[i] = i; +Y[i] = 64-i; +__asm__ volatile (""); + } + + foo1 (N); + + for (i=0; i<N; i++) { +if (result[i] != X[i] * Y[i]) + abort (); + } + + return 0; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_qi_to_hi || vect_unpack } } } } */ +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_qi_to_hi_pattern } } } */ +/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_qi_to_hi_pattern } } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ + diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c index 7823cc3..f412e2d 100644 --- a/gcc/tree-vect-patterns.c +++ b/gcc/tree-vect-patterns.c @@ -529,7 +529,8 @@ vect_handle_widen_op_by_const (gimple stmt, enum tree_code code, Try to find the following pattern: - type a_t, b_t; + type1 a_t; + type2 b_t; TYPE a_T, b_T, prod_T; S1 a_t = ; @@ -538,11 +539,12 @@ vect_handle_widen_op_by_const (gimple stmt, enum tree_code code, S4 b_T = (TYPE) b_t; S5 prod_T = a_T * b_T; - where type 'TYPE' is at least double the size of type 'type'. + where type 'TYPE' is at least
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Hi Richard You mentioned that Micha has a patch pending that enables vectorization of zero-step stores. What is the status of that patch? I could not find it by searching for Micha. Thank you! Cong On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener rguent...@suse.de wrote: On Tue, 15 Oct 2013, Cong Hou wrote: Thank you for your reminder, Jeff! I just noticed Richard's comment. I have modified the patch according to that. The new patch is attached. (posting patches inline is easier for review, now you have to deal with no quoting markers ;)) Comments inline. diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..2637309 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-10-15 Cong Hou co...@google.com + + * tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant + statement that contains data refs with zero-step. + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..9d0f4a5 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-15 Cong Hou co...@google.com + + * gcc.dg/vect/pr58508.c: New test. + 2013-10-14 Tobias Burnus bur...@net-b.de PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c new file mode 100644 index 000..cb22b50 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ + + +/* The GCC vectorizer generates loop versioning for the following loop + since there may exist aliasing between A and B. The predicate checks + if A may alias with B across all iterations. Then for the loop in + the true body, we can assert that *B is a loop invariant so that + we can hoist the load of *B before the loop body. 
*/ +void foo (int* a, int* b) +{ + int i; + for (i = 0; i < 10; ++i) +a[i] = *b + 1; +} + + +/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 574446a..f4fdec2 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo, adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi)); } Note that applying this kind of transform at this point invalidates some of the earlier analysis the vectorizer performed (namely the def-kind, which now effectively gets vect_external_def instead of vect_internal_def). In this case it doesn't seem to cause any issues (we re-compute the def-kind every time we need it (how wasteful)). + /* Extract load and store statements on pointers with zero-stride + accesses. */ + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) +{ + /* In the loop body, we iterate each statement to check if it is a load +or store. Then we check the DR_STEP of the data reference. If +DR_STEP is zero, then we will hoist the load statement to the loop +preheader, and move the store statement to the loop exit. */ We don't move the store yet. Micha has a patch pending that enables vectorization of zero-step stores. + for (gimple_stmt_iterator si = gsi_start_bb (loop->header); + !gsi_end_p (si);) While technically ok now (vectorized loops contain a single basic block) please use LOOP_VINFO_BBS () to get at the vector of basic-blocks and iterate over them like other code does. 
+ { + gimple stmt = gsi_stmt (si); + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); + + if (dr && integer_zerop (DR_STEP (dr))) + { + if (DR_IS_READ (dr)) + { + if (dump_enabled_p ()) + { + dump_printf_loc + (MSG_NOTE, vect_location, + "hoist the statement to outside of the loop "); "hoisting out of the vectorized loop: " + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); + dump_printf (MSG_NOTE, "\n"); + } + + gsi_remove (si, false); + gsi_insert_on_edge_immediate (loop_preheader_edge (loop), stmt); Note that this will result in a bogus VUSE on the stmt at this point which will be only fixed because of implementation details of loop versioning. Either get the correct VUSE from the loop header virtual PHI node preheader edge (if there is none then the current VUSE is the correct one to use) or clear it. + } + /* TODO: We also consider vectorizing loops containing zero-step
[PATCH] Enhancing the widen-mult pattern in vectorization.
Hi The current widen-mult pattern only considers two operands with the same size. However, operands with different sizes can also benefit from this pattern. The following loop shows such an example: char a[N]; short b[N]; int c[N]; for (int i = 0; i < N; ++i) c[i] = a[i] * b[i]; In this case, we can convert a[i] into short type then perform widen-mult on b[i] and the converted value: for (int i = 0; i < N; ++i) { short t = a[i]; c[i] = t w* b[i]; } This patch adds such support. In addition, the following loop fails to be recognized as a widen-mult pattern because the widening operation from char to int is not directly supported by the target: char a[N], b[N]; int c[N]; for (int i = 0; i < N; ++i) c[i] = a[i] * b[i]; In this case, we can still perform widen-mult on a[i] and b[i], get a result of short type, then convert it to int: char a[N], b[N]; int c[N]; for (int i = 0; i < N; ++i) { short t = a[i] w* b[i]; c[i] = (int) t; } Currently GCC does not allow multi-step conversions for binary widening operations. This patch removes that restriction and uses VEC_UNPACK_LO_EXPR/VEC_UNPACK_HI_EXPR to arrange data after the widen-mult is performed for the widen-mult pattern. This can reduce several unpacking instructions (for this example, the number of packings/unpackings is reduced from 12 to 8. For SSE2, the inefficient multiplication between two V4SI vectors can also be avoided). Bootstrapped and tested on an x86_64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index f298c0b..44ed204 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,12 @@ +2013-12-02 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_widen_mult_pattern): Enhance + the widen-mult pattern by handling two operands with different + sizes. + * tree-vect-stmts.c (vectorizable_conversion): Allow multi-steps + conversions after widening mult operation. + (supportable_widening_operation): Likewise. 
+ 2013-11-22 Jakub Jelinek ja...@redhat.com PR sanitizer/59061 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 12d2c90..611ae1c 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-12-02 Cong Hou co...@google.com + + * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: New test. + * gcc.dg/vect/vect-widen-mult-u8-u32.c: New test. + 2013-11-22 Jakub Jelinek ja...@redhat.com * c-c++-common/asan/no-redundant-instrumentation-7.c: Fix diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c new file mode 100644 index 000..9f9081b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c @@ -0,0 +1,48 @@ +/* { dg-require-effective-target vect_int } */ + +#include <stdarg.h> +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +short Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +int result[N]; + +/* unsigned char * short -> int widening-mult. 
*/ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i<len; i++) { +result[i] = X[i] * Y[i]; + } +} + +int main (void) +{ + int i; + + check_vect (); + + for (i=0; i<N; i++) { +X[i] = i; +Y[i] = 64-i; +__asm__ volatile (""); + } + + foo1 (N); + + for (i=0; i<N; i++) { +if (result[i] != X[i] * Y[i]) + abort (); + } + + return 0; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */ +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ +/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ + diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c new file mode 100644 index 000..51e9178 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-u32.c @@ -0,0 +1,48 @@ +/* { dg-require-effective-target vect_int } */ + +#include <stdarg.h> +#include "tree-vect.h" + +#define N 64 + +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__))); +unsigned int result[N]; + +/* unsigned char -> unsigned int widening-mult. */ +__attribute__ ((noinline)) int +foo1(int len) { + int i; + + for (i=0; i<len; i++) { +result[i] = X[i] * Y[i]; + } +} + +int main (void) +{ + int i; + + check_vect (); + + for (i=0; i<N; i++) { +X[i] = i; +Y[i] = 64-i; +__asm__ volatile (""); + } + + foo1 (N); + + for (i=0; i<N; i++) { +if (result[i] != X[i] * Y[i]) + abort (); + } + + return 0; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_qi_to_hi || vect_unpack } } } } */ +/* { dg-final { scan-tree-dump-times
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Any comment on this patch? thanks, Cong On Fri, Nov 22, 2013 at 11:40 AM, Cong Hou co...@google.com wrote: On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: While I added the new define_insn_and_split for vec_merge, a bug is exposed: in config/i386/sse.md, [ define_expand xop_vmfrczmode2 ] only takes one input, but the corresponding builtin functions have two inputs, which are shown in i386.c: { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, __builtin_ia32_vfrczss, IX86_BUILTIN_VFRCZSS, UNKNOWN, (int)MULTI_ARG_2_SF }, { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, __builtin_ia32_vfrczsd, IX86_BUILTIN_VFRCZSD, UNKNOWN, (int)MULTI_ARG_2_DF }, In consequence, the ix86_expand_multi_arg_builtin() function tries to check two args but based on the define_expand of xop_vmfrczmode2, the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be incorrect (because it only needs one input). The patch below fixed this issue. Bootstrapped and tested on ax x86-64 machine. Note that this patch should be applied before the one I sent earlier (sorry for sending them in wrong order). This is PR 56788. Your patch seems strange to me and I don't think it fixes the real issue, but I'll let more knowledgeable people answer. Thank you for pointing out the bug report. This patch is not intended to fix PR56788. IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the associated builtin, which would solve your issue as well. I agree. Then I will wait until your patch is merged to the trunk, otherwise my patch could not pass the test. 
For your function: #include <x86intrin.h> __m128d f(__m128d x, __m128d y){ return _mm_frcz_sd(x,y); } Note that the second parameter is ignored intentionally, but the prototype of this function contains two parameters. My fix is explicitly telling GCC that the optab xop_vmfrczv4sf3 should have three operands instead of two, to let it have the correct information in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to match the type of the second parameter in the builtin function in ix86_expand_multi_arg_builtin(). I disagree that this is intentional, it is a bug. AFAIK there is no AMD documentation that could be used as a reference for what _mm_frcz_sd is supposed to do. The only existing documentations are by Microsoft (which does *not* ignore the second argument) and by LLVM (which has a single argument). Whatever we chose for _mm_frcz_sd, the builtin should take a single argument, and if necessary we'll use 2 builtins to implement _mm_frcz_sd. I also only found the one by Microsoft. If the second argument is ignored, we could just remove it, as long as there is no standard that requires two arguments. Hopefully it won't break current projects using _mm_frcz_sd. Thank you for your comments! Cong -- Marc Glisse
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Hi Richard Could you please take a look at this patch and see if it is ready for the trunk? The patch is pasted as a text file here again. Thank you very much! Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote: Hi James Sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh james.greenha...@arm.com wrote: On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure other elements in the vector should remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Hi Cong, Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = PLUS_EXPR (tmp2, arg3) or: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = WIDEN_SUM_EXPR (tmp2, arg3) Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3.
I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def). Also, while looking for the history of DOT_PROD_EXPR I spotted this patch: [autovect] [patch] detect mult-hi and sad patterns http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html I wonder what the reason was for that patch to be dropped? It has been 8 years.. I have no idea why this patch is not accepted finally. There is even no reply in that thread. But I believe the SAD pattern is very important to be recognized. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong Thanks, James diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6bdaa31..37ff6c4 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,4 +1,24 @@ -2013-11-01 Trevor Saunders tsaund...@mozilla.com +2013-10-29 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. + * doc/generic.texi: Add document for SAD_EXPR. + * doc/md.texi: Add document for ssad and usad. * function.c (reorder_blocks): Convert block_stack to a stack_vec. * gimplify.c (gimplify_compound_lval): Likewise. 
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index fb05ce7..1f824fb 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case WIDEN_MULT_PLUS_EXPR: case WIDEN_MULT_MINUS_EXPR: case FMA_EXPR: diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 9094a1c..af73817 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -7278,6 +7278,36 @@ DONE; }) +(define_expand usadv16qi + [(match_operand:V4SI 0 register_operand) + (match_operand:V16QI 1 register_operand) + (match_operand:V16QI 2 nonimmediate_operand) + (match_operand:V4SI 3 nonimmediate_operand)] + TARGET_SSE2 +{ + rtx t1 = gen_reg_rtx (V2DImode); + rtx t2
Re: [PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
On Wed, Nov 27, 2013 at 1:53 AM, Richard Biener rguent...@suse.de wrote: On Fri, 22 Nov 2013, Cong Hou wrote: Hi Currently in GCC vectorization, some loop invariant may be detected after aliasing checks, which can be hoisted outside of the loop. The current method in GCC may break the information built during the analysis phase, causing some crash (see PR59006 and PR58921). This patch improves the loop invariant hoisting by delaying it until all statements are vectorized, thereby keeping all built information. But those loop invariant statements won't be vectorized, and if a variable is defined by one of those loop invariants, it is treated as an external definition. Bootstrapped and tested on an x86-64 machine. Hmm. I'm still thinking that we should handle this during the regular transform step. Like with the following incomplete patch. Missing is adjusting the rest of the vectorizable_* functions to handle the case where all defs are dt_external or constant by setting their own STMT_VINFO_DEF_TYPE to dt_external. From the gcc.dg/vect/pr58508.c we get only 4 hoists instead of 8 because of this (I think). Also gcc.dg/vect/pr52298.c ICEs for a yet unanalyzed reason. I can take over the bug if you like. Thanks, Richard. Index: gcc/tree-vect-data-refs.c === *** gcc/tree-vect-data-refs.c (revision 205435) --- gcc/tree-vect-data-refs.c (working copy) *** again: *** 3668,3673 --- 3668,3682 } STMT_VINFO_STRIDE_LOAD_P (stmt_info) = true; } + else if (loop_vinfo + && integer_zerop (DR_STEP (dr))) + { + /* All loads from a non-varying address will be disambiguated +by data-ref analysis or via a runtime alias check and thus +they will become invariant. Force them to be vectorized +as external. */ + STMT_VINFO_DEF_TYPE (stmt_info) = vect_external_def; + } } /* If we stopped analysis at the first dataref we could not analyze I agree that setting the statement that loads a data-ref with zero step as vect_external_def early at this point is a good idea.
This avoids two loop analyses seeing inconsistent def-info if we do this later. Note with this change the following loop in PR59006 will not be vectorized: int a[8], b; void fn1(void) { int c; for (; b; b++) { int d = a[b]; c = a[0] ? d : 0; a[b] = c; } } This is because the load from a[0] is now treated as an external def, in which case vectype cannot be found for the condition of the conditional expression, while vectorizable_condition requires that comp_vectype should be set properly. We can treat it as a missed optimization. Index: gcc/tree-vect-loop-manip.c === *** gcc/tree-vect-loop-manip.c (revision 205435) --- gcc/tree-vect-loop-manip.c (working copy) *** vect_loop_versioning (loop_vec_info loop *** 2269,2275 /* Extract load statements on memrefs with zero-stride accesses. */ ! if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) { /* In the loop body, we iterate each statement to check if it is a load. Then we check the DR_STEP of the data reference. If DR_STEP is zero, --- 2269,2275 /* Extract load statements on memrefs with zero-stride accesses. */ ! if (0 && LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) { /* In the loop body, we iterate each statement to check if it is a load. Then we check the DR_STEP of the data reference. If DR_STEP is zero, Index: gcc/tree-vect-loop.c === *** gcc/tree-vect-loop.c(revision 205435) --- gcc/tree-vect-loop.c(working copy) *** vect_transform_loop (loop_vec_info loop_ *** 5995,6000 --- 5995,6020 } } + /* If the stmt is loop invariant simply move it.
*/ + if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_external_def) + { + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + hoisting out of the vectorized loop: ); + dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); + dump_printf (MSG_NOTE, \n); + } + gsi_remove (si, false); + if (gimple_vuse (stmt)) + gimple_set_vuse (stmt, NULL); + basic_block new_bb; + new_bb = gsi_insert_on_edge_immediate (loop_preheader_edge (loop), +stmt); + gcc_assert (!new_bb); + continue; + } + /* vectorize statement */ if (dump_enabled_p
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 22, 2013 at 1:32 AM, Uros Bizjak ubiz...@gmail.com wrote: Hello! In consequence, the ix86_expand_multi_arg_builtin() function tries to check two args but based on the define_expand of xop_vmfrcz<mode>2, the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be incorrect (because it only needs one input). ;; scalar insns -(define_expand xop_vmfrcz<mode>2 +(define_expand xop_vmfrcz<mode>3 [(set (match_operand:VF_128 0 register_operand) (vec_merge:VF_128 (unspec:VF_128 [(match_operand:VF_128 1 nonimmediate_operand)] UNSPEC_FRCZ) - (match_dup 3) + (match_operand:VF_128 2 register_operand) (const_int 1)))] TARGET_XOP { - operands[3] = CONST0_RTX (<MODE>mode); + operands[2] = CONST0_RTX (<MODE>mode); }) No, just use (match_dup 2) in the RTX in addition to the operands[2] change. Do not rename patterns. If I use match_dup 2, GCC still thinks this optab has one input argument instead of two, which won't fix the current issue. Marc suggested we should remove the second argument. This also works. Thank you! Cong Uros.
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 22, 2013 at 3:57 AM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: While I added the new define_insn_and_split for vec_merge, a bug is exposed: in config/i386/sse.md, [ define_expand xop_vmfrczmode2 ] only takes one input, but the corresponding builtin functions have two inputs, which are shown in i386.c: { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, __builtin_ia32_vfrczss, IX86_BUILTIN_VFRCZSS, UNKNOWN, (int)MULTI_ARG_2_SF }, { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, __builtin_ia32_vfrczsd, IX86_BUILTIN_VFRCZSD, UNKNOWN, (int)MULTI_ARG_2_DF }, In consequence, the ix86_expand_multi_arg_builtin() function tries to check two args but based on the define_expand of xop_vmfrczmode2, the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be incorrect (because it only needs one input). The patch below fixed this issue. Bootstrapped and tested on ax x86-64 machine. Note that this patch should be applied before the one I sent earlier (sorry for sending them in wrong order). This is PR 56788. Your patch seems strange to me and I don't think it fixes the real issue, but I'll let more knowledgeable people answer. Thank you for pointing out the bug report. This patch is not intended to fix PR56788. IMHO, if PR56788 was fixed, you wouldn't have this issue, and if PR56788 doesn't get fixed, I'll post a patch to remove _mm_frcz_sd and the associated builtin, which would solve your issue as well. I agree. Then I will wait until your patch is merged to the trunk, otherwise my patch could not pass the test. For your function: #include x86intrin.h __m128d f(__m128d x, __m128d y){ return _mm_frcz_sd(x,y); } Note that the second parameter is ignored intentionally, but the prototype of this function contains two parameters. 
My fix is explicitly telling GCC that the optab xop_vmfrczv4sf3 should have three operands instead of two, to let it have the correct information in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to match the type of the second parameter in the builtin function in ix86_expand_multi_arg_builtin(). I disagree that this is intentional, it is a bug. AFAIK there is no AMD documentation that could be used as a reference for what _mm_frcz_sd is supposed to do. The only existing documentations are by Microsoft (which does *not* ignore the second argument) and by LLVM (which has a single argument). Whatever we chose for _mm_frcz_sd, the builtin should take a single argument, and if necessary we'll use 2 builtins to implement _mm_frcz_sd. I also only found the one by Microsoft.. If the second argument is ignored, we could just remove it, as long as there is no standard that requires two arguments. Hopefully it won't break current projects using _mm_frcz_sd. Thank you for your comments! Cong -- Marc Glisse
[PATCH] Fixing PR59006 and PR58921 by delaying loop invariant hoisting in vectorizer.
Hi Currently in GCC vectorization, some loop invariant may be detected after aliasing checks, which can be hoisted outside of the loop. The current method in GCC may break the information built during the analysis phase, causing some crash (see PR59006 and PR58921). This patch improves the loop invariant hoisting by delaying it until all statements are vectorized, thereby keeping all built information. But those loop invariant statements won't be vectorized, and if a variable is defined by one of those loop invariants, it is treated as an external definition. Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2c0554b..0614bab 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,18 @@ +2013-11-22 Cong Hou co...@google.com + + PR tree-optimization/58921 + PR tree-optimization/59006 + * tree-vectorizer.h (struct _stmt_vec_info): New data member + loop_invariant. + * tree-vect-loop-manip.c (vect_loop_versioning): Delay hoisting loop + invariants until all statements are vectorized. + * tree-vect-loop.c (vect_hoist_loop_invariants): New function. + (vect_transform_loop): Hoist loop invariants after all statements + are vectorized. Do not vectorize loop invariant stmts. + * tree-vect-stmts.c (vect_get_vec_def_for_operand): Treat a loop + invariant as an external definition. + (new_stmt_vec_info): Initialize new data member. + 2013-11-12 Jeff Law l...@redhat.com * tree-ssa-threadedge.c (thread_around_empty_blocks): New diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 09c7f20..447625b 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,10 @@ +2013-11-22 Cong Hou co...@google.com + + PR tree-optimization/58921 + PR tree-optimization/59006 + * gcc.dg/vect/pr58921.c: New test. + * gcc.dg/vect/pr59006.c: New test. + 2013-11-12 Balaji V.
Iyer balaji.v.i...@intel.com * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running diff --git a/gcc/testsuite/gcc.dg/vect/pr58921.c b/gcc/testsuite/gcc.dg/vect/pr58921.c new file mode 100644 index 000..ee3694a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58921.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ + +int a[7]; +int b; + +void +fn1 () +{ + for (; b; b++) +a[b] = ((a[b] <= 0) == (a[0] != 0)); +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr59006.c b/gcc/testsuite/gcc.dg/vect/pr59006.c new file mode 100644 index 000..95d90a9 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr59006.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ + +int a[8], b; + +void fn1 (void) +{ + int c; + for (; b; b++) +{ + int d = a[b]; + c = a[0] ? d : 0; + a[b] = c; +} +} + +void fn2 () +{ + for (; b >= 0; b++) +a[b] = a[0] || b; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */ diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 15227856..3adc73d 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2448,8 +2448,12 @@ vect_loop_versioning (loop_vec_info loop_vinfo, FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE) { gimple def = SSA_NAME_DEF_STMT (var); + stmt_vec_info def_stmt_info; + if (!gimple_nop_p (def) - && flow_bb_inside_loop_p (loop, gimple_bb (def))) + && flow_bb_inside_loop_p (loop, gimple_bb (def)) + && !((def_stmt_info = vinfo_for_stmt (def)) + && STMT_VINFO_LOOP_INVARIANT_P (def_stmt_info))) { hoist = false; break; @@ -2458,21 +2462,8 @@ vect_loop_versioning (loop_vec_info loop_vinfo, if (hoist) { - if (dr) - gimple_set_vuse (stmt, NULL); - - gsi_remove (si, false); - gsi_insert_on_edge_immediate (loop_preheader_edge (loop), -stmt); - - if
(dump_enabled_p ()) - { - dump_printf_loc - (MSG_NOTE, vect_location, - hoisting out of the vectorized loop: ); - dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0); - dump_printf (MSG_NOTE, \n); - } + STMT_VINFO_LOOP_INVARIANT_P (stmt_info) = true; + gsi_next (si); continue; } } @@ -2481,6 +2472,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo, } } + /* End loop-exit-fixes after versioning. */ if (cond_expr_stmt_list) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 292e771..148f9f1 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -5572,6 +5572,49 @@ vect_loop_kill_debug_uses (struct loop *loop, gimple stmt) } } +/* Find all loop invariants detected after alias checks, and hoist them + before the loop preheader. */ + +static void +vect_hoist_loop_invariants (loop_vec_info loop_vinfo) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Thu, Nov 21, 2013 at 4:39 PM, Marc Glisse marc.gli...@inria.fr wrote: On Thu, 21 Nov 2013, Cong Hou wrote: While I added the new define_insn_and_split for vec_merge, a bug is exposed: in config/i386/sse.md, [ define_expand xop_vmfrczmode2 ] only takes one input, but the corresponding builtin functions have two inputs, which are shown in i386.c: { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv4sf2, __builtin_ia32_vfrczss, IX86_BUILTIN_VFRCZSS, UNKNOWN, (int)MULTI_ARG_2_SF }, { OPTION_MASK_ISA_XOP, CODE_FOR_xop_vmfrczv2df2, __builtin_ia32_vfrczsd, IX86_BUILTIN_VFRCZSD, UNKNOWN, (int)MULTI_ARG_2_DF }, In consequence, the ix86_expand_multi_arg_builtin() function tries to check two args but based on the define_expand of xop_vmfrczmode2, the content of insn_data[CODE_FOR_xop_vmfrczv4sf2].operand[2] may be incorrect (because it only needs one input). The patch below fixed this issue. Bootstrapped and tested on ax x86-64 machine. Note that this patch should be applied before the one I sent earlier (sorry for sending them in wrong order). This is PR 56788. Your patch seems strange to me and I don't think it fixes the real issue, but I'll let more knowledgeable people answer. Thank you for pointing out the bug report. This patch is not intended to fix PR56788. For your function: #include x86intrin.h __m128d f(__m128d x, __m128d y){ return _mm_frcz_sd(x,y); } Note that the second parameter is ignored intentionally, but the prototype of this function contains two parameters. My fix is explicitly telling GCC that the optab xop_vmfrczv4sf3 should have three operands instead of two, to let it have the correct information in insn_data[CODE_FOR_xop_vmfrczv4sf3].operand[2] which is used to match the type of the second parameter in the builtin function in ix86_expand_multi_arg_builtin(). thanks, Cong -- Marc Glisse
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Ping... thanks, Cong On Fri, Nov 15, 2013 at 9:52 AM, Cong Hou co...@google.com wrote: Any more comments? thanks, Cong On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou co...@google.com wrote: Ping? thanks, Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote: Hi James Sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh james.greenha...@arm.com wrote: On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure other elements in the vector should remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior as usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Hi Cong, Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. 
Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = PLUS_EXPR (tmp2, arg3) or: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = WIDEN_SUM_EXPR (tmp2, arg3) Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a a value of the same (widened) type as arg3. I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def). Also, while looking for the history of DOT_PROD_EXPR I spotted this patch: [autovect] [patch] detect mult-hi and sad patterns http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html I wonder what the reason was for that patch to be dropped? It has been 8 years.. I have no idea why this patch is not accepted finally. There is even no reply in that thread. But I believe the SAD pattern is very important to be recognized. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong Thanks, James
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Tue, Nov 19, 2013 at 1:45 AM, Richard Biener rguent...@suse.de wrote: On Mon, 18 Nov 2013, Cong Hou wrote: I tried your method and it works well for doubles. But for float, there is an issue. For the following gimple code: c1 = a - b; c2 = a + b; c = VEC_PERM <c1, c2, [0,5,2,7]> It needs two instructions to implement the VEC_PERM operation in SSE2-4, one of which should be using shufps which is represented by the following pattern in rtl: (define_insn sse_shufps_<mode> [(set (match_operand:VI4F_128 0 register_operand =x,x) (vec_select:VI4F_128 (vec_concat:<ssedoublevecmode> (match_operand:VI4F_128 1 register_operand 0,x) (match_operand:VI4F_128 2 nonimmediate_operand xm,xm)) (parallel [(match_operand 3 const_0_to_3_operand) (match_operand 4 const_0_to_3_operand) (match_operand 5 const_4_to_7_operand) (match_operand 6 const_4_to_7_operand)])))] ...) Note that it contains two rtl instructions. It's a single instruction as far as combine is concerned (RTL instructions have arbitrary complexity). Even if it is one instruction, we will end up with four rtl statements, which still cannot be combined as there are restrictions on combining four instructions (loads of constants or binary operations involving a constant). Note that vec_select instead of vec_merge is used here because currently vec_merge is emitted only if SSE4 is enabled (thus blend instructions can be used. If you look at ix86_expand_vec_perm_const_1() in i386.c, you can find that vec_merge is generated in expand_vec_perm_1() with SSE4.). Without SSE4 support, in most cases a vec_merge statement could not be translated by one SSE instruction. Together with minus, plus, and one more shuffling instruction, we have at least five instructions for the addsub pattern. I think during the combine pass, only four instructions are considered to be combined, right? So unless we compress those five instructions into four or less, we could not use this method for float values.
At the moment addsubv4sf looks like (define_insn sse3_addsubv4sf3 [(set (match_operand:V4SF 0 register_operand =x,x) (vec_merge:V4SF (plus:V4SF (match_operand:V4SF 1 register_operand 0,x) (match_operand:V4SF 2 nonimmediate_operand xm,xm)) (minus:V4SF (match_dup 1) (match_dup 2)) (const_int 10)))] to match this it's best to have the VEC_SHUFFLE retained as vec_merge and thus support arbitrary(?) vec_merge for the aid of combining until reload(?) after which we can split it. You mean VEC_PERM (this is generated in gimple from your patch)? Note as I mentioned above, without SSE4, it is difficult to translate VEC_PERM into vec_merge. Even if we can do it, we still need to define a split to convert one vec_merge into two or more other statements later. ADDSUB instructions are provided by SSE3 and I think we should not rely on SSE4 to perform this transformation, right? To sum up, if we use vec_select instead of vec_merge, we may have four rtl statements for float types, in which case they cannot be combined. If we use vec_merge, we need to define the split for it without SSE4 support, and we also need to change the behavior of ix86_expand_vec_perm_const_1(). What do you think? Besides addsub, are there other instructions that can be expressed similarly? Thus, how far should the combiner pattern go? I think your method is quite flexible. Besides blending add/sub, we could blend other combinations of two operations, and even one operation and a no-op. For example, consider vectorizing the complex conjugate operation: for (int i = 0; i < N; i+=2) { a[i] = b[i]; a[i+1] = -b[i+1]; } This loop is better vectorized by hybrid SLP. The second statement has a unary minus operation but there is no operation in the first one. We can improve our SLP grouping algorithm to let GCC SLP vectorize it. thanks, Cong Richard.
thanks, Cong On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener rguent...@suse.de wrote: On Thu, 14 Nov 2013, Cong Hou wrote: Hi This patch adds the support to two non-isomorphic operations addsub and subadd for SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the code of the operation used in vectorized code is stored and later will be used during statement transformation. Two new GIMPLE opeartions VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also new optabs for them. They are also documented. The target supports for SSE/SSE2/SSE3/AVX are added for those two new operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS instructions. For SSE/SSE2, those two operations are emulated using two instructions (selectively negate then add). With this patch the following
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
I tried your method and it works well for doubles. But for float, there is an issue. For the following gimple code: c1 = a - b; c2 = a + b; c = VEC_PERM <c1, c2, [0,5,2,7]> It needs two instructions to implement the VEC_PERM operation in SSE2-4, one of which should be using shufps which is represented by the following pattern in rtl: (define_insn sse_shufps_<mode> [(set (match_operand:VI4F_128 0 register_operand =x,x) (vec_select:VI4F_128 (vec_concat:<ssedoublevecmode> (match_operand:VI4F_128 1 register_operand 0,x) (match_operand:VI4F_128 2 nonimmediate_operand xm,xm)) (parallel [(match_operand 3 const_0_to_3_operand) (match_operand 4 const_0_to_3_operand) (match_operand 5 const_4_to_7_operand) (match_operand 6 const_4_to_7_operand)])))] ...) Note that it contains two rtl instructions. Together with minus, plus, and one more shuffling instruction, we have at least five instructions for the addsub pattern. I think during the combine pass, only four instructions are considered to be combined, right? So unless we compress those five instructions into four or less, we could not use this method for float values. What do you think? thanks, Cong On Fri, Nov 15, 2013 at 12:53 AM, Richard Biener rguent...@suse.de wrote: On Thu, 14 Nov 2013, Cong Hou wrote: Hi This patch adds support for two non-isomorphic operations addsub and subadd to the SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the code of the operation used in vectorized code is stored and later will be used during statement transformation. Two new GIMPLE operations VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR are defined. And also new optabs for them. They are also documented. The target supports for SSE/SSE2/SSE3/AVX are added for those two new operations on floating points. SSE3/AVX provides ADDSUBPD and ADDSUBPS instructions.
For SSE/SSE2, those two operations are emulated using two instructions (selectively negate then add). With this patch the following function will be SLP vectorized: float a[4], b[4], c[4]; // double also OK. void subadd () { c[0] = a[0] - b[0]; c[1] = a[1] + b[1]; c[2] = a[2] - b[2]; c[3] = a[3] + b[3]; } void addsub () { c[0] = a[0] + b[0]; c[1] = a[1] - b[1]; c[2] = a[2] + b[2]; c[3] = a[3] - b[3]; } Bootstrapped and tested on an x86-64 machine. I managed to do this without adding new tree codes or optabs by vectorizing the above as c1 = a + b; c2 = a - b; c = VEC_PERM <c1, c2, the-proper-mask> which then matches sse3_addsubv4sf3 if you fix that pattern to not use vec_merge (or fix PR56766). Doing it this way also means that the code is vectorizable if you don't have a HW instruction for that but can do the VEC_PERM efficiently. So, I'd like to avoid new tree codes and optabs whenever possible and here I've already proved (with a patch) that it is possible. Didn't have time to clean it up, and it likely doesn't apply anymore (and PR56766 blocks it but it even has a patch). Btw, this was PR56902 where I attached my patch. Richard. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2c0554b..656d5fb 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,31 @@ +2013-11-14 Cong Hou co...@google.com + + * tree-vect-slp.c (vect_create_new_slp_node): Initialize + SLP_TREE_OP_CODE. + (slp_supported_non_isomorphic_op): New function. Check if the + non-isomorphic operation is supported or not. + (vect_build_slp_tree_1): Consider non-isomorphic operations. + (vect_build_slp_tree): Change argument. + * tree-vect-stmts.c (vectorizable_operation): Consider the opcode + for non-isomorphic operations. + * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs. + * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations. + * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and + VEC_SUBADD_EXPR. + * gimple-pretty-print.c (dump_binary_rhs): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (verify_gimple_assign_binary): Likewise. + * tree-vectorizer.h (struct _slp_tree): New data member. + * config/i386/i386-protos.h (ix86_sse_expand_fp_addsub_operator): + New function. Expand addsub/subadd operations for SSE2. + * config/i386/i386.c (ix86_sse_expand_fp_addsub_operator): Likewise. + * config/i386/sse.md (UNSPEC_SUBADD, UNSPEC_ADDSUB): New RTL operations. + (vec_subadd_v4sf3, vec_subadd_v2df3, vec_subadd_<mode>3, + vec_addsub_v4sf3, vec_addsub_v2df3, vec_addsub_<mode>3): + Expand addsub/subadd operations for SSE/SSE2/SSE3/AVX. + * doc/generic.texi (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New doc. + * doc/md.texi (vec_addsub_@var{m}3, vec_subadd_@var{m}3): New doc. + 2013-11-12 Jeff Law l...@redhat.com * tree-ssa-threadedge.c (thread_around_empty_blocks): New diff --git a/gcc/config/i386/i386-protos.h
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 15, 2013 at 1:20 AM, Uros Bizjak ubiz...@gmail.com wrote: Hello! This patch adds support for two non-isomorphic operations, addsub and subadd, to the SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the code of the operation used in vectorized code is stored and later will be used during statement transformation. Two new GIMPLE operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along with new optabs for them. They are also documented. Target support for SSE/SSE2/SSE3/AVX is added for those two new operations on floating-point vectors. SSE3/AVX provides ADDSUBPD and ADDSUBPS instructions. For SSE/SSE2, those two operations are emulated using two instructions (selectively negate then add). ;; SSE3 UNSPEC_LDDQU + UNSPEC_SUBADD + UNSPEC_ADDSUB No! Please avoid unspecs. OK, got it. +(define_expand "vec_subadd_v4sf3" + [(set (match_operand:V4SF 0 "register_operand") + (unspec:V4SF + [(match_operand:V4SF 1 "register_operand") + (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))] + "TARGET_SSE" +{ + if (TARGET_SSE3) +emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], operands[2])); + else +ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands); + DONE; +}) Make the expander pattern look like the corresponding sse3 insn and: ... { if (!TARGET_SSE3) { ix86_sse_expand_fp_...(); DONE; } } You mean I should write two expanders for SSE and SSE3 respectively? Thank you for your comment! Cong Uros.
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Fri, Nov 15, 2013 at 10:18 AM, Richard Earnshaw rearn...@arm.com wrote: On 15/11/13 02:06, Cong Hou wrote: Hi This patch adds support for two non-isomorphic operations, addsub and subadd, to the SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the code of the operation used in vectorized code is stored and later will be used during statement transformation. Two new GIMPLE operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along with new optabs for them. They are also documented. Notwithstanding what Richi has already said on this subject, you certainly don't need both VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR. The latter can always be formed by vec-negating the second operand and passing it to VEC_ADDSUB_EXPR. Right. But I also considered targets without support for addsub instructions. There we could still selectively negate odd/even elements using masks and then use PLUS_EXPR (at most 2 instructions). If I implement VEC_SUBADD_EXPR by negating the second operand and then using VEC_ADDSUB_EXPR, I end up with one more instruction. thanks, Cong R.
Re: [PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
On Mon, Nov 18, 2013 at 12:27 PM, Uros Bizjak ubiz...@gmail.com wrote: On Mon, Nov 18, 2013 at 9:15 PM, Cong Hou co...@google.com wrote: This patch adds support for two non-isomorphic operations, addsub and subadd, to the SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the code of the operation used in vectorized code is stored and later will be used during statement transformation. Two new GIMPLE operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along with new optabs for them. They are also documented. Target support for SSE/SSE2/SSE3/AVX is added for those two new operations on floating-point vectors. SSE3/AVX provides ADDSUBPD and ADDSUBPS instructions. For SSE/SSE2, those two operations are emulated using two instructions (selectively negate then add). +(define_expand "vec_subadd_v4sf3" + [(set (match_operand:V4SF 0 "register_operand") + (unspec:V4SF + [(match_operand:V4SF 1 "register_operand") + (match_operand:V4SF 2 "nonimmediate_operand")] UNSPEC_SUBADD))] + "TARGET_SSE" +{ + if (TARGET_SSE3) +emit_insn (gen_sse3_addsubv4sf3 (operands[0], operands[1], operands[2])); + else +ix86_sse_expand_fp_addsub_operator (true, V4SFmode, operands); + DONE; +}) Make the expander pattern look like the corresponding sse3 insn and: ... { if (!TARGET_SSE3) { ix86_sse_expand_fp_...(); DONE; } } You mean I should write two expanders for SSE and SSE3 respectively? No, please use the same approach as you did for the abs<mode>2 expander. For !TARGET_SSE3, call the helper function (ix86_sse_expand...), otherwise expand through the pattern. Also, it looks to me that you should partially expand in the pattern before calling the helper function, mainly to avoid a bunch of if (...) at the beginning of the helper function. I know what you mean. Then I have to change the pattern being detected for sse3_addsubv4sf3, so that it can handle ADDSUB_EXPR for SSE3. 
Currently I am considering using Richard's method without creating new tree nodes and optabs, based on pattern matching. I will handle SSE2 and SSE3 separately by define_expand and define_insn. The current problem is that the pattern may contain more than four instructions which cannot be processed by the combine pass. I am considering how to reduce the number of instructions in the pattern to four. Thank you very much! Cong Uros.
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Any more comments? thanks, Cong On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou co...@google.com wrote: Ping? thanks, Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote: Hi James, Sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh james.greenha...@arm.com wrote: On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure the other elements in the vector remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Hi Cong, Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = PLUS_EXPR (tmp2, arg3) or: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = WIDEN_SUM_EXPR (tmp2, arg3) Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3. 
I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def). Also, while looking for the history of DOT_PROD_EXPR I spotted this patch: [autovect] [patch] detect mult-hi and sad patterns http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html I wonder what the reason was for that patch to be dropped? It has been 8 years... I have no idea why that patch was never accepted; there was not even a reply in that thread. But I believe the SAD pattern is very important to recognize. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong Thanks, James
Re: [PATCH] Do not set flag_complex_method to 2 for C++ by default.
See the following code: #include <complex> using std::complex; template<typename _Tp, typename _Up> complex<_Tp>& mult_assign (complex<_Tp>& __y, const complex<_Up>& __z) { _Up _M_real = __y.real(); _Up _M_imag = __y.imag(); const _Tp __r = _M_real * __z.real() - _M_imag * __z.imag(); _M_imag = _M_real * __z.imag() + _M_imag * __z.real(); _M_real = __r; return __y; } void foo (complex<float>& c1, complex<float>& c2) { c1 *= c2; } void bar (complex<float>& c1, complex<float>& c2) { mult_assign(c1, c2); } The function mult_assign is written almost by copying the implementation of operator *= from <complex>. They have exactly the same behavior from the view of the source code. However, the compiled results of foo() and bar() are different: foo() uses a builtin function for the multiplication but bar() does not. Just because of a name change the final behavior changes? This is not how a compiler should work. thanks, Cong On Thu, Nov 14, 2013 at 10:17 AM, Andrew Pinski pins...@gmail.com wrote: On Thu, Nov 14, 2013 at 8:25 AM, Xinliang David Li davi...@google.com wrote: Can we revisit the decision for this? Here are the reasons: 1) It seems that the motivation to make C++ consistent with c99 is to avoid confusing users who build the C source with both C and C++ compilers. Why should C++'s default behavior be tuned for this niche case? It is not a niche case. It is confusing for people who write C++ code to rewrite their code to C99 and find that C is much slower because of correctness? I think they have this backwards here. C++ should be consistent with C here. 2) It is very confusing for users who see huge performance difference between compiler generated code for Complex multiplication vs manually expanded code I don't see why this is an issue if they understand how complex multiplication works for correctness. I am sorry but correctness over speed is a good argument of why this should stay this way. 
3) The default setting can also block potential vectorization opportunities for complex operations Yes so again this is about a correctness issue over a speed issue. 4) GCC is about the only compiler which has this default -- very few users know about GCC's strict default, and will think GCC performs poorly. Correctness over speed is better. I am sorry GCC is the only one which gets it correct here. If people don't like it there is a flag to disable it. Thanks, Andrew Pinski thanks, David On Wed, Nov 13, 2013 at 9:07 PM, Andrew Pinski pins...@gmail.com wrote: On Wed, Nov 13, 2013 at 5:26 PM, Cong Hou co...@google.com wrote: This patch is for PR58963. In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html, the builtin function is used to perform complex multiplication and division. This is to comply with the C99 standard, but I am wondering if C++ also needs this. There is no complex keyword in C++, and no content in the C++ standard about the behavior of operations on complex types. The complex header file is all written in source code, including complex multiplication and division. GCC should not do too much for them by using builtin calls by default (although we can set -fcx-limited-range to prevent GCC doing this), which has a big impact on performance (there may exist vectorization opportunities). In this patch flag_complex_method will not be set to 2 for C++. Bootstrapped and tested on an x86-64 machine. I think you need to look into this issue deeper as the original patch only enabled it for C99: http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01483.html . Just a little deeper digging will find http://gcc.gnu.org/ml/gcc/2007-07/msg00124.html which says yes, C++ needs this. 
Thanks, Andrew Pinski thanks, Cong Index: gcc/c-family/c-opts.c === --- gcc/c-family/c-opts.c (revision 204712) +++ gcc/c-family/c-opts.c (working copy) @@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc opts->x_warn_write_strings = c_dialect_cxx (); opts->x_flag_warn_unused_result = true; - /* By default, C99-like requirements for complex multiply and divide. */ - opts->x_flag_complex_method = 2; + /* By default, C99-like requirements for complex multiply and divide. + But for C++ this should not be required. */ + if (c_language != clk_cxx && c_language != clk_objcxx) +opts->x_flag_complex_method = 2; } /* Common initialization before calling option handlers. */ Index: gcc/c-family/ChangeLog === --- gcc/c-family/ChangeLog (revision 204712) +++ gcc/c-family/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-11-13 Cong Hou co...@google.com + + * c-opts.c (c_common_init_options_struct): Don't let C++ comply with + C99-like requirements for complex multiply and divide. + 2013-11-12 Joseph Myers jos...@codesourcery.com * c-common.c (c_common_reswords): Add _Thread_local.
[PATCH] Support addsub/subadd as non-isomorphic operations for SLP vectorizer.
Hi This patch adds support for two non-isomorphic operations, addsub and subadd, to the SLP vectorizer. More non-isomorphic operations can be added later, but the limitation is that operations on even/odd elements should still be isomorphic. Once such an operation is detected, the code of the operation used in vectorized code is stored and later will be used during statement transformation. Two new GIMPLE operations, VEC_ADDSUB_EXPR and VEC_SUBADD_EXPR, are defined, along with new optabs for them. They are also documented. Target support for SSE/SSE2/SSE3/AVX is added for those two new operations on floating-point vectors. SSE3/AVX provides ADDSUBPD and ADDSUBPS instructions. For SSE/SSE2, those two operations are emulated using two instructions (selectively negate then add). With this patch the following function will be SLP vectorized: float a[4], b[4], c[4]; // double also OK. void subadd () { c[0] = a[0] - b[0]; c[1] = a[1] + b[1]; c[2] = a[2] - b[2]; c[3] = a[3] + b[3]; } void addsub () { c[0] = a[0] + b[0]; c[1] = a[1] - b[1]; c[2] = a[2] + b[2]; c[3] = a[3] - b[3]; } Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2c0554b..656d5fb 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,31 @@ +2013-11-14 Cong Hou co...@google.com + + * tree-vect-slp.c (vect_create_new_slp_node): Initialize + SLP_TREE_OP_CODE. + (slp_supported_non_isomorphic_op): New function. Check if the + non-isomorphic operation is supported or not. + (vect_build_slp_tree_1): Consider non-isomorphic operations. + (vect_build_slp_tree): Change argument. + * tree-vect-stmts.c (vectorizable_operation): Consider the opcode + for non-isomorphic operations. + * optabs.def (vec_addsub_optab, vec_subadd_optab): New optabs. + * tree.def (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New operations. + * expr.c (expand_expr_real_2): Add support to VEC_ADDSUB_EXPR and + VEC_SUBADD_EXPR. + * gimple-pretty-print.c (dump_binary_rhs): Likewise. 
+ * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (verify_gimple_assign_binary): Likewise. + * tree-vectorizer.h (struct _slp_tree): New data member. + * config/i386/i386-protos.h (ix86_sse_expand_fp_addsub_operator): + New function. Expand addsub/subadd operations for SSE2. + * config/i386/i386.c (ix86_sse_expand_fp_addsub_operator): Likewise. + * config/i386/sse.md (UNSPEC_SUBADD, UNSPEC_ADDSUB): New RTL operations. + (vec_subadd_v4sf3, vec_subadd_v2df3, vec_subadd_<mode>3, + vec_addsub_v4sf3, vec_addsub_v2df3, vec_addsub_<mode>3): + Expand addsub/subadd operations for SSE/SSE2/SSE3/AVX. + * doc/generic.texi (VEC_ADDSUB_EXPR, VEC_SUBADD_EXPR): New doc. + * doc/md.texi (vec_addsub_@var{m}3, vec_subadd_@var{m}3): New doc. + 2013-11-12 Jeff Law l...@redhat.com * tree-ssa-threadedge.c (thread_around_empty_blocks): New diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index fdf9d58..b02b757 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -117,6 +117,7 @@ extern rtx ix86_expand_adjust_ufix_to_sfix_si (rtx, rtx *); extern enum ix86_fpcmp_strategy ix86_fp_comparison_strategy (enum rtx_code); extern void ix86_expand_fp_absneg_operator (enum rtx_code, enum machine_mode, rtx[]); +extern void ix86_sse_expand_fp_addsub_operator (bool, enum machine_mode, rtx[]); extern void ix86_expand_copysign (rtx []); extern void ix86_split_copysign_const (rtx []); extern void ix86_split_copysign_var (rtx []); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 5287b49..76f38f5 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -18702,6 +18702,51 @@ ix86_expand_fp_absneg_operator (enum rtx_code code, enum machine_mode mode, emit_insn (set); } +/* Generate code for addsub or subadd on fp vectors for sse/sse2. The flag + SUBADD indicates if we are generating code for subadd or addsub. 
*/ + +void +ix86_sse_expand_fp_addsub_operator (bool subadd, enum machine_mode mode, +rtx operands[]) +{ + rtx mask; + rtx neg_mask32 = GEN_INT (0x8000); + rtx neg_mask64 = GEN_INT ((HOST_WIDE_INT)1 << 63); + + switch (mode) +{ +case V4SFmode: + if (subadd) + mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4, + neg_mask32, const0_rtx, neg_mask32, const0_rtx)); + else + mask = gen_rtx_CONST_VECTOR (V4SImode, gen_rtvec (4, + const0_rtx, neg_mask32, const0_rtx, neg_mask32)); + break; + +case V2DFmode: + if (subadd) + mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2, + neg_mask64, const0_rtx)); + else + mask = gen_rtx_CONST_VECTOR (V2DImode, gen_rtvec (2, + const0_rtx, neg_mask64)); + break; + +default: + gcc_unreachable (); +} + + rtx tmp = gen_reg_rtx (mode); + convert_move (tmp, mask, false); + + rtx tmp2 = gen_reg_rtx (mode); + tmp2 = expand_simple_binop (mode, XOR, tmp, operands[2], + tmp2, 0, OPTAB_DIRECT); + expand_simple_binop (mode, PLUS, operands[1], tmp2
[PATCH] Do not set flag_complex_method to 2 for C++ by default.
This patch is for PR58963. In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html, the builtin function is used to perform complex multiplication and division. This is to comply with the C99 standard, but I am wondering if C++ also needs this. There is no complex keyword in C++, and no content in the C++ standard about the behavior of operations on complex types. The complex header file is all written in source code, including complex multiplication and division. GCC should not do too much for them by using builtin calls by default (although we can set -fcx-limited-range to prevent GCC doing this), which has a big impact on performance (there may exist vectorization opportunities). In this patch flag_complex_method will not be set to 2 for C++. Bootstrapped and tested on an x86-64 machine. thanks, Cong Index: gcc/c-family/c-opts.c === --- gcc/c-family/c-opts.c (revision 204712) +++ gcc/c-family/c-opts.c (working copy) @@ -198,8 +198,10 @@ c_common_init_options_struct (struct gcc opts->x_warn_write_strings = c_dialect_cxx (); opts->x_flag_warn_unused_result = true; - /* By default, C99-like requirements for complex multiply and divide. */ - opts->x_flag_complex_method = 2; + /* By default, C99-like requirements for complex multiply and divide. + But for C++ this should not be required. */ + if (c_language != clk_cxx && c_language != clk_objcxx) +opts->x_flag_complex_method = 2; } /* Common initialization before calling option handlers. */ Index: gcc/c-family/ChangeLog === --- gcc/c-family/ChangeLog (revision 204712) +++ gcc/c-family/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-11-13 Cong Hou co...@google.com + + * c-opts.c (c_common_init_options_struct): Don't let C++ comply with + C99-like requirements for complex multiply and divide. + 2013-11-12 Joseph Myers jos...@codesourcery.com * c-common.c (c_common_reswords): Add _Thread_local.
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Ping? thanks, Cong On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou co...@google.com wrote: Hi James, Sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh james.greenha...@arm.com wrote: On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure the other elements in the vector remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Hi Cong, Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = PLUS_EXPR (tmp2, arg3) or: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = WIDEN_SUM_EXPR (tmp2, arg3) Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3. I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def). 
Also, while looking for the history of DOT_PROD_EXPR I spotted this patch: [autovect] [patch] detect mult-hi and sad patterns http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html I wonder what the reason was for that patch to be dropped? It has been 8 years... I have no idea why that patch was never accepted; there was not even a reply in that thread. But I believe the SAD pattern is very important to recognize. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong Thanks, James
Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
Hi Jakub Thank you for pointing it out. The updated patch is pasted below. I will pay attention to it in the future. thanks, Cong diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 3d9916d..32a6ff7 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-11-12 Cong Hou co...@google.com + + * gcc.dg/vect/pr58508.c: Remove dg-options as vect_int is indicated. + 2013-10-29 Cong Hou co...@google.com * gcc.dg/vect/pr58508.c: Update. diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c index fff7a04..c4921bb 100644 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -1,6 +1,5 @@ /* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ -/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ /* The GCC vectorizer generates loop versioning for the following loop On Tue, Nov 12, 2013 at 6:05 AM, Jakub Jelinek ja...@redhat.com wrote: On Thu, Nov 07, 2013 at 06:24:55PM -0800, Cong Hou wrote: Ping. OK for the trunk? On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou co...@google.com wrote: It seems that on some platforms the loops in testsuite/gcc.dg/vect/pr58508.c may be unable to be vectorized. This small patch added { dg-require-effective-target vect_int } to make sure all loops can be vectorized. diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 9d0f4a5..3d9916d 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-29 Cong Hou co...@google.com + + * gcc.dg/vect/pr58508.c: Update. + 2013-10-15 Cong Hou co...@google.com * gcc.dg/vect/pr58508.c: New test. 
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c index 6484a65..fff7a04 100644 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -1,3 +1,4 @@ +/* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ This isn't the only bug in the testcase. Another one is using dg-options in gcc.dg/vect/, you should just leave that out; the default options already include those options, but explicit dg-options mean that other required options like -msse2 on i?86 aren't added. Jakub
[PATCH] [Vectorization] Fixing a bug in alias checks merger.
The current alias check merger does not consider the DR_STEP of data-refs when sorting data-refs. For the following loop: for (i = 0; i < N; ++i) a[i] = b[0] + b[i] + b[1]; The data refs b[0] and b[i] have the same DR_INIT and DR_OFFSET, and after sorting the three DR pairs, the following order is a possible result: (a[i], b[0]), (a[i], b[i]), (a[i], b[1]) This prevents the alias checks for (a[i], b[0]) and (a[i], b[1]) from being merged. This patch adds a comparison between the DR_STEPs of two data refs during the sort. The test case is also updated. The previous one used explicit dg-options, which blocks the options from the target vect_int. The test case also assumes a vector can hold at least 4 integers of int type, which may not be true on some targets. The patch is pasted below. Bootstrapped and tested on an x86-64 machine. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2c0554b..5faa5ca 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,14 @@ +2013-11-12 Cong Hou co...@google.com + + * tree-vectorizer.h (struct dr_with_seg_len): Remove the base + address field as it can be obtained from dr. Rename the struct. + * tree-vect-data-refs.c (comp_dr_with_seg_len_pair): Consider + steps of data references during sort. + (vect_prune_runtime_alias_test_list): Adjust with the change to + struct dr_with_seg_len. + * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): + Adjust with the change to struct dr_with_seg_len. + 2013-11-12 Jeff Law l...@redhat.com * tree-ssa-threadedge.c (thread_around_empty_blocks): New diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 09c7f20..8075409 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-11-12 Cong Hou co...@google.com + + * gcc.dg/vect/vect-alias-check.c: Update. + 2013-11-12 Balaji V. 
Iyer balaji.v.i...@intel.com * gcc.dg/cilk-plus/cilk-plus.exp: Added a check for LTO before running diff --git a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c index 64a4e0c..c1bffed 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-alias-check.c +++ b/gcc/testsuite/gcc.dg/vect/vect-alias-check.c @@ -1,17 +1,17 @@ /* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ -/* { dg-options "-O2 -ftree-vectorize --param=vect-max-version-for-alias-checks=2 -fdump-tree-vect-details" } */ +/* { dg-additional-options "--param=vect-max-version-for-alias-checks=2" } */ -/* A test case showing three potential alias checks between - a[i] and b[i], b[i+7], b[i+14]. With alias checks merging - enabled, those tree checks can be merged into one, and the - loop will be vectorized with vect-max-version-for-alias-checks=2. */ +/* A test case showing four potential alias checks between a[i] and b[0], b[1], + b[i+1] and b[i+2]. With alias check merging enabled, those four checks + can be merged into two, and the loop will be vectorized with + vect-max-version-for-alias-checks=2. */ void foo (int *a, int *b) { int i; for (i = 0; i < 1000; ++i) -a[i] = b[i] + b[i+7] + b[i+14]; +a[i] = b[0] + b[1] + b[i+1] + b[i+2]; } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index c479775..7f0920d 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -2620,7 +2620,7 @@ vect_analyze_data_ref_accesses (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo) } -/* Operator == between two dr_addr_with_seg_len objects. +/* Operator == between two dr_with_seg_len objects. This equality operator is used to make sure two data refs are the same one so that we will consider to combine the @@ -2628,62 +2628,51 @@ vect_analyze_data_ref_accesses (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo) refs. 
*/ static bool -operator == (const dr_addr_with_seg_len &d1, - const dr_addr_with_seg_len &d2) +operator == (const dr_with_seg_len &d1, + const dr_with_seg_len &d2) { - return operand_equal_p (d1.basic_addr, d2.basic_addr, 0) - && compare_tree (d1.offset, d2.offset) == 0 - && compare_tree (d1.seg_len, d2.seg_len) == 0; + return operand_equal_p (DR_BASE_ADDRESS (d1.dr), + DR_BASE_ADDRESS (d2.dr), 0) + && compare_tree (d1.offset, d2.offset) == 0 + && compare_tree (d1.seg_len, d2.seg_len) == 0; } -/* Function comp_dr_addr_with_seg_len_pair. +/* Function comp_dr_with_seg_len_pair. - Comparison function for sorting objects of dr_addr_with_seg_len_pair_t + Comparison function for sorting objects of dr_with_seg_len_pair_t so that we can combine aliasing checks in one scan. */ static int -comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) +comp_dr_with_seg_len_pair (const void *p1_, const void *p2_) { - const dr_addr_with_seg_len_pair_t* p1 = -(const dr_addr_with_seg_len_pair_t *) p1_; - const dr_addr_with_seg_len_pair_t* p2 = -(const dr_addr_with_seg_len_pair_t *) p2_; - - const
Re: [PATCH] Bug fix for PR59050
Hi Jeff I have committed the fix. Please update your repo. Thank you! Cong On Mon, Nov 11, 2013 at 10:32 AM, Jeff Law l...@redhat.com wrote: On 11/11/13 02:32, Richard Biener wrote: On Fri, 8 Nov 2013, Cong Hou wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 This is my bad. I forget to check the test result for gfortran. With this patch the bug should be fixed (tested on x86-64). Ok. Btw, requirements are to bootstrap and test with all default languages enabled (that is, without any --enable-languages or --enable-languages=all). That gets you c,c++,objc,java,fortran,lto and misses obj-c++ ada and go. I am personally using --enable-languages=all,ada,obj-c++. FWIW, I bootstrapped with Cong's patch to keep my own test results clean. So it's already been through those tests. If Cong doesn't get to it soon, I'll check it in myself. jeff
Re: [PATCH] Bug fix for PR59050
Thank you for your advice! I will follow this instruction in the future. thanks, Cong On Mon, Nov 11, 2013 at 1:32 AM, Richard Biener rguent...@suse.de wrote: On Fri, 8 Nov 2013, Cong Hou wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 This is my bad. I forgot to check the test result for gfortran. With this patch the bug should be fixed (tested on x86-64). Ok. Btw, requirements are to bootstrap and test with all default languages enabled (that is, without any --enable-languages or --enable-languages=all). That gets you c,c++,objc,java,fortran,lto and misses obj-c++, ada and go. I am personally using --enable-languages=all,ada,obj-c++. Thanks, Richard. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 90b01f2..e62c672 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-11-08 Cong Hou co...@google.com + + PR tree-optimization/59050 + * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. + 2013-11-07 Cong Hou co...@google.com * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index b2a31b1..b7eb926 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) if (comp_res != 0) return comp_res; } - if (tree_int_cst_compare (p11.offset, p21.offset) < 0) + else if (tree_int_cst_compare (p11.offset, p21.offset) < 0) return -1; - if (tree_int_cst_compare (p11.offset, p21.offset) > 0) + else if (tree_int_cst_compare (p11.offset, p21.offset) > 0) return 1; if (TREE_CODE (p12.offset) != INTEGER_CST || TREE_CODE (p22.offset) != INTEGER_CST) @@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) if (comp_res != 0) return comp_res; } - if (tree_int_cst_compare (p12.offset, p22.offset) < 0) + else if (tree_int_cst_compare (p12.offset, p22.offset) < 0) return -1; - if (tree_int_cst_compare (p12.offset, p22.offset) > 0) + else if 
(tree_int_cst_compare (p12.offset, p22.offset) 0) return 1; return 0; -- Richard Biener rguent...@suse.de SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imend
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Hi James, sorry for the late reply. On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh james.greenha...@arm.com wrote: On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure the other elements in the vector remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior to usad. Actually I copied the text from there and made some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Hi Cong, Thanks for your reply. I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and DOT_PROD_EXPR and I see that the same ambiguity exists for DOT_PROD_EXPR. Can you please add a note in your tree.def that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = PLUS_EXPR (tmp2, arg3) or: tmp = WIDEN_MINUS_EXPR (arg1, arg2) tmp2 = ABS_EXPR (tmp) arg3 = WIDEN_SUM_EXPR (tmp2, arg3) Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a value of the same (widened) type as arg3. I have added it, although we currently don't have WIDEN_MINUS_EXPR (I mentioned it in tree.def).
Also, while looking for the history of DOT_PROD_EXPR I spotted this patch: [autovect] [patch] detect mult-hi and sad patterns http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html I wonder what the reason was for that patch to be dropped? It has been 8 years... I have no idea why this patch was never accepted; there is not even a reply in that thread. But I believe the SAD pattern is very important to recognize. ARM also provides instructions for it. Thank you for your comment again! thanks, Cong Thanks, James diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6bdaa31..37ff6c4 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,4 +1,24 @@ -2013-11-01 Trevor Saunders tsaund...@mozilla.com +2013-10-29 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. + * doc/generic.texi: Add document for SAD_EXPR. + * doc/md.texi: Add document for ssad and usad. * function.c (reorder_blocks): Convert block_stack to a stack_vec. * gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index fb05ce7..1f824fb 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case WIDEN_MULT_PLUS_EXPR: case WIDEN_MULT_MINUS_EXPR: case FMA_EXPR: diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 9094a1c..af73817 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -7278,6 +7278,36 @@ DONE; }) +(define_expand "usadv16qi" + [(match_operand:V4SI 0 "register_operand") + (match_operand:V16QI 1 "register_operand") + (match_operand:V16QI 2 "nonimmediate_operand") + (match_operand:V4SI 3 "nonimmediate_operand")] + "TARGET_SSE2" +{ + rtx t1 = gen_reg_rtx (V2DImode); + rtx t2 = gen_reg_rtx (V4SImode); + emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2])); + convert_move (t2, t1, 0); + emit_insn (gen_addv4si3 (operands[0], t2, operands[3])); + DONE; +}) + +(define_expand "usadv32qi" + [(match_operand:V8SI 0
[PATCH] Bug fix for PR59050
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 This is my bad. I forgot to check the test results for gfortran. With this patch the bug should be fixed (tested on x86-64). thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 90b01f2..e62c672 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-11-08 Cong Hou co...@google.com + + PR tree-optimization/59050 + * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. + 2013-11-07 Cong Hou co...@google.com * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index b2a31b1..b7eb926 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -2669,9 +2669,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) if (comp_res != 0) return comp_res; } - if (tree_int_cst_compare (p11.offset, p21.offset) < 0) + else if (tree_int_cst_compare (p11.offset, p21.offset) < 0) return -1; - if (tree_int_cst_compare (p11.offset, p21.offset) > 0) + else if (tree_int_cst_compare (p11.offset, p21.offset) > 0) return 1; if (TREE_CODE (p12.offset) != INTEGER_CST || TREE_CODE (p22.offset) != INTEGER_CST) @@ -2680,9 +2680,9 @@ comp_dr_addr_with_seg_len_pair (const void *p1_, const void *p2_) if (comp_res != 0) return comp_res; } - if (tree_int_cst_compare (p12.offset, p22.offset) < 0) + else if (tree_int_cst_compare (p12.offset, p22.offset) < 0) return -1; - if (tree_int_cst_compare (p12.offset, p22.offset) > 0) + else if (tree_int_cst_compare (p12.offset, p22.offset) > 0) return 1; return 0;
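To make the shape of this fix concrete, here is a hedged, self-contained C sketch (not GCC code; the struct and names are illustrative only): a three-way comparator whose keys may or may not be integer constants. The structural fallback must be an *else*-alternative to the numeric comparison; without the `else if`, the numeric branch would also run on non-constant keys, which is exactly the kind of misuse that triggered the ICE in PR59050.

```c
#include <string.h>

/* Illustrative stand-in for a data-ref offset: either a known integer
   constant or an opaque expression identified only by its text.  */
struct key { int is_const; int value; const char *repr; };

/* qsort-style three-way comparator.  Non-constant keys are ordered
   structurally; only when BOTH are constants do we compare values,
   and that comparison must live in the else-branch.  */
static int compare_keys (const struct key *a, const struct key *b)
{
  if (!a->is_const || !b->is_const)
    {
      int r = strcmp (a->repr, b->repr);  /* structural fallback */
      if (r != 0)
        return r;
    }
  else if (a->value < b->value)           /* constants: numeric order */
    return -1;
  else if (a->value > b->value)
    return 1;
  return 0;
}
```

The `else if` chain guarantees the value comparison is reached only when the guard above it rejected the non-constant case, mirroring the `if ... else if` restructuring in the patch.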
Re: [PATCH] Reducing number of alias checks in vectorization.
Thank you for the report. I have submitted a bug fix patch waiting to be reviewed. thanks, Cong On Fri, Nov 8, 2013 at 5:26 AM, Dominique Dhumieres domi...@lps.ens.fr wrote: According to http://gcc.gnu.org/ml/gcc-regression/2013-11/msg00197.html revision 204538 is breaking several tests. On x86_64-apple-darwin* the failures I have looked at are of the kind /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03: In function 'nabla2_cart2d': /opt/gcc/work/gcc/testsuite/gfortran.dg/typebound_operator_9.f03:272:0: internal compiler error: tree check: expected integer_cst, have plus_expr in tree_int_cst_lt, at tree.c:7083 function nabla2_cart2d (obj) TIA Dominique
Re: [PATCH] Bug fix for PR59050
Yes, I think so. The bug is that the arguments of tree_int_cst_compare() may not be constant integers. This patch should take care of it. thanks, Cong On Fri, Nov 8, 2013 at 12:06 PM, H.J. Lu hjl.to...@gmail.com wrote: On Fri, Nov 8, 2013 at 10:34 AM, Cong Hou co...@google.com wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 This is my bad. I forgot to check the test results for gfortran. With this patch the bug should be fixed (tested on x86-64). thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 90b01f2..e62c672 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-11-08 Cong Hou co...@google.com + + PR tree-optimization/59050 + * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. + Many SPEC CPU 2000 tests failed with costab.c: In function 'HandleCoinc2': costab.c:1565:17: internal compiler error: tree check: expected integer_cst, have plus_expr in tree_int_cst_lt, at tree.c:7083 void HandleCoinc2 ( cos1, cos2, hdfactor ) ^ 0xb6e084 tree_check_failed(tree_node const*, char const*, int, char const*, ...)
../../src-trunk/gcc/tree.c:9477 0xb6ffe4 tree_check ../../src-trunk/gcc/tree.h:2914 0xb6ffe4 tree_int_cst_lt(tree_node const*, tree_node const*) ../../src-trunk/gcc/tree.c:7083 0xb70020 tree_int_cst_compare(tree_node const*, tree_node const*) ../../src-trunk/gcc/tree.c:7093 0xe53f1c comp_dr_addr_with_seg_len_pair ../../src-trunk/gcc/tree-vect-data-refs.c:2672 0xe5cbb5 vec<dr_addr_with_seg_len_pair_t, va_heap, vl_embed>::qsort(int (*)(void const*, void const*)) ../../src-trunk/gcc/vec.h:941 0xe5cbb5 vec<dr_addr_with_seg_len_pair_t, va_heap, vl_ptr>::qsort(int (*)(void const*, void const*)) ../../src-trunk/gcc/vec.h:1620 0xe5cbb5 vect_prune_runtime_alias_test_list(_loop_vec_info*) ../../src-trunk/gcc/tree-vect-data-refs.c:2845 0xb39382 vect_analyze_loop_2 ../../src-trunk/gcc/tree-vect-loop.c:1716 0xb39382 vect_analyze_loop(loop*) ../../src-trunk/gcc/tree-vect-loop.c:1807 0xb4f78f vectorize_loops() ../../src-trunk/gcc/tree-vectorizer.c:360 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See http://gcc.gnu.org/bugs.html for instructions. specmake[3]: *** [costab.o] Error 1 specmake[3]: *** Waiting for unfinished jobs Will this patch fix them? -- H.J.
Re: [PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
Ping. OK for the trunk? thanks, Cong On Fri, Nov 1, 2013 at 10:47 AM, Cong Hou co...@google.com wrote: It seems that on some platforms the loops in testsuite/gcc.dg/vect/pr58508.c may not be vectorizable. This small patch adds { dg-require-effective-target vect_int } to make sure all loops can be vectorized. thanks, Cong diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 9d0f4a5..3d9916d 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-29 Cong Hou co...@google.com + + * gcc.dg/vect/pr58508.c: Update. + 2013-10-15 Cong Hou co...@google.com * gcc.dg/vect/pr58508.c: New test. diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c index 6484a65..fff7a04 100644 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -1,3 +1,4 @@ +/* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Now is this patch OK for the trunk? Thank you! thanks, Cong On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou co...@google.com wrote: Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure other elements in the vector should remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior as usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Thank you very much! Cong On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh james.greenha...@arm.com wrote: On Mon, Nov 04, 2013 at 06:30:55PM +, Cong Hou wrote: On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh james.greenha...@arm.com wrote: On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote: diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index 2a5a2e1..8f5d39a 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the product. The result is placed in operand 0, which is of the same mode as operand 3. 
+@cindex @code{ssad@var{m}} instruction pattern +@item @samp{ssad@var{m}} +@cindex @code{usad@var{m}} instruction pattern +@item @samp{usad@var{m}} +Compute the sum of absolute differences of two signed/unsigned elements. +Operand 1 and operand 2 are of the same mode. Their absolute difference, which +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode +equal or wider than the mode of the absolute difference. The result is placed +in operand 0, which is of the same mode as operand 3. + @cindex @code{ssum_widen@var{m3}} instruction pattern @item @samp{ssum_widen@var{m3}} @cindex @code{usum_widen@var{m3}} instruction pattern diff --git a/gcc/expr.c b/gcc/expr.c index 4975a64..1db8a49 100644 I'm not sure I follow, and if I do - I don't think it matches what you have implemented for i386. From your text description I would guess the series of operations to be: v1 = widen (operands[1]) v2 = widen (operands[2]) v3 = abs (v1 - v2) operands[0] = v3 + operands[3] But if I understand the behaviour of PSADBW correctly, what you have actually implemented is: v1 = widen (operands[1]) v2 = widen (operands[2]) v3 = abs (v1 - v2) v4 = reduce_plus (v3) operands[0] = v4 + operands[3] To my mind, synthesizing the reduce_plus step will be wasteful for targets who do not get this for free with their Absolute Difference step. Imagine a simple loop where we have synthesized the reduce_plus, we compute partial sums each loop iteration, though we would be better to leave the reduce_plus step until after the loop. REDUC_PLUS_EXPR would be the appropriate Tree code for this. What do you mean when you use synthesizing here? For each pattern, the only synthesized operation is the one being returned from the pattern recognizer. In this case, it is USAD_EXPR. The recognition of reduce sum is necessary as we need corresponding prolog and epilog for reductions, which is already done before pattern recognition. 
Note that reduction is not a pattern but is a type of vector definition. A vectorization pattern can still be a reduction operation as long as STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You can check the other two reduction patterns: widen_sum_pattern and dot_prod_pattern for reference. My apologies for not being clear. What I mean is, for a target which does not have a dedicated PSADBW instruction, the individual steps of 'usadm' must be synthesized in such a way as to match the expected behaviour of the tree code. So, I must expand 'usadm' to a series of equivalent instructions as USAD_EXPR expects. If USAD_EXPR requires me to emit a reduction on each loop iteration, I think that will be inefficient compared to performing the reduction after the loop body. To a first approximation on ARM, I would expect from your description of 'usadm' that generating, VABAL ops[3], ops[1], ops[2] (Vector widening Absolute Difference and Accumulate) would fulfil the requirements. But to match the behaviour you have
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Thank you for your detailed explanation. Once GCC detects a reduction operation, it will automatically accumulate all elements in the vector after the loop. In the loop the reduction variable is always a vector whose elements are reductions of corresponding values from other vectors. Therefore in your case the only instruction you need to generate is: VABAL ops[3], ops[1], ops[2] It is OK if you accumulate the elements into one in the vector inside of the loop (if one instruction can do this), but you have to make sure other elements in the vector should remain zero so that the final result is correct. If you are confused about the documentation, check the one for udot_prod (just above usad in md.texi), as it has very similar behavior as usad. Actually I copied the text from there and did some changes. As those two instruction patterns are both for vectorization, their behavior should not be difficult to explain. If you have more questions or think that the documentation is still improper please let me know. Thank you very much! Cong On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh james.greenha...@arm.com wrote: On Mon, Nov 04, 2013 at 06:30:55PM +, Cong Hou wrote: On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh james.greenha...@arm.com wrote: On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote: diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index 2a5a2e1..8f5d39a 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the product. The result is placed in operand 0, which is of the same mode as operand 3. +@cindex @code{ssad@var{m}} instruction pattern +@item @samp{ssad@var{m}} +@cindex @code{usad@var{m}} instruction pattern +@item @samp{usad@var{m}} +Compute the sum of absolute differences of two signed/unsigned elements. +Operand 1 and operand 2 are of the same mode. 
Their absolute difference, which +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode +equal or wider than the mode of the absolute difference. The result is placed +in operand 0, which is of the same mode as operand 3. + @cindex @code{ssum_widen@var{m3}} instruction pattern @item @samp{ssum_widen@var{m3}} @cindex @code{usum_widen@var{m3}} instruction pattern diff --git a/gcc/expr.c b/gcc/expr.c index 4975a64..1db8a49 100644 I'm not sure I follow, and if I do - I don't think it matches what you have implemented for i386. From your text description I would guess the series of operations to be: v1 = widen (operands[1]) v2 = widen (operands[2]) v3 = abs (v1 - v2) operands[0] = v3 + operands[3] But if I understand the behaviour of PSADBW correctly, what you have actually implemented is: v1 = widen (operands[1]) v2 = widen (operands[2]) v3 = abs (v1 - v2) v4 = reduce_plus (v3) operands[0] = v4 + operands[3] To my mind, synthesizing the reduce_plus step will be wasteful for targets who do not get this for free with their Absolute Difference step. Imagine a simple loop where we have synthesized the reduce_plus, we compute partial sums each loop iteration, though we would be better to leave the reduce_plus step until after the loop. REDUC_PLUS_EXPR would be the appropriate Tree code for this. What do you mean when you use synthesizing here? For each pattern, the only synthesized operation is the one being returned from the pattern recognizer. In this case, it is USAD_EXPR. The recognition of reduce sum is necessary as we need corresponding prolog and epilog for reductions, which is already done before pattern recognition. Note that reduction is not a pattern but is a type of vector definition. A vectorization pattern can still be a reduction operation as long as STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You can check the other two reduction patterns: widen_sum_pattern and dot_prod_pattern for reference. 
My apologies for not being clear. What I mean is, for a target which does not have a dedicated PSADBW instruction, the individual steps of 'usadm' must be synthesized in such a way as to match the expected behaviour of the tree code. So, I must expand 'usadm' to a series of equivalent instructions as USAD_EXPR expects. If USAD_EXPR requires me to emit a reduction on each loop iteration, I think that will be inefficient compared to performing the reduction after the loop body. To a first approximation on ARM, I would expect from your description of 'usadm' that generating, VABAL ops[3], ops[1], ops[2] (Vector widening Absolute Difference and Accumulate) would fulfil the requirements. But to match the behaviour you have implemented in the i386 backend I would be required to generate: VABAL ops[3], ops[1], ops[2] VPADD ops[3], ops[3], ops[3] (add one set
Re: [PATCH] Handling == or != comparisons that may affect range test optimization.
It seems there have been some changes in GCC. But if you change the type of n into signed int, the issue appears again: int foo(int n) { if (n != 0) if (n != 1) if (n != 2) if (n != 3) if (n != 4) return ++n; return n; } Ifcombine also suffers from the same issue here. thanks, Cong On Tue, Nov 5, 2013 at 12:53 PM, Jakub Jelinek ja...@redhat.com wrote: On Tue, Nov 05, 2013 at 01:23:00PM -0700, Jeff Law wrote: On 10/31/13 18:03, Cong Hou wrote: (This patch is for the bug 58728: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728) As in the bug report, consider the following loop: int foo(unsigned int n) { if (n != 0) if (n != 1) if (n != 2) if (n != 3) if (n != 4) return ++n; return n; } The range test optimization should be able to merge all those five conditions into one in the reassoc pass, but it fails to do so. The reason is that the phi arg of n is replaced by the constant it compares to in case of == or != comparisons (in the vrp pass). GCC checks there is no side effect on n between any two neighboring conditions by examining if they define the same phi arg in the join node. But as the phi arg is replaced by a constant, the check fails. I can't reproduce this, at least not on x86_64-linux with -O2, the ifcombine pass already merges those. Jakub
Re: [PATCH] Handling == or != comparisons that may affect range test optimization.
On Tue, Nov 5, 2013 at 12:23 PM, Jeff Law l...@redhat.com wrote: On 10/31/13 18:03, Cong Hou wrote: (This patch is for the bug 58728: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728) As in the bug report, consider the following loop: int foo(unsigned int n) { if (n != 0) if (n != 1) if (n != 2) if (n != 3) if (n != 4) return ++n; return n; } The range test optimization should be able to merge all those five conditions into one in the reassoc pass, but it fails to do so. The reason is that the phi arg of n is replaced by the constant it compares to in case of == or != comparisons (in the vrp pass). GCC checks there is no side effect on n between any two neighboring conditions by examining if they define the same phi arg in the join node. But as the phi arg is replaced by a constant, the check fails. This patch deals with this situation by considering the existence of == or != comparisons, which is attached below (a text file is also attached with proper tabs). Bootstrap and make check both pass. Any comment?
+ bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR + || gimple_cond_code (stmt) == EQ_EXPR) + && TREE_CODE (phi_arg) == INTEGER_CST; + + if (is_eq_expr) + { + lhs = gimple_cond_lhs (stmt); + rhs = gimple_cond_rhs (stmt); + + if (operand_equal_p (lhs, phi_arg, 0)) + { + tree t = lhs; + lhs = rhs; + rhs = t; + } + if (operand_equal_p (rhs, phi_arg, 0) + && operand_equal_p (lhs, phi_arg2, 0)) + continue; + } + + gimple stmt2 = last_stmt (test_bb); + bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND + && (gimple_cond_code (stmt2) == NE_EXPR + || gimple_cond_code (stmt2) == EQ_EXPR) + && TREE_CODE (phi_arg2) == INTEGER_CST; + + if (is_eq_expr2) + { + lhs2 = gimple_cond_lhs (stmt2); + rhs2 = gimple_cond_rhs (stmt2); + + if (operand_equal_p (lhs2, phi_arg2, 0)) + { + tree t = lhs2; + lhs2 = rhs2; + rhs2 = t; + } + if (operand_equal_p (rhs2, phi_arg2, 0) + && operand_equal_p (lhs2, phi_arg, 0)) + continue; + } Can you factor those two hunks of nearly identical code into a single function and call it twice? I'm also curious if you really need the code to swap lhs/rhs. When can the LHS of a cond be an integer constant? Don't we canonicalize it as ssa_name COND constant? I was not aware that the comparison between a variable and a constant will always be canonicalized as ssa_name COND constant. Then I will remove the swap, and as the code is much smaller, I think it may not be necessary to create a function for them. I'd probably write the ChangeLog as: * tree-ssa-reassoc.c (suitable_cond_bb): Handle constant PHI operands resulting from propagation of edge equivalences. OK, much better than mine ;) I'm also curious -- did this code show up in a particular benchmark, if so, which one? I didn't find this problem from any benchmark, but from another concern about loop upper bound estimation.
Look at the following code: int foo(unsigned int n, int r) { int i; if (n > 0) if (n < 4) { do { --n; r *= 2; } while (n > 0); } return r+n; } In order to get the upper bound of the loop in this function, GCC traverses conditions n<4 and n>0 separately and tries to get any useful information. But as those two conditions cannot be combined into one due to this issue (note that n>0 will be transformed into n!=0), when GCC sees n<4, it will consider the possibility that n may be equal to 0, in which case the upper bound is UINT_MAX. If those two conditions can be combined into one, which is n-1<=2, then we can get the correct upper bound of the loop. thanks, Cong jeff
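For reference, the single unsigned range check that merging the five "n != k" tests (k = 0..4) should produce can be written out directly. This is a hedged sketch with illustrative names, not code from the patch; it relies on the standard trick that for unsigned n, a chain of inequality exclusions over a contiguous range collapses to one subtract-and-compare:

```c
/* Merged form of: if (n != 0) if (n != 1) if (n != 2)
                   if (n != 3) if (n != 4) return ++n; return n;
   For unsigned n, all five tests hold exactly when n > 4
   (equivalently n - 0 > 4, the canonical range-test shape).  */
unsigned foo_merged (unsigned n)
{
  if (n - 0 > 4)   /* one range check replaces five != tests */
    return n + 1;
  return n;
}
```

With the five tests folded into one comparison, later passes see a single constraint on n, which is what makes the loop-bound reasoning in the message above work.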
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh james.greenha...@arm.com wrote: On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote: diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index 2a5a2e1..8f5d39a 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the product. The result is placed in operand 0, which is of the same mode as operand 3. +@cindex @code{ssad@var{m}} instruction pattern +@item @samp{ssad@var{m}} +@cindex @code{usad@var{m}} instruction pattern +@item @samp{usad@var{m}} +Compute the sum of absolute differences of two signed/unsigned elements. +Operand 1 and operand 2 are of the same mode. Their absolute difference, which +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode +equal or wider than the mode of the absolute difference. The result is placed +in operand 0, which is of the same mode as operand 3. + @cindex @code{ssum_widen@var{m3}} instruction pattern @item @samp{ssum_widen@var{m3}} @cindex @code{usum_widen@var{m3}} instruction pattern diff --git a/gcc/expr.c b/gcc/expr.c index 4975a64..1db8a49 100644 I'm not sure I follow, and if I do - I don't think it matches what you have implemented for i386. From your text description I would guess the series of operations to be: v1 = widen (operands[1]) v2 = widen (operands[2]) v3 = abs (v1 - v2) operands[0] = v3 + operands[3] But if I understand the behaviour of PSADBW correctly, what you have actually implemented is: v1 = widen (operands[1]) v2 = widen (operands[2]) v3 = abs (v1 - v2) v4 = reduce_plus (v3) operands[0] = v4 + operands[3] To my mind, synthesizing the reduce_plus step will be wasteful for targets who do not get this for free with their Absolute Difference step. 
Imagine a simple loop where we have synthesized the reduce_plus, we compute partial sums each loop iteration, though we would be better to leave the reduce_plus step until after the loop. REDUC_PLUS_EXPR would be the appropriate Tree code for this. What do you mean when you use synthesizing here? For each pattern, the only synthesized operation is the one being returned from the pattern recognizer. In this case, it is USAD_EXPR. The recognition of reduce sum is necessary as we need corresponding prolog and epilog for reductions, which is already done before pattern recognition. Note that reduction is not a pattern but is a type of vector definition. A vectorization pattern can still be a reduction operation as long as STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You can check the other two reduction patterns: widen_sum_pattern and dot_prod_pattern for reference. Thank you for your comment! Cong I would prefer to see this Tree code not imply the reduce_plus. Thanks, James
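To pin down what the discussion above is about, here is a hedged scalar sketch of the reduction loop the usad pattern is meant to vectorize (names and signature are illustrative, not from the patch): each element pair is widened, subtracted, its absolute value taken, and the result accumulated into a wider reduction variable.

```c
#include <stddef.h>

/* Scalar semantics of a sum-of-absolute-differences reduction over
   unsigned bytes, accumulating into a wider type.  */
unsigned int sad_u8 (const unsigned char *a, const unsigned char *b,
                     size_t n, unsigned int acc)
{
  for (size_t i = 0; i < n; i++)
    {
      int d = (int) a[i] - (int) b[i];   /* widen + subtract */
      acc += d < 0 ? -d : d;             /* abs + accumulate  */
    }
  return acc;
}
```

Whether a target keeps per-lane partial sums inside the loop (ARM's VABAL style) or reduces across lanes each iteration (the x86 PSADBW style) is exactly the freedom the two expansions debated above differ on; the scalar result is the same either way.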
[PATCH] Small fix: add { dg-require-effective-target vect_int } to testsuite/gcc.dg/vect/pr58508.c
It seems that on some platforms the loops in testsuite/gcc.dg/vect/pr58508.c may not be vectorizable. This small patch adds { dg-require-effective-target vect_int } to make sure all loops can be vectorized. thanks, Cong diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 9d0f4a5..3d9916d 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,7 @@ +2013-10-29 Cong Hou co...@google.com + + * gcc.dg/vect/pr58508.c: Update. + 2013-10-15 Cong Hou co...@google.com * gcc.dg/vect/pr58508.c: New test. diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c index 6484a65..fff7a04 100644 --- a/gcc/testsuite/gcc.dg/vect/pr58508.c +++ b/gcc/testsuite/gcc.dg/vect/pr58508.c @@ -1,3 +1,4 @@ +/* { dg-require-effective-target vect_int } */ /* { dg-do compile } */ /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
This update makes it safer. You showed me how to write better expand code. Thank you for the improvement! thanks, Cong On Thu, Oct 31, 2013 at 11:43 AM, Uros Bizjak ubiz...@gmail.com wrote: On Wed, Oct 30, 2013 at 9:02 PM, Cong Hou co...@google.com wrote: I have run check_GNU_style.sh on my patch. The patch is submitted. Thank you for your comments and help on this patch! I have committed a couple of fixes/improvements to your expander in i386.c. There is no need to check for the result of expand_simple_binop. Also, there is no guarantee that expand_simple_binop will expand to the target. It can return a different RTX. Also, unhandled modes are now marked with gcc_unreachable. 2013-10-31 Uros Bizjak ubiz...@gmail.com * config/i386/i386.c (ix86_expand_sse2_abs): Rename function arguments. Use gcc_unreachable for unhandled modes. Do not check results of expand_simple_binop. If not expanded to target, move the result. Tested on x86_64-pc-linux-gnu and committed. Uros.
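For readers unfamiliar with how abs() is vectorized without a dedicated instruction, here is a hedged per-lane sketch of the classic branchless idiom an SSE2 expander can build from shift/xor/subtract (an assumption about the general technique, not a transcription of ix86_expand_sse2_abs):

```c
#include <stdint.h>

/* Branchless abs for one 32-bit lane: mask is 0 for non-negative x
   and -1 (all ones) for negative x, so (x ^ mask) - mask negates x
   exactly when it is negative.  Assumes arithmetic right shift on
   signed int, which mainstream compilers provide.  */
static int32_t abs_branchless (int32_t x)
{
  int32_t mask = x >> 31;       /* 0 if x >= 0, -1 if x < 0 */
  return (x ^ mask) - mask;
}
```

Applied lane-wise to a vector register, this needs only an arithmetic shift, an XOR, and a subtract, which is why it maps cleanly onto SSE2 for element widths that lack a native absolute-value instruction.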
[PATCH] Handling == or != comparisons that may affect range test optimization.
(This patch is for the bug 58728: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728) As in the bug report, consider the following loop: int foo(unsigned int n) { if (n != 0) if (n != 1) if (n != 2) if (n != 3) if (n != 4) return ++n; return n; } The range test optimization should be able to merge all those five conditions into one in the reassoc pass, but it fails to do so. The reason is that the phi arg of n is replaced by the constant it compares to in case of == or != comparisons (in the vrp pass). GCC checks there is no side effect on n between any two neighboring conditions by examining if they define the same phi arg in the join node. But as the phi arg is replaced by a constant, the check fails. This patch deals with this situation by considering the existence of == or != comparisons, which is attached below (a text file is also attached with proper tabs). Bootstrap and make check both pass. Any comment? thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..9247222 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,11 @@ +2013-10-31 Cong Hou co...@google.com + + PR tree-optimization/58728 + * tree-ssa-reassoc.c (suitable_cond_bb): Consider the situation + that ==/!= comparisons between a variable and a constant may lead + to the later phi arg of the variable being substituted by the + constant by prior passes, during range test optimization. + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..44a5e70 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-10-31 Cong Hou co...@google.com + + PR tree-optimization/58728 + * gcc.dg/tree-ssa/pr58728: New test.
+ 2013-10-14 Tobias Burnus bur...@net-b.de PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c new file mode 100644 index 000..312aebc --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr58728.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-reassoc1-details" } */ + +int foo (unsigned int n) +{ + if (n != 0) +if (n != 1) + return ++n; + return n; +} + +int bar (unsigned int n) +{ + if (n == 0) +; + else if (n == 1) +; + else +return ++n; + return n; +} + + +/* { dg-final { scan-tree-dump-times "Optimizing range tests" 2 "reassoc1" } } */ +/* { dg-final { cleanup-tree-dump "reassoc1" } } */ diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c index 6859518..bccf99f 100644 --- a/gcc/tree-ssa-reassoc.c +++ b/gcc/tree-ssa-reassoc.c @@ -2426,11 +2426,70 @@ suitable_cond_bb (basic_block bb, basic_block test_bb, basic_block *other_bb, for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi)) { gimple phi = gsi_stmt (gsi); + tree phi_arg = gimple_phi_arg_def (phi, e->dest_idx); + tree phi_arg2 = gimple_phi_arg_def (phi, e2->dest_idx); + /* If both BB and TEST_BB end with GIMPLE_COND, all PHI arguments corresponding to BB and TEST_BB predecessor must be the same. */ - if (!operand_equal_p (gimple_phi_arg_def (phi, e->dest_idx), -gimple_phi_arg_def (phi, e2->dest_idx), 0)) - { + if (!operand_equal_p (phi_arg, phi_arg2, 0)) + { + /* If the condition in BB or TEST_BB is an NE or EQ comparison like + if (n != N) or if (n == N), it is possible that the corresponding + def of n in the phi function is replaced by N. We should still allow + range test optimization in this case. 
*/ + + tree lhs = NULL, rhs = NULL, + lhs2 = NULL, rhs2 = NULL; + bool is_eq_expr = is_cond && (gimple_cond_code (stmt) == NE_EXPR + || gimple_cond_code (stmt) == EQ_EXPR) + && TREE_CODE (phi_arg) == INTEGER_CST; + + if (is_eq_expr) + { +lhs = gimple_cond_lhs (stmt); +rhs = gimple_cond_rhs (stmt); + +if (operand_equal_p (lhs, phi_arg, 0)) + { + tree t = lhs; + lhs = rhs; + rhs = t; + } +if (operand_equal_p (rhs, phi_arg, 0) + && operand_equal_p (lhs, phi_arg2, 0)) + continue; + } + + gimple stmt2 = last_stmt (test_bb); + bool is_eq_expr2 = gimple_code (stmt2) == GIMPLE_COND + && (gimple_cond_code (stmt2) == NE_EXPR + || gimple_cond_code (stmt2) == EQ_EXPR) + && TREE_CODE (phi_arg2) == INTEGER_CST; + + if (is_eq_expr2) + { +lhs2 = gimple_cond_lhs (stmt2); +rhs2 = gimple_cond_rhs (stmt2); + +if (operand_equal_p (lhs2, phi_arg2, 0)) + { + tree t = lhs2; + lhs2 = rhs2; + rhs2 = t; + } +if (operand_equal_p (rhs2, phi_arg2, 0) + && operand_equal_p (lhs2, phi_arg, 0)) + continue; + } + + if (is_eq_expr && is_eq_expr2) + { +if (operand_equal_p (rhs, phi_arg, 0) + && operand_equal_p (rhs2, phi_arg2, 0) + && operand_equal_p (lhs, lhs2, 0)) + continue; + } + /* Otherwise, if one of the blocks doesn't end with GIMPLE_COND, one of the PHIs should have the lhs of the last stmt in that block as PHI arg
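The transformation this patch enables can be illustrated in plain C. This is a hypothetical scalar sketch (the function names are illustrative, not from GCC): the five chained != tests in the bug report's foo() exclude exactly the range [0, 4], so reassoc's range test optimization can fold them into a single unsigned comparison against the range's upper bound.

```c
#include <assert.h>

/* Five chained tests, as written in the bug report's foo().  */
unsigned int foo_chained (unsigned int n)
{
  if (n != 0)
    if (n != 1)
      if (n != 2)
        if (n != 3)
          if (n != 4)
            return ++n;
  return n;
}

/* What the merged range test computes: one comparison against the
   upper bound of the excluded range [0, 4].  */
unsigned int foo_merged (unsigned int n)
{
  return n > 4 ? n + 1 : n;
}
```

Both versions agree for every input; the merged form is what the optimization should produce once the substituted-constant PHI args are recognized.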
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
I found my problem: I put DONE outside of the if, not inside. You are right. I have updated my patch. I appreciate your comment and test on it! thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). 
*/ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. 
*/ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..46e1df4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs<mode>2" +(define_insn "*abs<mode>2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))] @@ -8733,6 +8733,19 @@ (set_attr "prefix" "maybe_vex") (set_attr "mode" "<sseinsnmode>")]) +(define_expand "abs<mode>2" + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") + (abs:VI124_AVX2_48_AVX512F + (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] + "TARGET_SSE2" +{ + if (!TARGET_SSSE3) +{ + ix86_expand_sse2_abs (operands[0], operands[1]); + DONE; +} +}) + (define_insn "abs<mode>2" [(set (match_operand:MMXMODEI 0 "register_operand" "=y") (abs:MMXMODEI diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..cf5b942 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * gcc.dg/vect/pr58762.c: New test. + 2013-10-14 Tobias Burnus bur...@net-b.de PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/vect/pr58762.c b/gcc/testsuite/gcc.dg/vect/pr58762.c new file mode 100644 index 000..6468d0a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58762.c @@ -0,0 +1,28 @@ +/* { dg-require-effective-target vect_int } */ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize" } */ + +void test1 (char* a, char* b) +{ + int i; + for (i = 0; i < 1; ++i) +a[i] = abs (b[i]); +} + +void test2 (short* a, short* b) +{ + int i; + for (i = 0; i < 1; ++i) +a[i] = abs (b[i]); +} + +void test3 (int* a, int* b) +{ + int i; + for (i = 0; i < 1; ++i) +a[i] = abs (b[i]); +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 3 vect
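The three per-width strategies in ix86_expand_sse2_abs can be checked against scalar equivalents. A minimal sketch, assuming x86's arithmetic right shift for signed integers (the helper names are illustrative, not part of the patch):

```c
#include <assert.h>

/* 32-bit: ((x >> 31) ^ x) - (x >> 31); the shift broadcasts the sign
   bit, so m is 0 for non-negative x and -1 (all ones) otherwise.  */
static int abs32_shift_xor_sub (int x)
{
  int m = x >> 31;   /* assumes arithmetic shift, as on x86 */
  return (m ^ x) - m;
}

/* 16-bit: max (x, -x), which PMAXSW computes lane-wise.  */
static short abs16_smax (short x)
{
  short n = (short) -x;
  return x > n ? x : n;
}

/* 8-bit: min ((unsigned char) x, (unsigned char) -x), which PMINUB
   computes lane-wise; exactly one of the two reinterpreted bytes
   equals the magnitude of x.  */
static signed char abs8_umin (signed char x)
{
  unsigned char a = (unsigned char) x;
  unsigned char b = (unsigned char) -x;
  return (signed char) (a < b ? a : b);
}
```

The point of the width-specific tricks is that SSE2 has no signed max for bytes and no unsigned min for words, so each element size uses the cheapest sequence its instruction set offers.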
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
Forget to attach the patch file. thanks, Cong On Wed, Oct 30, 2013 at 10:01 AM, Cong Hou co...@google.com wrote: I found my problem: I put DONE outside of if not inside. You are right. I have updated my patch. I appreciate your comment and test on it! thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X (W-1)) ^ X) - ((signed) X (W-1)). 
*/ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. 
*/ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..46e1df4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr prefix_rex) (symbol_ref x86_extended_reg_mentioned_p (insn))) (set_attr mode DI)]) -(define_insn absmode2 +(define_insn *absmode2 [(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand =v) (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand vm)))] @@ -8733,6 +8733,19 @@ (set_attr prefix maybe_vex) (set_attr mode sseinsnmode)]) +(define_expand absmode2 + [(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand) + (abs:VI124_AVX2_48_AVX512F + (match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand)))] + TARGET_SSE2 +{ + if (!TARGET_SSSE3) +{ + ix86_expand_sse2_abs (operands[0], operands[1]); + DONE; +} +}) + (define_insn absmode2 [(set (match_operand:MMXMODEI 0 register_operand =y) (abs:MMXMODEI diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..cf5b942 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * gcc.dg/vect/pr58762.c: New test. + 2013-10-14 Tobias Burnus bur...@net-b.de PR fortran/58658 diff --git a/gcc/testsuite/gcc.dg/vect/pr58762.c b/gcc/testsuite/gcc.dg/vect/pr58762.c new file mode 100644 index 000..6468d0a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr58762.c @@ -0,0 +1,28 @@ +/* { dg-require-effective-target vect_int } */ +/* { dg-do compile } */ +/* { dg-options -O2 -ftree-vectorize } */ + +void test1 (char* a, char* b) +{ + int i; + for (i = 0; i 1; ++i) +a[i] = abs (b[i]); +} + +void test2
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak ubiz...@gmail.com wrote: On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou co...@google.com wrote: I found my problem: I put DONE outside of the if, not inside. You are right. I have updated my patch. OK, great that we put things in order ;) Does this patch need some extra middle-end functionality? I was not able to vectorize the char and short part of your patch. In the original patch, I converted abs() on short and char values to their own types by removing type casts. That is, originally char_val1 = abs(char_val2) will be converted to char_val1 = (char) abs((int) char_val2) in the frontend, and I would like to convert it back to char_val1 = abs(char_val2). But after several discussions, it seems this conversion has some problems such as overflow concerns, and I thereby removed that part. Now you should still be able to vectorize abs(char) and abs(short), but with packing and unpacking. Later I will consider writing a pattern recognizer for abs(char) and abs(short), and then the expand on abs(char)/abs(short) in this patch will be used during vectorization. Regarding the testcase - please put it to gcc.target/i386/ directory. There is nothing generic in the test, as confirmed by target-dependent scan test. You will find plenty of examples in the mentioned directory. I'd suggest to split the testcase in three files, and to simplify it to something like the testcase with global variables I used earlier. I have done it. The test case is split into three for s8/s16/s32 in gcc.target/i386. Thank you! Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). 
+ 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X (W-1)) ^ X) - ((signed) X (W-1)). */ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. 
*/ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. */ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..46e1df4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr prefix_rex) (symbol_ref x86_extended_reg_mentioned_p (insn))) (set_attr mode DI)]) -(define_insn absmode2 +(define_insn *absmode2 [(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand =v) (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand vm)))] @@ -8733,6 +8733,19 @@ (set_attr prefix maybe_vex) (set_attr mode sseinsnmode)]) +(define_expand absmode2
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
Also, as the current expanders for abs() on 8/16-bit integers are not used at all, should I comment them out temporarily for now? Later I can uncomment them once I have finished the pattern recognizer. thanks, Cong On Wed, Oct 30, 2013 at 10:22 AM, Uros Bizjak ubiz...@gmail.com wrote: On Wed, Oct 30, 2013 at 6:01 PM, Cong Hou co...@google.com wrote: I found my problem: I put DONE outside of the if, not inside. You are right. I have updated my patch. OK, great that we put things in order ;) Does this patch need some extra middle-end functionality? I was not able to vectorize the char and short part of your patch. Regarding the testcase - please put it to gcc.target/i386/ directory. There is nothing generic in the test, as confirmed by target-dependent scan test. You will find plenty of examples in the mentioned directory. I'd suggest to split the testcase in three files, and to simplify it to something like the testcase with global variables I used earlier. Modulo testcase, the patch is OK otherwise, but middle-end parts should be committed first. Thanks, Uros.
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
I have run check_GNU_style.sh on my patch. The patch is submitted. Thank you for your comments and help on this patch! thanks, Cong On Wed, Oct 30, 2013 at 11:13 AM, Uros Bizjak ubiz...@gmail.com wrote: On Wed, Oct 30, 2013 at 7:01 PM, Cong Hou co...@google.com wrote: I found my problem: I put DONE outside of the if, not inside. You are right. I have updated my patch. OK, great that we put things in order ;) Does this patch need some extra middle-end functionality? I was not able to vectorize the char and short part of your patch. In the original patch, I converted abs() on short and char values to their own types by removing type casts. That is, originally char_val1 = abs(char_val2) will be converted to char_val1 = (char) abs((int) char_val2) in the frontend, and I would like to convert it back to char_val1 = abs(char_val2). But after several discussions, it seems this conversion has some problems such as overflow concerns, and I thereby removed that part. Now you should still be able to vectorize abs(char) and abs(short), but with packing and unpacking. Later I will consider writing a pattern recognizer for abs(char) and abs(short), and then the expand on abs(char)/abs(short) in this patch will be used during vectorization. OK, this seems reasonable. We already have unused SSSE3 8/16 bit abs pattern, so I think we can commit SSE2 expanders, even if they will be unused for now. The proposed recognizer will benefit SSE2 as well as existing SSSE3 patterns. Regarding the testcase - please put it to gcc.target/i386/ directory. There is nothing generic in the test, as confirmed by target-dependent scan test. You will find plenty of examples in the mentioned directory. I'd suggest to split the testcase in three files, and to simplify it to something like the testcase with global variables I used earlier. I have done it. The test case is split into three for s8/s16/s32 in gcc.target/i386. OK. 
The patch is OK for mainline, but please check formatting and whitespace before the patch is committed. Thanks, Uros.
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
On Wed, Oct 30, 2013 at 4:27 AM, Richard Biener rguent...@suse.de wrote: On Tue, 29 Oct 2013, Cong Hou wrote: Hi SAD (Sum of Absolute Differences) is a common and important algorithm in image processing and other areas. SSE2 even introduced a new instruction PSADBW for it. A SAD loop can be greatly accelerated by this instruction after being vectorized. This patch introduced a new operation SAD_EXPR and a SAD pattern recognizer in vectorizer. The pattern of SAD is shown below: unsigned type x_t, y_t; signed TYPE1 diff, abs_diff; TYPE2 sum = init; loop: sum_0 = phi init, sum_1 S1 x_t = ... S2 y_t = ... S3 x_T = (TYPE1) x_t; S4 y_T = (TYPE1) y_t; S5 diff = x_T - y_T; S6 abs_diff = ABS_EXPR diff; [S7 abs_diff = (TYPE2) abs_diff; #optional] S8 sum_1 = abs_diff + sum_0; where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the same size of 'TYPE1' or bigger. This is a special case of a reduction computation. For SSE2, type is char, and TYPE1 and TYPE2 are int. In order to express this new operation, a new expression SAD_EXPR is introduced in tree.def, and the corresponding entry in optabs is added. The patch also added the define_expand for SSE2 and AVX2 platforms for i386. The patch is pasted below and also attached as a text file (in which you can see tabs). Bootstrap and make check got passed on x86. Please give me your comments. Apart from the testcase comment made earlier +++ b/gcc/tree-cfg.c @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt) return false; case DOT_PROD_EXPR: +case SAD_EXPR: case REALIGN_LOAD_EXPR: /* FIXME. */ return false; please add proper verification of the operand types. OK. +/* Widening sad (sum of absolute differences). + The first two arguments are of type t1 which should be unsigned integer. + The third argument and the result are of type t2, such that t2 is at least + twice the size of t1. 
SAD_EXPR(arg1,arg2,arg3) is equivalent to: + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2); + tmp2 = ABS_EXPR (tmp1); + arg3 = PLUS_EXPR (tmp2, arg3); */ +DEFTREECODE (SAD_EXPR, sad_expr, tcc_expression, 3) WIDEN_MINUS_EXPR doesn't exist so you have to explain on its operation (it returns a signed wide difference?). Why should the first two arguments be unsigned? I cannot see a good reason to require that (other than that maybe the x86 target only has support for widened unsigned difference?). So if you want to make that restriction maybe change the name to SADU_EXPR (sum of absolute differences of unsigned)? I suppose you tried introducing WIDEN_MINUS_EXPR instead and letting combine do it's work, avoiding the very special optab? I may use the wrong representation here. I think the behavior of WIDEN_MINUS_EXPR in SAD is different from the general one. SAD usually works on unsigned integers (see http://en.wikipedia.org/wiki/Sum_of_absolute_differences), and before getting the difference between two unsigned integers, they are promoted to bigger signed integers. And the result of (int)(char)(1) - (int)(char)(-1) is different from (int)(unsigned char)(1) - (int)(unsigned char)(-1). So we cannot implement SAD using WIDEN_MINUS_EXPR. Also, the SSE2 instruction PSADBW also requires the operands to be unsigned 8-bit integers. I will remove the improper description as you pointed out. thanks, Cong Thanks, Richard. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..d528307 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,23 @@ +2013-10-29 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. 
+ * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 7ed29f5..9ec761a 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case
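Cong's point about signedness can be made concrete with a hedged scalar sketch (not from the patch): the byte pattern 0xff widens to -1 as a signed char but to 255 as an unsigned char, so the absolute difference against 1 changes, which is why SAD_EXPR is restricted to unsigned element types, matching PSADBW.

```c
#include <assert.h>

/* Widen to int first, then take the absolute difference, exactly as
   in steps S3-S6 of the SAD pattern; only the element type differs.  */
static int absdiff_from_signed (signed char x, signed char y)
{
  int d = (int) x - (int) y;
  return d < 0 ? -d : d;
}

static int absdiff_from_unsigned (unsigned char x, unsigned char y)
{
  int d = (int) x - (int) y;
  return d < 0 ? -d : d;
}
```

For the bit patterns 0x01 and 0xff this gives |1 - (-1)| = 2 under signed widening but |1 - 255| = 254 under unsigned widening, so a signedness-agnostic WIDEN_MINUS_EXPR followed by ABS_EXPR cannot describe both.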
Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
On Tue, Oct 29, 2013 at 4:49 PM, Ramana Radhakrishnan ramana@googlemail.com wrote: Cong, Please don't do the following. +++ b/gcc/testsuite/gcc.dg/vect/ vect-reduc-sad.c @@ -0,0 +1,54 @@ +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */ you are adding a test to gcc.dg/vect - It's a common directory containing tests that need to run on multiple architectures and such tests should be keyed by the feature they enable which can be turned on for ports that have such an instruction. The correct way of doing this is to key this on the feature something like dg-require-effective-target vect_sad_char . And define the equivalent routine in testsuite/lib/target-supports.exp and enable it for sse2 for the x86 port. If in doubt look at check_effective_target_vect_int and a whole family of such functions in testsuite/lib/target-supports.exp This makes life easy for other port maintainers who want to turn on this support. And for bonus points please update the testcase writing wiki page with this information if it isn't already there. OK, I will likely move the test case to gcc.target/i386 as currently only SSE2 provides SAD instruction. But your suggestion also helps! You are also missing documentation updates for SAD_EXPR, md.texi for the new standard pattern name. Shouldn't it be called sadmode4 really ? I will add the documentation for the new operation SAD_EXPR. I use sadmode by just following udot_prodmode as those two operations are quite similar: OPTAB_D (udot_prod_optab, udot_prod$I$a) thanks, Cong regards Ramana On Tue, Oct 29, 2013 at 10:23 PM, Cong Hou co...@google.com wrote: Hi SAD (Sum of Absolute Differences) is a common and important algorithm in image processing and other areas. SSE2 even introduced a new instruction PSADBW for it. A SAD loop can be greatly accelerated by this instruction after being vectorized. This patch introduced a new operation SAD_EXPR and a SAD pattern recognizer in vectorizer. 
The pattern of SAD is shown below: unsigned type x_t, y_t; signed TYPE1 diff, abs_diff; TYPE2 sum = init; loop: sum_0 = phi init, sum_1 S1 x_t = ... S2 y_t = ... S3 x_T = (TYPE1) x_t; S4 y_T = (TYPE1) y_t; S5 diff = x_T - y_T; S6 abs_diff = ABS_EXPR diff; [S7 abs_diff = (TYPE2) abs_diff; #optional] S8 sum_1 = abs_diff + sum_0; where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the same size of 'TYPE1' or bigger. This is a special case of a reduction computation. For SSE2, type is char, and TYPE1 and TYPE2 are int. In order to express this new operation, a new expression SAD_EXPR is introduced in tree.def, and the corresponding entry in optabs is added. The patch also added the define_expand for SSE2 and AVX2 platforms for i386. The patch is pasted below and also attached as a text file (in which you can see tabs). Bootstrap and make check got passed on x86. Please give me your comments. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..d528307 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,23 @@ +2013-10-29 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise. + * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. 
+ 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 7ed29f5..9ec761a 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case WIDEN_MULT_PLUS_EXPR: case WIDEN_MULT_MINUS_EXPR: case FMA_EXPR: diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..ca1ab70 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -6052,6 +6052,40 @@ DONE; }) +(define_expand sadv16qi + [(match_operand:V4SI 0 register_operand) + (match_operand:V16QI 1 register_operand) + (match_operand:V16QI 2 register_operand) + (match_operand:V4SI 3 register_operand)] + TARGET_SSE2 +{ + rtx
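For reference, the loop shape the proposed recognizer matches can be written out in plain C. This is a sketch of the scalar reduction, not the vectorized code; PSADBW performs the same widen-subtract-abs-accumulate over 16 bytes at a time:

```c
#include <assert.h>

static int sad (const unsigned char *x, const unsigned char *y, int n)
{
  int sum = 0;                               /* TYPE2 sum = init          */
  for (int i = 0; i < n; i++)
    {
      int diff = (int) x[i] - (int) y[i];    /* S3-S5: widen and subtract */
      int abs_diff = diff < 0 ? -diff : diff;/* S6: ABS_EXPR              */
      sum += abs_diff;                       /* S8: reduction add         */
    }
  return sum;
}
```

Note the accumulator type (int) is wider than the element type (unsigned char), matching the pattern's requirement that TYPE2 be at least twice the size of 'type'.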
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Tue, Oct 29, 2013 at 1:38 AM, Uros Bizjak ubiz...@gmail.com wrote: Hello! For the define_expand I added as below, the else body is there to avoid fall-through transformations to ABS operation in optabs.c. Otherwise ABS will be converted to other operations even that we have corresponding instructions from SSSE3. No, it wont be. Fallthrough will generate the pattern that will be matched by the insn pattern above, just like you are doing by hand below. I think the case is special for abs(). In optabs.c, there is a function expand_abs() in which the function expand_abs_nojump() is called. This function first tries the expand function defined for the target and if it fails it will try max(v, -v) then shift-xor-sub method. If I don't generate any instruction for SSSE3, the fall-through will be max(v, -v). I have tested it on my machine. (define_expand absmode2 [(set (match_operand:VI124_AVX2_48_AVX512F 0 register_operand) (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 nonimmediate_operand)))] TARGET_SSE2 { if (!TARGET_SSSE3) ix86_expand_sse2_abs (operands[0], force_reg (MODEmode, operands[1])); Do you really need force_reg here? You are using generic expanders in ix86_expand_sse2_abs that can handle non-registers operands just as well. You are right. I have removed force_reg. else emit_insn (gen_rtx_SET (VOIDmode, operands[0], gen_rtx_ABS (MODEmode, operands[1]))); DONE; }) Please note that your mailer mangles indents. Please indent your code correctly. Right.. I also attached a text file in which all tabs are there. The updated patch is pasted below (and also in the attached file). Thank you very much for your comment! Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. 
+ * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit int). + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. */ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..0d9cefe 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs<mode>2" +(define_insn "*abs<mode>2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") (abs:VI124_AVX2_48_AVX512F
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Tue, Oct 29, 2013 at 10:34 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Oct 29, 2013 at 6:18 PM, Cong Hou co...@google.com wrote: For the define_expand I added as below, the else body is there to avoid fall-through transformations to ABS operation in optabs.c. Otherwise ABS will be converted to other operations even though we have corresponding instructions from SSSE3. No, it won't be. Fallthrough will generate the pattern that will be matched by the insn pattern above, just like you are doing by hand below. I think the case is special for abs(). In optabs.c, there is a function expand_abs() in which the function expand_abs_nojump() is called. This function first tries the expand function defined for the target, and if it fails it will try max(v, -v), then the shift-xor-sub method. If I don't generate any instruction for SSSE3, the fall-through will be max(v, -v). I have tested it on my machine. Huh, strange. Then you can rename the previous pattern to abs<mode>2_1 and call it from the new expander instead of expanding it manually. Please also add a small comment, describing the situation to prevent future optimizations in this place. Could you tell me how to do that? Is the renamed pattern abs<mode>2_1 also a define_expand? How to call this expander? Thank you! Cong Thanks, Uros.
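To make Uros' suggestion concrete, here is a hedged sketch of the shape he describes (the _1 name and the exact layout are assumptions based on this exchange, not the committed code): every named, non-"*" pattern automatically gets a gen_ function that the expander can emit directly.

```lisp
;; Rename the existing matching pattern; genemit then creates
;; gen_abs<mode>2_1 for it automatically.
(define_insn "abs<mode>2_1"
  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v")
	(abs:VI124_AVX2_48_AVX512F
	  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))]
  "TARGET_SSSE3"
  ;; ... output template and attributes as in the existing abs insn ...
  )

;; The expander calls the generator instead of building the ABS rtx
;; by hand.
(define_expand "abs<mode>2"
  [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand")
	(abs:VI124_AVX2_48_AVX512F
	  (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))]
  "TARGET_SSE2"
{
  /* Without SSSE3 there is no pabs insn; expand the SSE2 sequence
     here so optabs.c does not fall through to max (x, -x).  */
  if (!TARGET_SSSE3)
    {
      ix86_expand_sse2_abs (operands[0], operands[1]);
      DONE;
    }
  emit_insn (gen_abs<mode>2_1 (operands[0], operands[1]));
  DONE;
})
```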
[PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
Hi, SAD (Sum of Absolute Differences) is a common and important algorithm in image processing and other areas. SSE2 even introduced a new instruction PSADBW for it. A SAD loop can be greatly accelerated by this instruction after being vectorized. This patch introduces a new operation SAD_EXPR and a SAD pattern recognizer in the vectorizer. The pattern of SAD is shown below: unsigned type x_t, y_t; signed TYPE1 diff, abs_diff; TYPE2 sum = init; loop: sum_0 = PHI <init, sum_1> S1 x_t = ... S2 y_t = ... S3 x_T = (TYPE1) x_t; S4 y_T = (TYPE1) y_t; S5 diff = x_T - y_T; S6 abs_diff = ABS_EXPR <diff>; [S7 abs_diff = (TYPE2) abs_diff; #optional] S8 sum_1 = abs_diff + sum_0; where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the same size as 'TYPE1' or bigger. This is a special case of a reduction computation. For SSE2, type is char, and TYPE1 and TYPE2 are int. In order to express this new operation, a new expression SAD_EXPR is introduced in tree.def, and the corresponding entry in optabs is added. The patch also adds the define_expand for SSE2 and AVX2 platforms for i386. The patch is pasted below and also attached as a text file (in which you can see tabs). Bootstrap and make check passed on x86. Please give me your comments. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..d528307 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,23 @@ +2013-10-29 Cong Hou co...@google.com + + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD + pattern recognition. + (type_conversion_p): PROMOTION is true if it's a type promotion + conversion, and false otherwise. Return true if the given expression + is a type conversion one. + * tree-vectorizer.h: Adjust the number of patterns. + * tree.def: Add SAD_EXPR. + * optabs.def: Add sad_optab. + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case. + * expr.c (expand_expr_real_2): Likewise. + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+ * gimple.c (get_gimple_rhs_num_ops): Likewise. + * optabs.c (optab_for_tree_code): Likewise. + * tree-cfg.c (estimate_operator_cost): Likewise. + * tree-ssa-operands.c (get_expr_operands): Likewise. + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise. + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD. + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 7ed29f5..9ec761a 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp) { case COND_EXPR: case DOT_PROD_EXPR: + case SAD_EXPR: case WIDEN_MULT_PLUS_EXPR: case WIDEN_MULT_MINUS_EXPR: case FMA_EXPR: diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..ca1ab70 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -6052,6 +6052,40 @@ DONE; }) +(define_expand "sadv16qi" + [(match_operand:V4SI 0 "register_operand") + (match_operand:V16QI 1 "register_operand") + (match_operand:V16QI 2 "register_operand") + (match_operand:V4SI 3 "register_operand")] + "TARGET_SSE2" +{ + rtx t1 = gen_reg_rtx (V2DImode); + rtx t2 = gen_reg_rtx (V4SImode); + emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2])); + convert_move (t2, t1, 0); + emit_insn (gen_rtx_SET (VOIDmode, operands[0], + gen_rtx_PLUS (V4SImode, + operands[3], t2))); + DONE; +}) + +(define_expand "sadv32qi" + [(match_operand:V8SI 0 "register_operand") + (match_operand:V32QI 1 "register_operand") + (match_operand:V32QI 2 "register_operand") + (match_operand:V8SI 3 "register_operand")] + "TARGET_AVX2" +{ + rtx t1 = gen_reg_rtx (V4DImode); + rtx t2 = gen_reg_rtx (V8SImode); + emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2])); + convert_move (t2, t1, 0); + emit_insn (gen_rtx_SET (VOIDmode, operands[0], + gen_rtx_PLUS (V8SImode, + operands[3], t2))); + DONE; +}) + (define_insn "ashr<mode>3" [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x") (ashiftrt:VI24_AVX2 diff --git a/gcc/expr.c
b/gcc/expr.c index 4975a64..1db8a49 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode, return target; } + case SAD_EXPR: + { + tree oprnd0 = treeop0; + tree oprnd1 = treeop1; + tree oprnd2 = treeop2; + rtx op2; + + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL); + op2 = expand_normal (oprnd2); + target = expand_widen_pattern_expr (ops, op0, op1, op2, +target, unsignedp); + return target; + } + case REALIGN_LOAD_EXPR: { tree oprnd0 = treeop0; diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c index f0f8166..514ddd1 100644 --- a/gcc/gimple-pretty-print.c +++ b/gcc/gimple-pretty-print.c @@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags
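For readers who want the source-level picture, a scalar loop of exactly the shape the recognizer matches looks like the following (an illustrative example, not part of the patch); with SSE2 the whole body reduces to PSADBW plus a widening add:

```c
/* Scalar SAD loop matching the pattern above: unsigned char inputs
   widened to int (TYPE1/TYPE2), abs of the difference, reduction add.  */
unsigned int
sad (const unsigned char *x, const unsigned char *y, int n)
{
  unsigned int sum = 0;
  for (int i = 0; i < n; i++)
    {
      int diff = x[i] - y[i];                 /* S5: x_T - y_T         */
      int abs_diff = diff < 0 ? -diff : diff; /* S6: ABS_EXPR <diff>   */
      sum += abs_diff;                        /* S8: reduction add     */
    }
  return sum;
}
```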
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
As there are some issues with abs() type conversions, I removed the related content from the patch and kept only the SSE2 support for abs(int). For the define_expand I added as below, the else body is there to avoid fall-through transformations to ABS operation in optabs.c. Otherwise ABS will be converted to other operations even though we have corresponding instructions from SSSE3. (define_expand "abs<mode>2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] "TARGET_SSE2" { if (!TARGET_SSSE3) ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1])); else emit_insn (gen_rtx_SET (VOIDmode, operands[0], gen_rtx_ABS (<MODE>mode, operands[1]))); DONE; }) The patch is attached here. Please give me your comments. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..84c7ab5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. + * config/i386/i386.c (ix86_expand_sse2_abs): New function. + * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit int).
+ 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..ca31224 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_abs (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..71905fc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_abs (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char) (-X)), + as SSE2 provides the PMINUB insn. */ + case V16QImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (V16QImode, UMIN, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + default: + break; +} +} + /* Expand an insert into a vector register through pinsr insn. Return true if successful. */ diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index c3f6c94..b85ded4 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -8721,7 +8721,7 @@ (set (attr "prefix_rex") (symbol_ref "x86_extended_reg_mentioned_p (insn)")) (set_attr "mode" "DI")]) -(define_insn "abs<mode>2" +(define_insn "*abs<mode>2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand" "=v") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand" "vm")))] @@ -8733,6 +8733,20 @@ (set_attr "prefix" "maybe_vex") (set_attr "mode" "<sseinsnmode>")]) +(define_expand "abs<mode>2" + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") + (abs:VI124_AVX2_48_AVX512F + (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] + "TARGET_SSE2" +{ + if (!TARGET_SSSE3) +ix86_expand_sse2_abs (operands[0], force_reg (<MODE>mode, operands[1])); + else +emit_insn (gen_rtx_SET (VOIDmode, operands[0], +gen_rtx_ABS (<MODE>mode, operands[1]))); + DONE; +}) + (define_insn "abs<mode>2" [(set (match_operand:MMXMODEI 0 "register_operand" "=y") (abs:MMXMODEI diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 075d071..cf5b942 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2013-10-22 Cong Hou co...@google.com + + PR
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
I have updated the patch according to your suggestion, and have committed the patch as the bootstrapping and make check both get passed. Thank you for your patient help on this patch! I learned a lot from it. thanks, Cong On Wed, Oct 23, 2013 at 1:13 PM, Joseph S. Myers jos...@codesourcery.com wrote: On Mon, 7 Oct 2013, Cong Hou wrote: + if (type != newtype) +break; That comparison would wrongly treat as different cases where the types differ only in one being a typedef, having qualifiers, etc. - or if in future GCC implemented proposed TS 18661-3, cases where they differ in e.g. one being float and the other _Float32 (defined as distinct types that are not compatible although they have the same representation and alignment). I think the right test here, bearing in mind the _Float32 case where types may not be compatible, is TYPE_MODE (type) != TYPE_MODE (newtype) - if the types have the same mode, they have the same set of values and so are not different in any way that matters for this optimization. OK with that change. -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 23, 2013 at 11:18 PM, Jakub Jelinek ja...@redhat.com wrote: On Wed, Oct 23, 2013 at 09:40:21PM -0700, Cong Hou wrote: On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers jos...@codesourcery.com wrote: On Tue, 22 Oct 2013, Cong Hou wrote: For abs(char/short), type conversions are needed as the current abs() function/operation does not accept argument of char/short type. Therefore when we want to get the absolute value of a char_val using abs (char_val), it will be converted into abs ((int) char_val). It then can be vectorized, but the generated code is not efficient as lots of packings and unpackings are involved. But if we convert (char) abs ((int) char_val) to abs (char_val), the vectorizer will be able to generate better code. Same for short. ABS_EXPR has undefined overflow behavior. Thus, abs ((int) -128) is defined (and we also define the subsequent conversion of +128 to signed char, which ISO C makes implementation-defined not undefined), and converting to an ABS_EXPR on char would wrongly make it undefined. For such a transformation to be valid (in the absence of VRP saying that -128 isn't a possible value) you'd need a GIMPLE representation for ABS_EXPR<overflow:wrap>, as distinct from ABS_EXPR<overflow:undefined>. You don't have the option there is for some arithmetic operations of converting to a corresponding operation on unsigned types. Yes, you are right. The method I use can guarantee wrapping on overflow (either shift-xor-sub or max(x, -x)). Can I just add the condition if (flag_wrapv) before the conversion I made to prevent the undefined behavior on overflow? What HW insns you expand to is one thing, but if some GCC pass assumes that ABS_EXPR always returns a non-negative value (many do, look e.g. at tree_unary_nonnegative_warnv_p, extract_range_from_unary_expr_1, simplify_const_relational_operation, etc., you'd need to grep for all ABS_EXPR/ABS occurrences) and optimizes code based on that fact, you get wrong code because (char) abs((char) -128) is well defined. If we change the ABS_EXPR/ABS definition so that it is well defined on the most negative value of the type (resp. mode), then we lose all those optimizations; if we do that only for the char/short types, it would be quite weird, though we could keep the benefits, but at the RTL level we'd need to treat that way all the modes equal to short's mode and smaller (so, for a sizeof(short) == sizeof(int) target, even int's mode). I checked those functions and they all consider the possibility of overflow. For example, tree_unary_nonnegative_warnv_p only returns true for ABS_EXPR on integers if overflow is undefined. If the consequence of overflow is wrapping, I think converting (char) abs((int)-128) to abs(-128) (-128 has char type) is safe. Can we do it by checking flag_wrapv? I could also first remove the abs conversion content from this patch but only keep the content of expanding abs() for i386. I will submit it later. The other possibility is not to create the ABS_EXPRs of char/short anywhere, solve the vectorization issues either through tree-vect-patterns.c or as part of the vectorization type demotion/promotions, see the recent discussions for that; you'd represent the short/char abs for the vectorized loop say using the shift-xor-sub or builtin etc., and if you want to do the same thing for scalar code, you'd just have the combiner try to match some sequence. Yes, I could do it through tree-vect-patterns.c, if the abs conversion is prohibited. Currently the only reason I need the abs conversion is for vectorization. Vectorization type demotion/promotion is interesting, but I am afraid we will face the same problem there. Thank you for your comment! Cong Jakub
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Tue, Oct 22, 2013 at 8:11 PM, pins...@gmail.com wrote: Sent from my iPad On Oct 22, 2013, at 7:23 PM, Cong Hou co...@google.com wrote: This patch aims at PR58762. Currently GCC could not vectorize abs() operation for integers on x86 with only SSE2 support. For int type, the reason is that the expand on abs() is not defined for vector type. This patch defines such an expand so that abs(int) will be vectorized with only SSE2. For abs(char/short), type conversions are needed as the current abs() function/operation does not accept argument of char/short type. Therefore when we want to get the absolute value of a char_val using abs (char_val), it will be converted into abs ((int) char_val). It then can be vectorized, but the generated code is not efficient as lots of packings and unpackings are involved. But if we convert (char) abs ((int) char_val) to abs (char_val), the vectorizer will be able to generate better code. Same for short. This conversion also enables vectorizing abs(char/short) operation with PABSB and PABSW instructions in SSSE3. With only SSE2 support, I developed three methods to expand abs(char/short/int) separately: 1. For a 32-bit int value x, we can get abs (x) from (((signed) x >> (W-1)) ^ x) - ((signed) x >> (W-1)). This is better than max (x, -x), which needs bit masking. 2. For a 16-bit int value x, we can get abs (x) from max (x, -x), as SSE2 provides the PMAXSW instruction. 3. For an 8-bit int value x, we can get abs (x) from min ((unsigned char) x, (unsigned char) (-x)), as SSE2 provides the PMINUB instruction. The patch is pasted below. Please point out any problem in my patch and analysis. thanks, Cong diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 8a38316..e0f33ee 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,13 @@ +2013-10-22 Cong Hou co...@google.com + + PR target/58762 + * convert.c (convert_to_integer): Convert (char) abs ((int) char_val) + into abs (char_val). Also convert (short) abs ((int) short_val) + into abs (short_val).
I don't like this optimization in convert. I think it should be submitted separately and should be done in tree-ssa-forwprop. Yes. This patch can be split into two: one for vectorization and one for abs conversion. The reason why I put the abs conversion in convert.c is that the fabs conversion is also done there. Also I think you should have a generic (non-x86) test case for the above optimization. For vectorization I need to do it on x86 since the define_expand is only for it. But for the abs conversion, yes, I should make a generic test case. Thank you for your comments! Cong Thanks, Andrew + * config/i386/i386-protos.h (ix86_expand_sse2_absvxsi2): New function. + * config/i386/i386.c (ix86_expand_sse2_absvxsi2): New function. + * config/i386/sse.md: Add SSE2 support to abs (char/int/short). + 2013-10-14 David Malcolm dmalc...@redhat.com * dumpfile.h (gcc::dump_manager): New class, to hold state diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 3ab2f3a..e85f663 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -238,6 +238,7 @@ extern void ix86_expand_mul_widen_evenodd (rtx, rtx, rtx, bool, bool); extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool); extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx); extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx); +extern void ix86_expand_sse2_absvxsi2 (rtx, rtx); /* In i386-c.c */ extern void ix86_target_macros (void); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 02cbbbd..8050e02 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -41696,6 +41696,53 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) gen_rtx_MULT (mode, op1, op2)); } +void +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1) +{ + enum machine_mode mode = GET_MODE (op0); + rtx tmp0, tmp1; + + switch (mode) +{ + /* For 32-bit signed integer X, the best way to calculate the absolute + value of X is (((signed) X >> (W-1)) ^ X) - ((signed) X >> (W-1)). */ + case V4SImode: + tmp0 = expand_simple_binop (mode, ASHIFTRT, op1, +GEN_INT (GET_MODE_BITSIZE + (GET_MODE_INNER (mode)) - 1), +NULL, 0, OPTAB_DIRECT); + if (tmp0) + tmp1 = expand_simple_binop (mode, XOR, op1, tmp0, + NULL, 0, OPTAB_DIRECT); + if (tmp0 && tmp1) + expand_simple_binop (mode, MINUS, tmp1, tmp0, + op0, 0, OPTAB_DIRECT); + break; + + /* For 16-bit signed integer X, the best way to calculate the absolute + value of X is max (X, -X), as SSE2 provides the PMAXSW insn. */ + case V8HImode: + tmp0 = expand_unop (mode, neg_optab, op1, NULL_RTX, 0); + if (tmp0) + expand_simple_binop (mode, SMAX, op1, tmp0, op0, 0, + OPTAB_DIRECT); + break; + + /* For 8-bit signed integer X, the best way to calculate the absolute + value of X is min ((unsigned char) X, (unsigned char
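The three SSE2 expansion strategies enumerated above can be modelled in scalar C as follows (illustrative only; the patch emits the corresponding vector operations on V4SI/V8HI/V16QI registers):

```c
#include <limits.h>

/* 32-bit: ((x >> (W-1)) ^ x) - (x >> (W-1)), branch- and compare-free.  */
static int abs32 (int x)
{
  int m = x >> (sizeof (int) * CHAR_BIT - 1);   /* all-zeros or all-ones */
  return (m ^ x) - m;
}

/* 16-bit: max (x, -x), matching SSE2's PMAXSW.  */
static short abs16 (short x)
{
  short n = -x;
  return x > n ? x : n;
}

/* 8-bit: min ((unsigned char) x, (unsigned char) -x), matching PMINUB.
   Note that for x == -128 both operands are 128, so the result wraps
   back to -128 on conversion (implementation-defined, modulo on GCC),
   i.e. this expansion has wrapping rather than undefined overflow.  */
static signed char abs8 (signed char x)
{
  unsigned char a = (unsigned char) x;
  unsigned char b = (unsigned char) -x;
  return (signed char) (a < b ? a : b);
}
```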
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 23, 2013 at 12:20 AM, Uros Bizjak ubiz...@gmail.com wrote: Hello! Currently GCC could not vectorize abs() operation for integers on x86 with only SSE2 support. For int type, the reason is that the expand on abs() is not defined for vector type. This patch defines such an expand so that abs(int) will be vectorized with only SSE2. +(define_expand "abs<mode>2" + [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") + (abs:VI124_AVX2_48_AVX512F + (match_operand:VI124_AVX2_48_AVX512F 1 "register_operand")))] + "TARGET_SSE2" +{ + if (TARGET_SSE2 && !TARGET_SSSE3) +ix86_expand_sse2_absvxsi2 (operands[0], operands[1]); + else if (TARGET_SSSE3) +emit_insn (gen_rtx_SET (VOIDmode, operands[0], +gen_rtx_ABS (<MODE>mode, operands[1]))); + DONE; +}) This should be written as: (define_expand "abs<mode>2" [(set (match_operand:VI124_AVX2_48_AVX512F 0 "register_operand") (abs:VI124_AVX2_48_AVX512F (match_operand:VI124_AVX2_48_AVX512F 1 "nonimmediate_operand")))] "TARGET_SSE2" { if (!TARGET_SSSE3) { ix86_expand_sse2_absvxsi2 (operands[0], operands[1]); DONE; } }) OK. Please note that operands[1] can be a memory operand, so your expander should either handle it (this is preferred) or load the operand into a register at the beginning of the expansion. OK. I think I don't have to make any change to ix86_expand_sse2_absvxsi2(), as operands[1] is always read-only. Right? +void +ix86_expand_sse2_absvxsi2 (rtx op0, rtx op1) This function name implies SImode operands ... please just name it ix86_expand_sse2_abs. Yes, my bad. At first I only considered V4SI but later forgot to rename the function. Thank you very much! Cong Uros.
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers jos...@codesourcery.com wrote: On Tue, 22 Oct 2013, Cong Hou wrote: For abs(char/short), type conversions are needed as the current abs() function/operation does not accept argument of char/short type. Therefore when we want to get the absolute value of a char_val using abs (char_val), it will be converted into abs ((int) char_val). It then can be vectorized, but the generated code is not efficient as lots of packings and unpackings are involved. But if we convert (char) abs ((int) char_val) to abs (char_val), the vectorizer will be able to generate better code. Same for short. ABS_EXPR has undefined overflow behavior. Thus, abs ((int) -128) is defined (and we also define the subsequent conversion of +128 to signed char, which ISO C makes implementation-defined not undefined), and converting to an ABS_EXPR on char would wrongly make it undefined. For such a transformation to be valid (in the absence of VRP saying that -128 isn't a possible value) you'd need a GIMPLE representation for ABS_EXPR<overflow:wrap>, as distinct from ABS_EXPR<overflow:undefined>. You don't have the option there is for some arithmetic operations of converting to a corresponding operation on unsigned types. Yes, you are right. The method I use can guarantee wrapping on overflow (either shift-xor-sub or max(x, -x)). Can I just add the condition if (flag_wrapv) before the conversion I made to prevent the undefined behavior on overflow? Thank you! Cong -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH] Vectorizing abs(char/short/int) on x86.
I think I did not make it clear. If GCC defines that converting 128 to a char value yields the wrapped result -128, then the conversion from (char) abs ((int) char_val) to abs (char_val) is safe if we can guarantee abs (char(-128)) = -128 as well. Then the subsequent methods used to compute abs() must also guarantee wrapping on overflow. Shift-xor-sub is OK, but max(x, -x) is OK only if the negation of -128 also yields -128 (wrapping). I think that is exactly the behavior of the SSE2 operation PSUBB ([0,...,0], [x,...,x]), as PSUBB can operate on both signed and unsigned operands. thanks, Cong On Wed, Oct 23, 2013 at 9:40 PM, Cong Hou co...@google.com wrote: On Wed, Oct 23, 2013 at 8:52 AM, Joseph S. Myers jos...@codesourcery.com wrote: On Tue, 22 Oct 2013, Cong Hou wrote: For abs(char/short), type conversions are needed as the current abs() function/operation does not accept argument of char/short type. Therefore when we want to get the absolute value of a char_val using abs (char_val), it will be converted into abs ((int) char_val). It then can be vectorized, but the generated code is not efficient as lots of packings and unpackings are involved. But if we convert (char) abs ((int) char_val) to abs (char_val), the vectorizer will be able to generate better code. Same for short. ABS_EXPR has undefined overflow behavior. Thus, abs ((int) -128) is defined (and we also define the subsequent conversion of +128 to signed char, which ISO C makes implementation-defined not undefined), and converting to an ABS_EXPR on char would wrongly make it undefined. For such a transformation to be valid (in the absence of VRP saying that -128 isn't a possible value) you'd need a GIMPLE representation for ABS_EXPR<overflow:wrap>, as distinct from ABS_EXPR<overflow:undefined>. You don't have the option there is for some arithmetic operations of converting to a corresponding operation on unsigned types. Yes, you are right.
The method I use can guarantee wrapping on overflow (either shift-xor-sub or max(x, -x)). Can I just add the condition if (flag_wrapv) before the conversion I made to prevent the undefined behavior on overflow? Thank you! Cong -- Joseph S. Myers jos...@codesourcery.com
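The corner case driving this whole exchange can be seen in a few lines of C (illustrative; the truncation of 128 back to signed char is implementation-defined in ISO C, and GCC defines it as modulo reduction, i.e. wrapping):

```c
/* (char) abs ((int) char_val): abs at int width is well defined for
   -128 (yielding 128), and the narrowing back to signed char wraps
   128 to -128 on GCC.  A char-width ABS_EXPR is only an equivalent
   replacement if it too maps -128 to -128.  */
static int widened_char_abs (signed char c)
{
  int v = c;                  /* (int) char_val                       */
  int a = v < 0 ? -v : v;     /* abs ((int) char_val), defined at -128 */
  return (signed char) a;     /* truncate back: 128 wraps to -128      */
}
```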
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Jeff, thank you for installing this patch. Actually I already have the write privileges. I just came back from a trip. Thank you again! thanks, Cong On Fri, Oct 18, 2013 at 10:22 PM, Jeff Law l...@redhat.com wrote: On 10/18/13 03:56, Richard Biener wrote: On Thu, 17 Oct 2013, Cong Hou wrote: I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it seems that GCC could hoist j+1 outside of the i loop: t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype) j_25; t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1; t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4; t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12; t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14; t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1; But your suggestion is still nice as it can remove a branch and make the code more brief. I have updated the patch and also included the nested loop example in the test case. Ok if it passes bootstrap and regtest. Bootstrapped and regression tested on x86_64-unknown-linux-gnu. Installed on Cong's behalf. Cong -- if you plan on contributing regularly to GCC, please start the process for write privileges. This form should have everything you need: https://sourceware.org/cgi-bin/pdw/ps_form.cgi Jeff
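For context, the transformation the installed patch performs on the motivating case can be sketched by hand like this (illustrative before/after, not the vectorizer's actual output):

```c
/* Before: every iteration reloads *b, because a and b may alias.  */
void before (int *a, int *b)
{
  for (int i = 0; i < 10; i++)
    a[i] = *b + 1;
}

/* After versioning for alias: in the no-alias copy of the loop, *b is
   invariant, so the load (and the +1) can be hoisted to the preheader.
   Hand-written equivalent of what the versioned loop executes:  */
void after_noalias (int *a, int *b)
{
  int t = *b + 1;     /* hoisted out of the loop */
  for (int i = 0; i < 10; i++)
    a[i] = t;
}
```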
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
OK. Have done that. And this is also a patch, right? ;) thanks, Cong diff --git a/MAINTAINERS b/MAINTAINERS index 15b6cc7..a6954da 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -406,6 +406,7 @@ Fergus Henderson f...@cs.mu.oz.au Stuart Henderson shend...@gcc.gnu.org Matthew Hiller hil...@redhat.com Manfred Hollstein m...@suse.com +Cong Hou co...@google.com Falk Hueffner f...@debian.org Andrew John Hughes gnu_and...@member.fsf.org Andy Hutchinson hutchinsona...@aim.com On Mon, Oct 21, 2013 at 9:46 AM, Jeff Law l...@redhat.com wrote: On 10/21/13 10:45, Cong Hou wrote: Jeff, thank you for installing this patch. Actually I already have the write privileges. I just came back from a trip. Ah. I didn't see you in the MAINTAINERS file. Can you update that file please. Thanks, jeff
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
I tested this case with -fno-tree-loop-im and -fno-tree-pre, and it seems that GCC could hoist j+1 outside of the i loop:

t3.c:5:5: note: hoisting out of the vectorized loop: _10 = (sizetype) j_25;
t3.c:5:5: note: hoisting out of the vectorized loop: _11 = _10 + 1;
t3.c:5:5: note: hoisting out of the vectorized loop: _12 = _11 * 4;
t3.c:5:5: note: hoisting out of the vectorized loop: _14 = b_13(D) + _12;
t3.c:5:5: note: hoisting out of the vectorized loop: _15 = *_14;
t3.c:5:5: note: hoisting out of the vectorized loop: _16 = _15 + 1;

But your suggestion is still nice as it can remove a branch and make the code more brief. I have updated the patch and also included the nested loop example in the test case.

Thank you!

Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  co...@google.com
+
+	* tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+	statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com
 
 	* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  co...@google.com
+
+	* gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de
 
 	PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..6484a65
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void test1 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    a[i] = *b + 1;
+}
+
+/* A test case with nested loops.  The load of b[j+1] in the inner
+   loop should be hoisted.  */
+
+void test2 (int* a, int* b)
+{
+  int i, j;
+  for (j = 0; j < 10; ++j)
+    for (i = 0; i < 10; ++i)
+      a[i] = b[j+1] + 1;
+}
+
+/* A test case with ifcvt transformation.  */
+
+void test3 (int* a, int* b)
+{
+  int i, t;
+  for (i = 0; i < 1; ++i)
+    {
+      if (*b > 0)
+	t = *b * 2;
+      else
+	t = *b / 2;
+      a[i] = t;
+    }
+}
+
+/* A test case in which the store in the loop can be moved outside
+   in the versioned loop with alias checks.  Note this loop won't
+   be vectorized.  */
+
+void test4 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    *a += b[i];
+}
+
+/* A test case in which the load and store in the loop to b
+   can be moved outside in the versioned loop with alias checks.
+   Note this loop won't be vectorized.  */
+
+void test5 (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    {
+      *b += a[i];
+      a[i] = *b;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "hoist" 8 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..1cc563c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,73 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
 	  adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 	}
+
+  /* Extract load statements on memrefs with zero-step accesses.  */
+
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+    {
+      /* In the loop body, we iterate each statement to check if it is a load.
+	 Then we check the DR_STEP of the data reference.  If DR_STEP is zero,
+	 then we will hoist the load statement to the loop preheader.  */
+
+      basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+      int nbbs = loop->num_nodes;
+
+      for (int i = 0; i < nbbs; ++i)
+	{
+	  for (gimple_stmt_iterator si = gsi_start_bb (bbs[i]);
+	       !gsi_end_p (si);)
+	    {
+	      gimple stmt = gsi_stmt (si);
+	      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+	      struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+	      if (is_gimple_assign (stmt)
+		  && (!dr
+		      || (DR_IS_READ (dr) && integer_zerop (DR_STEP (dr)))))
+		{
+		  bool hoist = true;
+		  ssa_op_iter iter;
+		  tree var;
+
+		  /* We hoist a statement if all SSA uses in it are defined
+		     outside of the loop.  */
+		  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
+		    {
+		      gimple def = SSA_NAME_DEF_STMT (var);
+		      if (!gimple_nop_p (def)
+			  && flow_bb_inside_loop_p (loop, gimple_bb (def)))
+			{
+			  hoist = false;
+			  break;
+			}
+		    }
+
+		  if (hoist)
+		    {
+		      if (dr)
+			gimple_set_vuse (stmt, NULL);
+
+		      gsi_remove (&si, false
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Ping?

thanks,
Cong

On Mon, Oct 7, 2013 at 10:15 AM, Cong Hou co...@google.com wrote:

You are right. I am not an expert on numerical analysis, but I tested your case and it proves that the number 4 conversion is not safe. Now we have four conversions which are safe once the precision requirement is satisfied. I added a condition if (type != newtype) to remove the unsafe one, as in this case one more conversion is added, which leads to the unsafe issue. If you think this condition does not make sense please let me know. The new patch is shown below (the attached file has tabs). Thank you very much!

thanks,
Cong

Index: gcc/convert.c
===================================================================
--- gcc/convert.c	(revision 203250)
+++ gcc/convert.c	(working copy)
@@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
 	    CASE_MATHFN (COS)
 	    CASE_MATHFN (ERF)
 	    CASE_MATHFN (ERFC)
-	    CASE_MATHFN (FABS)
 	    CASE_MATHFN (LOG)
 	    CASE_MATHFN (LOG10)
 	    CASE_MATHFN (LOG2)
 	    CASE_MATHFN (LOG1P)
-	    CASE_MATHFN (LOGB)
 	    CASE_MATHFN (SIN)
-	    CASE_MATHFN (SQRT)
 	    CASE_MATHFN (TAN)
 	    CASE_MATHFN (TANH)
+	      /* The above functions are not safe to do this conversion.  */
+	      if (!flag_unsafe_math_optimizations)
+		break;
+	    CASE_MATHFN (SQRT)
+	    CASE_MATHFN (FABS)
+	    CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
 	      {
 		tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
@@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr)
 		if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
 		  newtype = TREE_TYPE (arg0);
 
+		/* We consider to convert
+
+		     (T1) sqrtT2 ((T2) exprT3)
+		   to
+		     (T1) sqrtT4 ((T4) exprT3)
+
+		   , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
+		   and T4 is NEWTYPE.  All those types are of floating point
+		   types.  T4 (NEWTYPE) should be narrower than T2 (ITYPE).
+		   This conversion is safe only if P1 >= P2*2+2, where P1 and
+		   P2 are precisions of T2 and T4.  See the following URL for
+		   a reference:
+		   http://stackoverflow.com/questions/9235456/determining-floating-point-square-root
+		   */
+		if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL)
+		    && !flag_unsafe_math_optimizations)
+		  {
+		    /* The following conversion is unsafe even if the precision
+		       condition below is satisfied:
+
+		       (float) sqrtl ((long double) double_val) ->
+		       (float) sqrt (double_val)  */
+		    if (type != newtype)
+		      break;
+
+		    int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))->p;
+		    int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))->p;
+		    if (p1 < p2 * 2 + 2)
+		      break;
+		  }
+
 		/* Be careful about integer to fp conversions.
 		   These may overflow still.  */
 		if (FLOAT_TYPE_P (TREE_TYPE (arg0))
 		    && TYPE_PRECISION (newtype) < TYPE_PRECISION (itype)
 		    && (TYPE_MODE (newtype) == TYPE_MODE (double_type_node)
 			|| TYPE_MODE (newtype) == TYPE_MODE (float_type_node)))
-{
+		  {
 		    tree fn = mathfn_built_in (newtype, fcode);
 		    if (fn)
Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 203250)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,9 @@
+2013-10-07  Cong Hou  co...@google.com
+
+	* convert.c (convert_to_real): Forbid unsafe math function
+	conversions including sin/cos/log etc.  Add precision check
+	for sqrt.
+
 2013-10-07  Bill Schmidt  wschm...@linux.vnet.ibm.com
 
 	* config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New.
Index: gcc/testsuite/ChangeLog
===================================================================
--- gcc/testsuite/ChangeLog	(revision 203250)
+++ gcc/testsuite/ChangeLog	(working copy)
@@ -1,3 +1,7 @@
+2013-10-07  Cong Hou  co...@google.com
+
+	* gcc.c-torture/execute/20030125-1.c: Update.
+
 2013-10-07  Bill Schmidt  wschm...@linux.vnet.ibm.com
 
 	* gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian.
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c	(revision 203250)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c	(working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline))
 double
 sin (double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline))
 float
 sinf (float a)
 {
-  return a;
+  abort ();
 }

On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers jos...@codesourcery.com wrote:

On Fri, 6 Sep 2013, Cong Hou wrote:

4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)

I don't believe this case is in fact safe even if precision (long double) >= precision (double) * 2 + 2 (when your patch would allow it). The result that precision (double) * 2 + 2 is sufficient for the result of rounding the long double value to double to be the same as the result of rounding once from infinite precision to double would I think also mean the same when rounding of the infinite-precision
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
On Wed, Oct 16, 2013 at 2:02 AM, Richard Biener rguent...@suse.de wrote:

On Tue, 15 Oct 2013, Cong Hou wrote:

Thank you for your reminder, Jeff! I just noticed Richard's comment. I have modified the patch according to that. The new patch is attached.

(posting patches inline is easier for review, now you have to deal with no quoting markers ;))

Comments inline.

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  co...@google.com
+
+	* tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+	statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com
 
 	* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  co...@google.com
+
+	* gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de
 
 	PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..f4fdec2 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
 	  adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 	}

Note that applying this kind of transform at this point invalidates some of the earlier analysis the vectorizer performed (namely the def-kind which now effectively gets vect_external_def from vect_internal_def). In this case it doesn't seem to cause any issues (we re-compute the def-kind every time we need it (how wasteful)).

+  /* Extract load and store statements on pointers with zero-stride
+     accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+    {
+      /* In the loop body, we iterate each statement to check if it is a load
+	 or store.  Then we check the DR_STEP of the data reference.  If
+	 DR_STEP is zero, then we will hoist the load statement to the loop
+	 preheader, and move the store statement to the loop exit.  */

We don't move the store yet. Micha has a patch pending that enables vectorization of zero-step stores.

+      for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
+	   !gsi_end_p (si);)

While technically ok now (vectorized loops contain a single basic block) please use LOOP_VINFO_BBS () to get at the vector of basic-blocks and iterate over them like other code does.

Have done it.

+	{
+	  gimple stmt = gsi_stmt (si);
+	  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+	  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+	  if (dr && integer_zerop (DR_STEP (dr)))
+	    {
+	      if (DR_IS_READ (dr))
+		{
+		  if (dump_enabled_p ())
+		    {
+		      dump_printf_loc
+			(MSG_NOTE, vect_location,
+			 "hoist the statement to outside of the loop ");

"hoisting out of the vectorized loop: "

+		      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
+		      dump_printf (MSG_NOTE, "\n");
+		    }
+
+		  gsi_remove (&si, false);
+		  gsi_insert_on_edge_immediate (loop_preheader_edge (loop), stmt);

Note that this will result in a bogus VUSE on the stmt at this point which will be only fixed because of implementation details of loop versioning. Either get the correct VUSE from the loop header virtual PHI node preheader edge (if there is none then the current VUSE is the correct one to use) or clear it.

I just cleared the VUSE since I noticed that after the vectorization pass the correct VUSE is reassigned to the load.

+		}
+	      /* TODO: We also consider vectorizing loops containing zero-step
+		 data refs as writes.  For example
Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.
I have corrected the ChangeLog format, and committed this patch. Thank you!

Cong

On Tue, Oct 15, 2013 at 6:38 AM, Richard Biener richard.guent...@gmail.com wrote:

On Sat, Sep 28, 2013 at 3:28 AM, Cong Hou co...@google.com wrote:

The current GCC vectorizer requires the following pattern as a simple reduction computation:

   loop_header:
     a1 = phi <a0, a2>
     a3 = ...
     a2 = operation (a3, a1)

But a3 can also be defined outside of the loop. For example, the following loop can benefit from vectorization but the GCC vectorizer fails to vectorize it:

int foo (int v)
{
  int s = 1;
  ++v;
  for (int i = 0; i < 10; ++i)
    s *= v;
  return s;
}

This patch relaxes the original requirement by also considering the following pattern:

   a3 = ...
   loop_header:
     a1 = phi <a0, a2>
     a2 = operation (a3, a1)

A test case is also added. The patch is tested on x86-64.

thanks,
Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 39c786e..45c1667 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2013-09-27  Cong Hou  co...@google.com
+
+	* tree-vect-loop.c: Relax the requirement of the reduction

ChangeLog format is

	* tree-vect-loop.c (vect_is_simple_reduction_1): Relax the
	requirement of the reduction.

Ok with that change.

Thanks,
Richard.

+	pattern so that one operand of the reduction operation can
+	come from outside of the loop.
+
 2013-09-25  Tom Tromey  tro...@redhat.com
 
 	* Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09644d2..90496a2 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-09-27  Cong Hou  co...@google.com
+
+	* gcc.dg/vect/vect-reduc-pattern-3.c: New test.
+
 2013-09-25  Marek Polacek  pola...@redhat.com
 
 	PR sanitizer/58413
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2871ba1..3c51c3b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, gimple phi, gimple first_stmt)
      a3 = ...
      a2 = operation (a3, a1)
 
+   or
+
+   a3 = ...
+   loop_header:
+     a1 = phi <a0, a2>
+     a2 = operation (a3, a1)
+
    such that:
    1. operation is commutative and associative and it is safe to
       change the order of the computation (if CHECK_REDUCTION is true)
@@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def2 && def2 == phi
       && (code == COND_EXPR
	  || !def1 || gimple_nop_p (def1)
+	  || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
	  || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
	      && (is_gimple_assign (def1)
		  || is_gimple_call (def1)
@@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def1 && def1 == phi
       && (code == COND_EXPR
	  || !def2 || gimple_nop_p (def2)
+	  || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
	  || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
	      && (is_gimple_assign (def2)
		  || is_gimple_call (def2)
diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
new file mode 100644
index 000..06a9416
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 10
+#define RES 1024
+
+/* A reduction pattern in which there is no data ref in
+   the loop and one operand is defined outside of the loop.  */
+
+__attribute__ ((noinline)) int
+foo (int v)
+{
+  int i;
+  int result = 1;
+
+  ++v;
+  for (i = 0; i < N; i++)
+    result *= v;
+
+  return result;
+}
+
+int
+main (void)
+{
+  int res;
+
+  check_vect ();
+
+  res = foo (1);
+  if (res != RES)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Thank you for your reminder, Jeff! I just noticed Richard's comment. I have modified the patch according to that. The new patch is attached.

thanks,
Cong

On Tue, Oct 15, 2013 at 12:33 PM, Jeff Law l...@redhat.com wrote:

On 10/14/13 17:31, Cong Hou wrote:

Any comment on this patch?

Richi replied in the BZ you opened. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

Essentially he said emit the load on the edge rather than in the block itself.

jeff

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..2637309 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-15  Cong Hou  co...@google.com
+
+	* tree-vect-loop-manip.c (vect_loop_versioning): Hoist loop invariant
+	statement that contains data refs with zero-step.
+
 2013-10-14  David Malcolm  dmalc...@redhat.com
 
 	* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..9d0f4a5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-15  Cong Hou  co...@google.com
+
+	* gcc.dg/vect/pr58508.c: New test.
+
 2013-10-14  Tobias Burnus  bur...@net-b.de
 
 	PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.  */
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 574446a..f4fdec2 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2477,6 +2477,92 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
 	  adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 	}
+
+  /* Extract load and store statements on pointers with zero-stride
+     accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+    {
+      /* In the loop body, we iterate each statement to check if it is a load
+	 or store.  Then we check the DR_STEP of the data reference.  If
+	 DR_STEP is zero, then we will hoist the load statement to the loop
+	 preheader, and move the store statement to the loop exit.  */
+
+      for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
+	   !gsi_end_p (si);)
+	{
+	  gimple stmt = gsi_stmt (si);
+	  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+	  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+	  if (dr && integer_zerop (DR_STEP (dr)))
+	    {
+	      if (DR_IS_READ (dr))
+		{
+		  if (dump_enabled_p ())
+		    {
+		      dump_printf_loc
+			(MSG_NOTE, vect_location,
+			 "hoist the statement to outside of the loop ");
+		      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, stmt, 0);
+		      dump_printf (MSG_NOTE, "\n");
+		    }
+
+		  gsi_remove (&si, false);
+		  gsi_insert_on_edge_immediate (loop_preheader_edge (loop), stmt);
+		}
+	      /* TODO: We also consider vectorizing loops containing zero-step
+		 data refs as writes.  For example:
+
+		   int a[N], *s;
+		   for (i = 0; i < N; i++)
+		     *s += a[i];
+
+		 In this case the write to *s can be also moved after the
+		 loop.  */
+
+	      continue;
+	    }
+	  else if (!dr)
+	    {
+	      bool hoist = true;
+	      for (size_t i = 0; i < gimple_num_ops (stmt); i++)
+		{
+		  tree op = gimple_op (stmt, i);
+		  if (TREE_CODE (op) == INTEGER_CST
+		      || TREE_CODE (op) == REAL_CST)
+		    continue;
+		  if (TREE_CODE (op) == SSA_NAME)
+		    {
+		      gimple def = SSA_NAME_DEF_STMT (op);
+		      if (def == stmt
+			  || gimple_nop_p (def)
+			  || !flow_bb_inside_loop_p (loop, gimple_bb (def)))
+			continue;
+		    }
+		  hoist = false;
+		  break;
+		}
+
+	      if (hoist)
+		{
+		  gsi_remove (&si, false);
+		  gsi_insert_on_edge_immediate (loop_preheader_edge
Fwd: [PATCH] Reducing number of alias checks in vectorization.
Sorry for forgetting to use plain-text mode. Resending it.

---------- Forwarded message ----------
From: Cong Hou co...@google.com
Date: Mon, Oct 14, 2013 at 3:29 PM
Subject: Re: [PATCH] Reducing number of alias checks in vectorization.
To: Richard Biener rguent...@suse.de, GCC Patches gcc-patches@gcc.gnu.org
Cc: Jakub Jelinek ja...@redhat.com

I have made a new patch for this issue according to your comments. There are several modifications to my previous patch:

1. Remove the use of STL features such as vector and sort. Use GCC's vec and qsort instead.

2. Comparisons between tree nodes are no longer based on their addresses. Use the compare_tree() function instead.

3. The function vect_create_cond_for_alias_checks() now returns the number of alias checks. If its second parameter cond_expr is NULL, then this function only calculates the number of alias checks after the merging and won't generate comparison expressions.

4. The function vect_prune_runtime_alias_test_list() now uses vect_create_cond_for_alias_checks() to get the number of alias checks.

The patch is attached as a text file. Please give me your comments on this patch. Thank you!

Cong

On Thu, Oct 3, 2013 at 2:35 PM, Cong Hou co...@google.com wrote:

Forget about this aux idea, as the segment length for one data ref can be different in different dr pairs. In my patch I created a struct as shown below:

struct dr_addr_with_seg_len
{
  data_reference *dr;
  tree basic_addr;
  tree offset;
  tree seg_len;
};

Note that basic_addr and offset can always be obtained from dr, but we need to store two segment lengths for each dr pair. It is improper to add a field to data_dependence_relation as it is defined outside of the vectorizer. We could change the type of may_alias_ddrs in loop_vec_info (to a new one combining data_dependence_relation and segment length) to include such information, but we would have to add a new type to tree-vectorizer.h which is only used in two places - still too much.

One possible solution is that we create a local struct as shown above and a new function which returns the merged alias check information. This function will be called twice: once during the analysis phase and once in the transformation phase. Then we don't have to store the merged alias check information between those two phases. The additional time cost is minimal as there will not be too many data dependent dr pairs in a loop. Any comment?

thanks,
Cong

On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou co...@google.com wrote:

I noticed that there is a struct dataref_aux defined in tree-vectorizer.h which is specific to the vectorizer pass and is stored in (void*)aux in struct data_reference. Can we add one more field segment_length to dataref_aux so that we can pass this information for merging alias checks? Then we can avoid modifying or creating other structures.

thanks,
Cong

On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou co...@google.com wrote:

On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener rguent...@suse.de wrote:

On Tue, 1 Oct 2013, Cong Hou wrote:

When alias exists between data refs in a loop, to vectorize it GCC does loop versioning and adds runtime alias checks. Basically for each pair of data refs with possible data dependence, there will be two comparisons generated to make sure there is no aliasing between them in each iteration of the vectorized loop. If there are many such data ref pairs, the number of comparisons can be very large, which is a big overhead. However, in some cases it is possible to reduce the number of those comparisons. For example, for the following loop, we can detect that b[0] and b[1] are two consecutive member accesses so that we can combine the alias check between a[0:100] & b[0] and a[0:100] & b[1] into checking a[0:100] & b[0:2]:

void foo (int* a, int* b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = b[0] + b[1];
}

Actually, the requirement of consecutive memory accesses is too strict. For the following loop, we can still combine the alias checks between a[0:100] & b[0] and a[0:100] & b[100]:

void foo (int* a, int* b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = b[0] + b[100];
}

This is because if b[0] is not in a[0:100] and b[100] is not in a[0:100], then a[0:100] cannot be between b[0] and b[100]. We only need to check that a[0:100] and b[0:101] don't overlap.

More generally, consider two pairs of data refs (a, b1) and (a, b2). Suppose addr_b1 and addr_b2 are basic addresses of data refs b1 and b2; offset_b1 and offset_b2 (offset_b1 <= offset_b2) are offsets of b1 and b2; and segment_length_a, segment_length_b1, and segment_length_b2 are segment lengths of a, b1, and b2. Then we can combine the two comparisons into one if the following condition is satisfied:

  offset_b2 - offset_b1 - segment_length_b1 < segment_length_a

This patch detects those combination opportunities to reduce the number of alias checks. It is tested
Re: [PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
Any comment on this patch?

thanks,
Cong

On Thu, Oct 3, 2013 at 3:59 PM, Cong Hou co...@google.com wrote:

During loop versioning in vectorization, the alias check guarantees that any load of a data reference with zero-step is a loop invariant, which can be hoisted outside of the loop. After hoisting the load statement, there may exist more loop invariant statements. This patch tries to find all those statements and hoists them before the loop. An example is shown below:

  for (i = 0; i < N; ++i)
    a[i] = *b + 1;

After loop versioning the loop to be vectorized is guarded by

  if (b + 1 <= a || a + N <= b)

which means there is no aliasing between *b and a[i]. The GIMPLE code of the loop body is:

  <bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  _26 = *b_8(D);          <-- loop invariant
  _27 = _26 + 1;          <-- loop invariant
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 21>;

After hoisting loop invariant statements:

  _26 = *b_8(D);
  _27 = _26 + 1;

  <bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 21>;

This patch is related to the bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

thanks,
Cong
Re: [PATCH] Relax the requirement of reduction pattern in GCC vectorizer.
Ping...

thanks,
Cong

On Wed, Oct 2, 2013 at 11:18 AM, Cong Hou co...@google.com wrote:

Ping.. Any comment on this patch?

thanks,
Cong

On Sat, Sep 28, 2013 at 9:34 AM, Xinliang David Li davi...@google.com wrote:

You can also add a test case of this form:

int foo (int t, int n, int *dst)
{
  int j = 0;
  int s = 1;
  t++;
  for (j = 0; j < n; j++)
    {
      dst[j] = t;
      s *= t;
    }
  return s;
}

where without the fix the loop vectorization is missed.

David

On Fri, Sep 27, 2013 at 6:28 PM, Cong Hou co...@google.com wrote:

The current GCC vectorizer requires the following pattern as a simple reduction computation:

   loop_header:
     a1 = phi <a0, a2>
     a3 = ...
     a2 = operation (a3, a1)

But a3 can also be defined outside of the loop. For example, the following loop can benefit from vectorization but the GCC vectorizer fails to vectorize it:

int foo (int v)
{
  int s = 1;
  ++v;
  for (int i = 0; i < 10; ++i)
    s *= v;
  return s;
}

This patch relaxes the original requirement by also considering the following pattern:

   a3 = ...
   loop_header:
     a1 = phi <a0, a2>
     a2 = operation (a3, a1)

A test case is also added. The patch is tested on x86-64.

thanks,
Cong

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 39c786e..45c1667 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2013-09-27  Cong Hou  co...@google.com
+
+	* tree-vect-loop.c: Relax the requirement of the reduction
+	pattern so that one operand of the reduction operation can
+	come from outside of the loop.
+
 2013-09-25  Tom Tromey  tro...@redhat.com
 
 	* Makefile.in (PARTITION_H, LTO_SYMTAB_H, COMMON_TARGET_DEF_H)
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 09644d2..90496a2 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-09-27  Cong Hou  co...@google.com
+
+	* gcc.dg/vect/vect-reduc-pattern-3.c: New test.
+
 2013-09-25  Marek Polacek  pola...@redhat.com
 
 	PR sanitizer/58413
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2871ba1..3c51c3b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2091,6 +2091,13 @@ vect_is_slp_reduction (loop_vec_info loop_info, gimple phi, gimple first_stmt)
      a3 = ...
      a2 = operation (a3, a1)
 
+   or
+
+   a3 = ...
+   loop_header:
+     a1 = phi <a0, a2>
+     a2 = operation (a3, a1)
+
    such that:
    1. operation is commutative and associative and it is safe to
       change the order of the computation (if CHECK_REDUCTION is true)
@@ -2451,6 +2458,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def2 && def2 == phi
       && (code == COND_EXPR
	  || !def1 || gimple_nop_p (def1)
+	  || !flow_bb_inside_loop_p (loop, gimple_bb (def1))
	  || (def1 && flow_bb_inside_loop_p (loop, gimple_bb (def1))
	      && (is_gimple_assign (def1)
		  || is_gimple_call (def1)
@@ -2469,6 +2477,7 @@ vect_is_simple_reduction_1 (loop_vec_info loop_info, gimple phi,
   if (def1 && def1 == phi
       && (code == COND_EXPR
	  || !def2 || gimple_nop_p (def2)
+	  || !flow_bb_inside_loop_p (loop, gimple_bb (def2))
	  || (def2 && flow_bb_inside_loop_p (loop, gimple_bb (def2))
	      && (is_gimple_assign (def2)
		  || is_gimple_call (def2)
diff --git gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
new file mode 100644
index 000..06a9416
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 10
+#define RES 1024
+
+/* A reduction pattern in which there is no data ref in
+   the loop and one operand is defined outside of the loop.  */
+
+__attribute__ ((noinline)) int
+foo (int v)
+{
+  int i;
+  int result = 1;
+
+  ++v;
+  for (i = 0; i < N; i++)
+    result *= v;
+
+  return result;
+}
+
+int
+main (void)
+{
+  int res;
+
+  check_vect ();
+
+  res = foo (1);
+  if (res != RES)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
You are right. I am not an expert on numerical analysis, but I tested your case and it proves the number 4 conversion is not safe. Now we have four conversions which are safe once the precision requirement is satisfied. I added a condition if (type != newtype) to remove the unsafe one, as in this case once more conversion is added which leads to the unsafe issue. If you think this condition does not make sense please let me know. The new patch is shown below (the attached file has tabs). Thank you very much! thanks, Cong Index: gcc/convert.c === --- gcc/convert.c (revision 203250) +++ gcc/convert.c (working copy) @@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr) CASE_MATHFN (COS) CASE_MATHFN (ERF) CASE_MATHFN (ERFC) - CASE_MATHFN (FABS) CASE_MATHFN (LOG) CASE_MATHFN (LOG10) CASE_MATHFN (LOG2) CASE_MATHFN (LOG1P) - CASE_MATHFN (LOGB) CASE_MATHFN (SIN) - CASE_MATHFN (SQRT) CASE_MATHFN (TAN) CASE_MATHFN (TANH) +/* The above functions are not safe to do this conversion. */ +if (!flag_unsafe_math_optimizations) + break; + CASE_MATHFN (SQRT) + CASE_MATHFN (FABS) + CASE_MATHFN (LOGB) #undef CASE_MATHFN { tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0)); @@ -155,13 +158,43 @@ convert_to_real (tree type, tree expr) if (TYPE_PRECISION (TREE_TYPE (arg0)) TYPE_PRECISION (type)) newtype = TREE_TYPE (arg0); + /* We consider to convert + + (T1) sqrtT2 ((T2) exprT3) + to + (T1) sqrtT4 ((T4) exprT3) + + , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0), + and T4 is NEWTYPE. All those types are of floating point types. + T4 (NEWTYPE) should be narrower than T2 (ITYPE). This conversion + is safe only if P1 = P2*2+2, where P1 and P2 are precisions of + T2 and T4. 
See the following URL for a reference: + http://stackoverflow.com/questions/9235456/determining-floating-point-square-root + */ + if ((fcode == BUILT_IN_SQRT || fcode == BUILT_IN_SQRTL) + !flag_unsafe_math_optimizations) + { + /* The following conversion is unsafe even the precision condition + below is satisfied: + + (float) sqrtl ((long double) double_val) - (float) sqrt (double_val) +*/ + if (type != newtype) +break; + + int p1 = REAL_MODE_FORMAT (TYPE_MODE (itype))-p; + int p2 = REAL_MODE_FORMAT (TYPE_MODE (newtype))-p; + if (p1 p2 * 2 + 2) +break; + } + /* Be careful about integer to fp conversions. These may overflow still. */ if (FLOAT_TYPE_P (TREE_TYPE (arg0)) TYPE_PRECISION (newtype) TYPE_PRECISION (itype) (TYPE_MODE (newtype) == TYPE_MODE (double_type_node) || TYPE_MODE (newtype) == TYPE_MODE (float_type_node))) -{ + { tree fn = mathfn_built_in (newtype, fcode); if (fn) Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 203250) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-10-07 Cong Hou co...@google.com + + * convert.c (convert_to_real): Forbid unsafe math function + conversions including sin/cos/log etc. Add precision check + for sqrt. + 2013-10-07 Bill Schmidt wschm...@linux.vnet.ibm.com * config/rs6000/rs6000.c (altivec_expand_vec_perm_const_le): New. Index: gcc/testsuite/ChangeLog === --- gcc/testsuite/ChangeLog (revision 203250) +++ gcc/testsuite/ChangeLog (working copy) @@ -1,3 +1,7 @@ +2013-10-07 Cong Hou co...@google.com + + * gcc.c-torture/execute/20030125-1.c: Update. + 2013-10-07 Bill Schmidt wschm...@linux.vnet.ibm.com * gcc.target/powerpc/pr43154.c: Skip for ppc64 little endian. 
Index: gcc/testsuite/gcc.c-torture/execute/20030125-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/20030125-1.c (revision 203250)
+++ gcc/testsuite/gcc.c-torture/execute/20030125-1.c (working copy)
@@ -44,11 +44,11 @@
 __attribute__ ((noinline))
 double
 sin (double a)
 {
-  abort ();
+  return a;
 }
 __attribute__ ((noinline))
 float
 sinf (float a)
 {
-  return a;
+  abort ();
 }

On Thu, Oct 3, 2013 at 5:06 PM, Joseph S. Myers jos...@codesourcery.com wrote: On Fri, 6 Sep 2013, Cong Hou wrote: 4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val) I don't believe this case is in fact safe even if precision (long double) >= precision (double) * 2 + 2 (when your patch would allow it). The result that precision (double) * 2 + 2 is sufficient for the result of rounding the long double value to double to be the same as the result of rounding once from infinite precision to double would I think also mean the same when rounding of the infinite-precision result to float happens once - that is, if instead of (float) sqrt (double_val) you have fsqrt (double_val) (fsqrt being the proposed function in draft TS 18661-1 for computing a square root of a double value
Re: [PATCH] Reducing number of alias checks in vectorization.
I noticed that there is a struct dataref_aux defined in tree-vectorizer.h which is specific to the vectorizer pass and is stored in (void*)aux in struct data_reference. Can we add one more field segment_length to dataref_aux so that we can pass this information for merging alias checks? Then we can avoid modifying or creating other structures. thanks, Cong On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou co...@google.com wrote: On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener rguent...@suse.de wrote: On Tue, 1 Oct 2013, Cong Hou wrote: When aliasing is possible between data refs in a loop, GCC performs loop versioning and adds runtime alias checks in order to vectorize it. Basically, for each pair of data refs with possible data dependence, two comparisons are generated to make sure there is no aliasing between them in any iteration of the vectorized loop. If there are many such pairs of data refs, the number of comparisons can be very large, which is a big overhead. However, in some cases it is possible to reduce the number of those comparisons. For example, for the following loop, we can detect that b[0] and b[1] are two consecutive member accesses, so that the alias check between a[0:100] and b[0] and the one between a[0:100] and b[1] can be combined into a single check between a[0:100] and b[0:2]:

void foo (int* a, int* b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = b[0] + b[1];
}

Actually, the requirement of consecutive memory accesses is too strict. For the following loop, we can still combine the alias checks between a[0:100] and b[0] and between a[0:100] and b[100]:

void foo (int* a, int* b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = b[0] + b[100];
}

This is because if b[0] is not in a[0:100] and b[100] is not in a[0:100], then a[0:100] cannot lie between b[0] and b[100]. We only need to check that a[0:100] and b[0:101] don't overlap. More generally, consider two pairs of data refs (a, b1) and (a, b2).
Suppose addr_b1 and addr_b2 are the base addresses of data refs b1 and b2; offset_b1 and offset_b2 (offset_b1 <= offset_b2) are the offsets of b1 and b2; and segment_length_a, segment_length_b1, and segment_length_b2 are the segment lengths of a, b1, and b2. Then we can combine the two comparisons into one if the following condition is satisfied:

offset_b2 - offset_b1 - segment_length_b1 < segment_length_a

This patch detects those combination opportunities to reduce the number of alias checks. It is tested on an x86-64 machine. Apart from the other comments you got (to which I agree) the patch seems to do two things, namely also:

+  /* Extract load and store statements on pointers with zero-stride
+     accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+    {

which I'd rather see in a separate patch (and done also when the loop doesn't require versioning for alias). My mistake.. I am working on those two patches at the same time and pasted that one here by mistake. I will send another patch about the hoisting topic. Also combining the alias checks in vect_create_cond_for_alias_checks is nice but doesn't properly fix the use of the vect-max-version-for-alias-checks param which currently inhibits vectorization of the HIMENO benchmark by default (and makes us look bad compared to LLVM). So I believe this merging should be done incrementally when we collect the DDRs we need to test in vect_mark_for_runtime_alias_test. I agree that the vect-max-version-for-alias-checks param should count the number of checks after the merge. However, the struct data_dependence_relation cannot record the new information produced by the merge. The new information I mentioned contains the new segment length for comparisons. This length is calculated right in the vect_create_cond_for_alias_checks() function. Since vect-max-version-for-alias-checks is used during the analysis phase, shall we move all of this (getting the segment length for each data ref and merging alias checks) from the transformation phase to the analysis phase?
If we cannot store the result properly (data_dependence_relation is not enough), shall we do it twice in both phases? I also noticed a possible bug in the function vect_same_range_drs() called by vect_prune_runtime_alias_test_list(). For the following code I get two pairs of data refs after vect_prune_runtime_alias_test_list(), but in vect_create_cond_for_alias_checks(), after detecting grouped accesses, I got two identical pairs of data refs. The consequence is that two identical alias checks are produced.

void yuv2yuyv_ref (int *d, int *src, int n)
{
  char *dest = (char *)d;
  int i;
  for (i = 0; i < n/2; i++)
    {
      dest[i*4 + 0] = (src[i*2 + 0]) >> 16;
      dest[i*4 + 1] = (src[i*2 + 1]) >> 8;
      dest[i*4 + 2] = (src[i*2 + 0]) >> 16;
      dest[i*4 + 3] = (src[i*2 + 0]) >> 0;
    }
}

I think the solution to this problem is changing

GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_i)) == GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_j)

into

STMT_VINFO_DATA_REF (vinfo_for_stmt (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_i
Re: [PATCH] Reducing number of alias checks in vectorization.
On Thu, Oct 3, 2013 at 2:06 PM, Joseph S. Myers jos...@codesourcery.com wrote: On Tue, 1 Oct 2013, Cong Hou wrote:

+#include <vector>
+#include <utility>
+#include <algorithm>
+
 #include "config.h"

Whatever the other issues about including these headers at all, any system header (C or C++) must always be included *after* config.h, as config.h may define feature test macros that are only properly effective if defined before any system headers are included, and these macros (affecting such things as the size of off_t) need to be consistent throughout GCC. OK. Actually I did meet some conflicts when I put those three C++ headers after all other includes. Thank you for the comments. Cong -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH] Reducing number of alias checks in vectorization.
Forget about this aux idea as the segment length for one data ref can be different in different dr pairs. In my patch I created a struct as shown below:

struct dr_addr_with_seg_len
{
  data_reference *dr;
  tree basic_addr;
  tree offset;
  tree seg_len;
};

Note that basic_addr and offset can always be obtained from dr, but we need to store two segment lengths for each dr pair. It is improper to add a field to data_dependence_relation as it is defined outside of the vectorizer. We could change the type of may_alias_ddrs in loop_vec_info (to a new one combining data_dependence_relation and segment length) to include such information, but then we would have to add a new type to tree-vectorizer.h which is only used in two places - still too much. One possible solution is to create a local struct as shown above and a new function which returns the merged alias check information. This function would be called twice: once during the analysis phase and once in the transformation phase. Then we don't have to store the merged alias check information between those two phases. The additional time cost is minimal as there will not be too many data-dependent dr pairs in a loop. Any comment? thanks, Cong On Thu, Oct 3, 2013 at 10:57 AM, Cong Hou co...@google.com wrote: I noticed that there is a struct dataref_aux defined in tree-vectorizer.h which is specific to the vectorizer pass and is stored in (void*)aux in struct data_reference. Can we add one more field segment_length to dataref_aux so that we can pass this information for merging alias checks? Then we can avoid modifying or creating other structures. thanks, Cong On Wed, Oct 2, 2013 at 2:34 PM, Cong Hou co...@google.com wrote: On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener rguent...@suse.de wrote: On Tue, 1 Oct 2013, Cong Hou wrote: When alias exists between data refs in a loop, to vectorize it GCC does loop versioning and adds runtime alias checks.
Basically, for each pair of data refs with possible data dependence, two comparisons are generated to make sure there is no aliasing between them in any iteration of the vectorized loop. If there are many such pairs of data refs, the number of comparisons can be very large, which is a big overhead. However, in some cases it is possible to reduce the number of those comparisons. For example, for the following loop, we can detect that b[0] and b[1] are two consecutive member accesses, so that we can combine the alias check between a[0:100] and b[0] and the one between a[0:100] and b[1] into checking a[0:100] against b[0:2]:

void foo (int* a, int* b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = b[0] + b[1];
}

Actually, the requirement of consecutive memory accesses is too strict. For the following loop, we can still combine the alias checks between a[0:100] and b[0] and between a[0:100] and b[100]:

void foo (int* a, int* b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = b[0] + b[100];
}

This is because if b[0] is not in a[0:100] and b[100] is not in a[0:100], then a[0:100] cannot lie between b[0] and b[100]. We only need to check that a[0:100] and b[0:101] don't overlap. More generally, consider two pairs of data refs (a, b1) and (a, b2). Suppose addr_b1 and addr_b2 are the base addresses of data refs b1 and b2; offset_b1 and offset_b2 (offset_b1 <= offset_b2) are the offsets of b1 and b2; and segment_length_a, segment_length_b1, and segment_length_b2 are the segment lengths of a, b1, and b2. Then we can combine the two comparisons into one if the following condition is satisfied:

offset_b2 - offset_b1 - segment_length_b1 < segment_length_a

This patch detects those combination opportunities to reduce the number of alias checks. It is tested on an x86-64 machine. Apart from the other comments you got (to which I agree) the patch seems to do two things, namely also:

+  /* Extract load and store statements on pointers with zero-stride
+     accesses.
*/ + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) +{ which I'd rather see in a separate patch (and done also when the loop doesn't require versioning for alias). My mistake.. I am working on those two patches at the same time and pasted that one also here by mistake. I will send another patch about the hoist topic. Also combining the alias checks in vect_create_cond_for_alias_checks is nice but doesn't properly fix the use of the vect-max-version-for-alias-checks param which currently inhibits vectorization of the HIMENO benchmark by default (and make us look bad compared to LLVM). So I believe this merging should be done incrementally when we collect the DDRs we need to test in vect_mark_for_runtime_alias_test. I agree that vect-max-version-for-alias-checks param should count the number of checks after the merge. However, the struct data_dependence_relation could not record the new information produced by the merge. The new information I mentioned contains the new segment length for comparisons. This length is calculated right
[PATCH] Hoist loop invariant statements containing data refs with zero-step during loop-versioning in vectorization.
During loop versioning in vectorization, the alias check guarantees that any load of a data reference with zero step is loop invariant, and so can be hoisted outside of the loop. After hoisting the load statement, there may exist more loop invariant statements. This patch tries to find all those statements and hoists them before the loop. An example is shown below:

for (i = 0; i < N; ++i)
  a[i] = *b + 1;

After loop versioning, the loop to be vectorized is guarded by

if (b + 1 <= a || a + N <= b)

which means there is no aliasing between *b and a[i]. The GIMPLE code of the loop body is:

<bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  _26 = *b_8(D);      <-- loop invariant
  _27 = _26 + 1;      <-- loop invariant
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 21>;

After hoisting loop invariant statements:

  _26 = *b_8(D);
  _27 = _26 + 1;

<bb 5>:
  # i_18 = PHI <0(4), i_29(6)>
  # ivtmp_22 = PHI <1(4), ivtmp_30(6)>
  _23 = (long unsigned int) i_18;
  _24 = _23 * 4;
  _25 = a_6(D) + _24;
  *_25 = _27;
  i_29 = i_18 + 1;
  ivtmp_30 = ivtmp_22 - 1;
  if (ivtmp_30 != 0)
    goto <bb 6>;
  else
    goto <bb 21>;

This patch is related to the bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 thanks, Cong

diff --git gcc/testsuite/gcc.dg/vect/pr58508.c gcc/testsuite/gcc.dg/vect/pr58508.c
new file mode 100644
index 000..cb22b50
--- /dev/null
+++ gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
+
+
+/* The GCC vectorizer generates loop versioning for the following loop
+   since there may exist aliasing between A and B.  The predicate checks
+   if A may alias with B across all iterations.  Then for the loop in
+   the true body, we can assert that *B is a loop invariant so that
+   we can hoist the load of *B before the loop body.
*/
+
+void foo (int* a, int* b)
+{
+  int i;
+  for (i = 0; i < 10; ++i)
+    a[i] = *b + 1;
+}
+
+
+/* { dg-final { scan-tree-dump-times "hoist" 2 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
Re: [PATCH] Fixing improper conversion from sin() to sinf() in optimization mode.
Ping... thanks, Cong On Fri, Sep 20, 2013 at 9:49 AM, Cong Hou co...@google.com wrote: Any comment or more suggestions on this patch? thanks, Cong On Mon, Sep 9, 2013 at 7:28 PM, Cong Hou co...@google.com wrote: On Mon, Sep 9, 2013 at 6:26 PM, Xinliang David Li davi...@google.com wrote: On Fri, Sep 6, 2013 at 3:24 PM, Cong Hou co...@google.com wrote: First, thank you for your detailed comments again! Then I deeply apologize for not explaining my patch properly and responding to your previous comment. I didn't understand the problem thoroughly before submitting the patch. Previously I only considered the following three conversions for sqrt():

1: (float) sqrt ((double) float_val) -> sqrtf (float_val)
2: (float) sqrtl ((long double) float_val) -> sqrtf (float_val)
3: (double) sqrtl ((long double) double_val) -> sqrt (double_val)

We have four types here: TYPE is the type to which the result of the function call is converted. ITYPE is the type of the math call expression. TREE_TYPE (arg0) is the type of the function argument (before type conversion). NEWTYPE is chosen from TYPE and TREE_TYPE (arg0) as the one with the higher precision. It will be the type of the new math call expression after conversion. For all three cases above, TYPE is always the same as NEWTYPE. That is why I only considered TYPE during the precision comparison. ITYPE can only be double_type_node or long_double_type_node depending on the type of the math function. That is why I explicitly used those two types instead of ITYPE (no correctness issue). But you are right, ITYPE is more elegant and better here. After further analysis, I found I missed two more cases. Note that we have the following conditions according to the code in convert.c:

TYPE_PRECISION (NEWTYPE) >= TYPE_PRECISION (TYPE)
TYPE_PRECISION (NEWTYPE) >= TYPE_PRECISION (TREE_TYPE (arg0))
TYPE_PRECISION (NEWTYPE) < TYPE_PRECISION (ITYPE)

The last condition comes from the fact that we only consider converting a math function call into another one with a narrower type.
Therefore we have

TYPE_PRECISION (TYPE) < TYPE_PRECISION (ITYPE)
TYPE_PRECISION (TREE_TYPE (arg0)) < TYPE_PRECISION (ITYPE)

So for sqrt(), TYPE and TREE_TYPE (arg0) can only be float, and for sqrtl(), TYPE and TREE_TYPE (arg0) can each be either float or double, with four possible combinations. Therefore we have two more conversions to consider besides the three mentioned above:

4: (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val)
5: (double) sqrtl ((long double) float_val) -> sqrt ((double) float_val)

For the first conversion here, TYPE (float) is different from NEWTYPE (double), and my previous patch doesn't handle this case. The correct way is to compare the precisions of ITYPE and NEWTYPE now. To sum up, we are converting the expression

(TYPE) sqrtITYPE ((ITYPE) expr)

to

(TYPE) sqrtNEWTYPE ((NEWTYPE) expr)

and we require PRECISION (ITYPE) >= PRECISION (NEWTYPE) * 2 + 2 to make it a safe conversion. The new patch is pasted below. I appreciate your detailed comments and analysis, and next time when I submit a patch I will be more careful about the reviewer's comments. Thank you! Cong

Index: gcc/convert.c
===================================================================
--- gcc/convert.c (revision 201891)
+++ gcc/convert.c (working copy)
@@ -135,16 +135,19 @@ convert_to_real (tree type, tree expr)
     CASE_MATHFN (COS)
     CASE_MATHFN (ERF)
     CASE_MATHFN (ERFC)
-    CASE_MATHFN (FABS)
     CASE_MATHFN (LOG)
     CASE_MATHFN (LOG10)
     CASE_MATHFN (LOG2)
     CASE_MATHFN (LOG1P)
-    CASE_MATHFN (LOGB)
     CASE_MATHFN (SIN)
-    CASE_MATHFN (SQRT)
     CASE_MATHFN (TAN)
     CASE_MATHFN (TANH)
+    /* The above functions are not safe to do this conversion.
*/
+    if (!flag_unsafe_math_optimizations)
+      break;
+    CASE_MATHFN (SQRT)
+    CASE_MATHFN (FABS)
+    CASE_MATHFN (LOGB)
 #undef CASE_MATHFN
       {
         tree arg0 = strip_float_extensions (CALL_EXPR_ARG (expr, 0));
@@ -155,6 +158,27 @@ convert_to_real (tree type, tree expr)
       if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
         newtype = TREE_TYPE (arg0);
+      /* We consider to convert
+
+             (T1) sqrtT2 ((T2) exprT3)
+         to
+             (T1) sqrtT4 ((T4) exprT3)

Should this be (T4) sqrtT4 ((T4) exprT3)? T4 is not necessarily the same as T1. For the conversion (float) sqrtl ((long double) double_val) -> (float) sqrt (double_val), T4 is double and T1 is float.

+
+         , where T1 is TYPE, T2 is ITYPE, T3 is TREE_TYPE (ARG0),
+         and T4 is NEWTYPE.

NEWTYPE is also the wider one between T1 and T3. Right. Actually this is easy to catch from the context just before this comment:

tree newtype = type;
if (TYPE_PRECISION (TREE_TYPE (arg0)) > TYPE_PRECISION (type))
  newtype = TREE_TYPE (arg0);

thanks, Cong David

All those types are of floating point types.
+         T4 (NEWTYPE) should be narrower than T2 (ITYPE
Re: [PATCH] Reducing number of alias checks in vectorization.
On Tue, Oct 1, 2013 at 11:35 PM, Jakub Jelinek ja...@redhat.com wrote: On Tue, Oct 01, 2013 at 07:12:54PM -0700, Cong Hou wrote: --- gcc/tree-vect-loop-manip.c (revision 202662) +++ gcc/tree-vect-loop-manip.c (working copy) Your mailer ate all the tabs, so the formatting of the whole patch can't be checked. I'll pay attention to this problem in my later patch submissions.

@@ -19,6 +19,10 @@ You should have received a copy of the G
 along with GCC; see the file COPYING3.  If not see
 http://www.gnu.org/licenses/.  */

+#include <vector>
+#include <utility>
+#include <algorithm>

Why? GCC has its own vec.h vectors, why don't you use those? There is even a qsort method for you in there. And for pairs, you can easily just use structs with two members as structure elements in the vector. GCC is now restructured using C++, and the STL is one of the most important parts of C++. I am new to the GCC community and more familiar with the STL (and I think allowing the STL in GCC could attract more new developers for GCC). I agree that using GCC's vec can maintain a uniform style, but the STL is just so powerful and easy to use... I just did a search in the GCC source tree and found that vector is not used yet. I will change std::vector to GCC's vec for now (and also use qsort), but I am still wondering whether one day GCC would accept the STL.

+struct dr_addr_with_seg_len
+{
+  dr_addr_with_seg_len (data_reference* d, tree addr, tree off, tree len)
+    : dr (d), basic_addr (addr), offset (off), seg_len (len) {}
+
+  data_reference* dr;

Space should be before *, not after it.

+  if (TREE_CODE (p11.offset) != INTEGER_CST
+      || TREE_CODE (p21.offset) != INTEGER_CST)
+    return p11.offset < p21.offset;

If offset isn't INTEGER_CST, you are comparing the pointer values? That is never a good idea; compilation would then depend on how, say, address space randomization randomizes the virtual address space. GCC needs to have reproducible compilations. In this scenario comparing pointers is safe. The sort is used to put together any two pairs of data refs which can be merged.
For example, if we have (a, b), (a, c), (a, b+1), then after sorting them we should have either (a, b), (a, b+1), (a, c) or (a, c), (a, b), (a, b+1). We don't care about the relative order of non-mergeable dr pairs here. So although the sorting result may vary, the final result we get should not change.

+  if (int_cst_value (p11.offset) != int_cst_value (p21.offset))
+    return int_cst_value (p11.offset) < int_cst_value (p21.offset);

This is going to ICE whenever the offsets wouldn't fit into a HOST_WIDE_INT. I'd say you just shouldn't put into the vector entries whose offset isn't host_integerp; those would never be merged with other checks, or something similar. Do you mean I should use widest_int_cst_value()? Then I will replace all uses of int_cst_value() here with it. I have also changed the type of the diff variable to HOST_WIDEST_INT. Thank you very much for your comments! Cong Jakub