[Bug tree-optimization/64365] [4.9 Regression] Predictive commoning after loop vectorization produces incorrect code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64365 --- Comment #9 from Cong Hou --- Thanks for the fix, Richard!
[Bug tree-optimization/64365] Predictive commoning after loop vectorization produces incorrect code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64365 --- Comment #1 from Cong Hou --- Ping on this bug.
[Bug tree-optimization/64365] New: Predictive commoning after loop vectorization produces incorrect code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64365

Bug ID: 64365
Summary: Predictive commoning after loop vectorization produces incorrect code.
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

Compiling the following loop with -O3 on x86-64 produces incorrect code:

    void foo(int *in)
    {
      for (int i = 14; i >= 10; i--) {
        in[i - 8] -= in[i];
        in[i - 5] += in[i] * 2;
        in[i - 4] += in[i];
      }
    }

The incorrect code first appears in the pcom pass. Note that after this loop is vectorized there is a read-after-write data dependence between the second and third statements in the loop. The correct way to get the vector for in[i - 4] in the third statement is to read the memory after the write from the second statement. However, the pcom pass actually preloads that vector before the loop. I think pcom ignores the aliasing between the memory addresses of vector types (in this case MEM[&{in[i-3] : in[i-0]}] and MEM[&{in[i-5] : in[i-1]}]).
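One way to observe a miscompile like this is to compare the optimized loop against a reference that the optimizer has to leave scalar. The sketch below is not part of the report: `foo_ref`, `check_foo`, and `LEN` are hypothetical names, and it assumes that routing every access through a `volatile` pointer is enough to keep the reference loop from being vectorized.

```c
#include <string.h>

#define LEN 16  /* hypothetical size; the loop touches in[2] .. in[14] */

/* The loop from the report; at -O3 it may be vectorized and then
   transformed by predictive commoning. */
void foo(int *in)
{
  for (int i = 14; i >= 10; i--) {
    in[i - 8] -= in[i];
    in[i - 5] += in[i] * 2;
    in[i - 4] += in[i];
  }
}

/* Hypothetical reference: the same statements, but every access is
   volatile, so each load and store happens in program order. */
void foo_ref(volatile int *in)
{
  for (int i = 14; i >= 10; i--) {
    in[i - 8] -= in[i];
    in[i - 5] += in[i] * 2;
    in[i - 4] += in[i];
  }
}

/* Returns 1 when both versions leave the array in the same state. */
int check_foo(void)
{
  int a[LEN], b[LEN];
  for (int i = 0; i < LEN; i++)
    a[i] = b[i] = i + 1;
  foo(a);
  foo_ref(b);
  return memcmp(a, b, sizeof a) == 0;
}
```

With a correct compiler `check_foo()` should return 1 at every optimization level; a return value of 0 at -O3 would point at the kind of problem described here.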
[Bug tree-optimization/63530] GCC generates incorrect aligned store on ARM after the loop is unrolled.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63530 --- Comment #2 from Cong Hou ---

This issue can also be reproduced on x86_64. Compile the following code (assume the file name is t.c) with these options:

    -O2 -ftree-vectorize t.c -fdump-tree-all-alias

    #include <stdlib.h>

    typedef struct {
      unsigned char map[256];
      int i;
    } A, *AP;

    AP foo(int n)
    {
      AP b = malloc(sizeof(A));
      int i;
      for (i = n; i < 256; i++)
        b->map[i] = i;
      return b;
    }

Then in t.c.116t.vect we can find this statement:

    # ALIGN = 8, MISALIGN = 0
    vectp_b.15_47 = b_5 + _48;

Here b_5 is obtained from malloc, which can be 8-byte aligned, but _48 comes from the input parameter n, so the alignment of vectp_b.15_47 should be unknown instead of 8. I suspect the ptr_info_def object of vectp_b.15_47 is just copied from that of b_5, which is incorrect.
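To make the alignment argument concrete, here is a minimal sketch that measures the misalignment of the store address for different values of n. `misalign_of` and `start_misalign` are hypothetical helpers, the struct is forced to 8-byte alignment to stand in for the malloc result, and the code assumes that converting a pointer to `uintptr_t` yields a plain byte address, which holds on common targets but is implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the byte offset of p from the previous multiple of align
   (align must be a power of two); 0 means p is aligned. */
size_t misalign_of(const void *p, size_t align)
{
  return (size_t)((uintptr_t)p & (align - 1));
}

/* Hypothetical stand-in for the struct in the report, forced to 8-byte
   alignment so that its base address plays the role of b_5. */
typedef struct {
  _Alignas(8) unsigned char map[256];
  int i;
} A;

/* Misalignment of &b->map[n] with respect to 8 bytes. */
size_t start_misalign(const A *b, int n)
{
  return misalign_of(&b->map[n], 8);
}
```

Even with an 8-byte aligned base, `start_misalign(b, n)` is simply `n % 8`, so the alignment of the first store depends entirely on the runtime value of n.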
[Bug tree-optimization/63530] New: GCC generates incorrect aligned store on ARM after the loop is unrolled.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63530

Bug ID: 63530
Summary: GCC generates incorrect aligned store on ARM after the loop is unrolled.
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

Created attachment 33710 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33710&action=edit
assembly

When the code shown below is compiled using GCC 5.0 for ARM with the following options:

    -O2 -ftree-vectorize -march=armv7-a -mfpu=neon -funroll-loops --param=max-completely-peeled-insns=400

    // The code:
    typedef struct {
      unsigned char map[256];
      int i;
    } A, *AP;

    void* calloc(int, int);

    AP foo(int n)
    {
      AP b = calloc(1, sizeof(A));
      int i;
      for (i = n; i < 256; i++)
        b->map[i] = i;
      return b;
    }

an instruction

    vst1.64 {d0-d1}, [r2:64]

is generated, which is an aligned store with an 8-byte alignment requirement. However, this requirement cannot be satisfied, because the loop is not peeled for alignment and the start address in the array is unknown at compile time. I have attached the generated assembly code here.
[Bug c++/61507] New: GCC does not compile function with parameter pack.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61507 Bug ID: 61507 Summary: GCC does not compile function with parameter pack. Product: gcc Version: 4.10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: congh at google dot com GCC fails to compile the following code: struct A { void foo(const int &); void foo(float); }; template void bar(void (A::*memfun)(Args...), Args... args); void go(const int& i) { bar(&A::foo, i); } The error message is shown below: t.C:10:30: error: no matching function for call to ‘bar(, const int&)’ bar(&A::foo, i); ^ t.C:7:6: note: candidate: template void bar(void (A::*)(Args ...), Args ...) void bar(void (A::*memfun)(Args...), Args... args); ^ t.C:7:6: note: template argument deduction/substitution failed: t.C:10:30: note: inconsistent parameter pack deduction with ‘const int&’ and ‘int’ bar(&A::foo, i); As the type is explicitly specified, why does GCC still try to "deduce" it?
[Bug tree-optimization/60896] [4.10 Regression] ICE: in vect_get_vec_def_for_operand, at tree-vect-stmts.c:1449
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60896 --- Comment #3 from Cong Hou ---

Created attachment 32668 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32668&action=edit
The patch to fix PR60896

The cause of this issue is that the statements in PATTERN_DEF_SEQ of a previously recognized widen-mult pattern are not forwarded to the dot-product pattern that is recognized later. I have created a patch to fix this.

Another issue is that the statements in PATTERN_DEF_SEQ are assigned the def type of the pattern statement. This is incorrect for a reduction pattern statement: in that case all statements in PATTERN_DEF_SEQ become vect_reduction_def, and none of them will be vectorized later. The def type of a statement in PATTERN_DEF_SEQ should always be vect_internal_def.

This patch will also be submitted to gcc-patches.
[Bug testsuite/60773] [4.9 Regression] FAIL: gcc.dg/vect/pr60656.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60773 --- Comment #5 from Cong Hou ---

Hi Jakub

Thank you very much for the commit!

thanks,
Cong

On Wed, Apr 9, 2014 at 4:39 AM, jakub at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60773
>
> Jakub Jelinek changed:
>
> What|Removed |Added
>
> Status|UNCONFIRMED |RESOLVED
> CC||jakub at gcc dot gnu.org
> Resolution|--- |FIXED
>
> --- Comment #4 from Jakub Jelinek ---
> I went ahead and committed the fix.
[Bug testsuite/60773] [4.9 Regression] FAIL: gcc.dg/vect/pr60656.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60773 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #2 from Cong Hou --- This is my bad. I have created a new patch as below to fix this issue. Another email is sent to gcc-patches also. diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index 414a745..ea860e7 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,11 @@ +2014-04-07 Cong Hou + +PR testsuite/60773 +* testsuite/lib/target-supports.exp: +Add check_effective_target_vect_widen_mult_si_to_di_pattern. +* gcc.dg/vect/pr60656.c: Update the test by checking if the targets +vect_widen_mult_si_to_di_pattern and vect_long are supported. + 2014-03-28 Cong Hou PR tree-optimization/60656 diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c index ebaab62..b80e008 100644 --- a/gcc/testsuite/gcc.dg/vect/pr60656.c +++ b/gcc/testsuite/gcc.dg/vect/pr60656.c @@ -1,5 +1,7 @@ /* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_long } */ +#include #include "tree-vect.h" __attribute__ ((noinline)) long @@ -12,7 +14,7 @@ foo () for(i = 0; i < 4; ++i) { long P = v[i]; - s += P*P*P; + s += P * P * P; } return s; } @@ -27,7 +29,7 @@ bar () for(i = 0; i < 4; ++i) { long P = v[i]; - s += P*P*P; + s += P * P * P; __asm__ volatile (""); } return s; @@ -35,11 +37,12 @@ bar () int main() { + check_vect (); + if (foo () != bar ()) abort (); return 0; } -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_si_to_di_pattern } } } */ /* { dg-final { cleanup-tree-dump "vect" } } */ - diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index bee8471..6d9d689 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -3732,6 +3732,27 @@ proc 
check_effective_target_vect_widen_mult_hi_to_si_pattern { } { } # Return 1 if the target plus current options supports a vector +# widening multiplication of *int* args into *long* result, 0 otherwise. +# +# This won't change for different subtargets so cache the result. + +proc check_effective_target_vect_widen_mult_si_to_di_pattern { } { +global et_vect_widen_mult_si_to_di_pattern + +if [info exists et_vect_widen_mult_si_to_di_pattern_saved] { +verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: using cached result" 2 +} else { +if {[istarget ia64-*-*] + || [istarget i?86-*-*] + || [istarget x86_64-*-*] } { +set et_vect_widen_mult_si_to_di_pattern_saved 1 +} +} +verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: returning $et_vect_widen_mult_si_to_di_pattern_saved" 2 +return $et_vect_widen_mult_si_to_di_pattern_saved +} + +# Return 1 if the target plus current options supports a vector # widening shift, 0 otherwise. # # This won't change for different subtargets so cache the result.
[Bug tree-optimization/60656] [4.8/4.9 regression] x86 vectorization produces wrong code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60656 --- Comment #7 from Cong Hou --- Yes, will do it. Thank you a lot!
[Bug tree-optimization/60656] [4.8/4.9 regression] x86 vectorization produces wrong code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60656 --- Comment #4 from Cong Hou --- Yes, there is a quick fix: we can check if the def with vect_used_by_reduction is immediately used by a reduction stmt. After all, it seems that supportable_widening_operation() is the only place that takes advantage of this "the element order doesn't matter" feature. diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index 70fb411..7442d0c 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -7827,7 +7827,16 @@ supportable_widening_operation (enum tree_code code, gimple stmt, stmt, vectype_out, vectype_in, code1, code2, multi_step_cvt, interm_types)) - return true; +{ + tree lhs = gimple_assign_lhs (stmt); + use_operand_p dummy; + gimple use_stmt; + stmt_vec_info use_stmt_info = NULL; + if (single_imm_use (lhs, &dummy, &use_stmt) + && (use_stmt_info = vinfo_for_stmt (use_stmt)) + && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def) +return true; +} c1 = VEC_WIDEN_MULT_LO_EXPR; c2 = VEC_WIDEN_MULT_HI_EXPR; break;
[Bug tree-optimization/60656] [4.8/4.9 regression] x86 vectorization produces wrong code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60656 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #2 from Cong Hou ---

This bug is caused by an optimization in the GCC vectorizer that is not implemented properly. When a reduction operation is vectorized, the order of elements in the vectors directly used in the reduction does not matter, and in some cases the vectorizer can generate less code based on this fact. GCC assigns a property named "vect_used_by_reduction" to all vectors participating in reductions. However, vectors that are only indirectly used in a reduction also get this property. For example, consider the following three statements (all operands are vectors):

    a = b op1 c;
    d = a op2 e;
    s1 = s0 op3 d;

Assume the last statement is the reduction; then a, b, c, d, and e all have the property "vect_used_by_reduction". However, if op2 is different from op3, then a's element order can affect the final result. GCC does not check this.
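A small scalar model may help show why the element order of a matters only when op2 differs from op3. The sketch below is an illustration under assumed values, not vectorizer code: the function names are hypothetical, op3 is taken to be addition, and "permuted" stands for any reordering of a's elements that a cheaper widening sequence might produce.

```c
#define N 4

/* op2 = multiply, op3 = add: s = sum over i of a[i] * e[i].
   Here a's element order must match e's. */
long reduce_mul_then_add(const long a[N], const long e[N])
{
  long s = 0;
  for (int i = 0; i < N; i++)
    s += a[i] * e[i];
  return s;
}

/* op2 = op3 = add: s = sum over i of (a[i] + e[i]).
   Any permutation of a gives the same total. */
long reduce_add_then_add(const long a[N], const long e[N])
{
  long s = 0;
  for (int i = 0; i < N; i++)
    s += a[i] + e[i];
  return s;
}
```

With a = {1, 2, 3, 4}, a permuted copy {1, 3, 2, 4}, and e = {1, 10, 100, 1000}, the add-then-add reduction gives the same sum for both orders, while the multiply-then-add reduction gives 4321 for the original order and 4231 for the permuted one.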
[Bug tree-optimization/60505] Warning caused by GCC vectorizer.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60505 --- Comment #1 from Cong Hou --- Google ref: b/13403465
[Bug tree-optimization/60505] New: Warning caused by GCC vectorizer.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60505

Bug ID: 60505
Summary: Warning caused by GCC vectorizer.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

Compiling the code below fails with the options "-Wall -Werror -O2 -ftree-loop-vectorize". The reason is that the epilogue generated by the vectorizer tries to access memory outside of ovec[16], and the vrp pass emits the warning "array subscript is above array bounds" for the access to ovec[i]. The vectorizer should not generate the epilogue for this loop.

    void foo(char *in, char *out, int num)
    {
      int i;
      unsigned char ovec[16] = {0};
      for(i=0; i < num ; ++i)
        out[i] = (ovec[i] = in[i]);
      out[num] = ovec[num/2];
    }
[Bug target/58762] [missed optimization] Vectorizing abs(int).
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58762 Cong Hou changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Cong Hou --- (In reply to Cong Hou from comment #3) > Author: congh > Date: Thu Oct 31 00:50:47 2013 > New Revision: 204241 > > URL: http://gcc.gnu.org/viewcvs?rev=204241&root=gcc&view=rev > Log: > 2013-10-30 Cong Hou > > Backport from mainline: > 2013-10-30 Cong Hou > > PR target/58762 > * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function. > * config/i386/i386.c (ix86_expand_sse2_abs): New function. > * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int). > > > Modified: > branches/google/gcc-4_8/gcc/ChangeLog > branches/google/gcc-4_8/gcc/config/i386/i386-protos.h > branches/google/gcc-4_8/gcc/config/i386/i386.c > branches/google/gcc-4_8/gcc/config/i386/sse.md
[Bug tree-optimization/57512] Vectorizer: cannot handle accumulation loop of signed char type
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57512 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #2 from Cong Hou --- Together with the phi function, consider the following gimple code: loop: # sum_phi = phi (sum_signed, sum_init_signed); sum_temp = (short unsigned int) sum_phi; sum_unsigned = a + sum_temp; sum_signed = (short int) sum_unsigned; Can we transform the above code to the following one? sum_init_unsigned = (unsigned short int) sum_init_signed; loop: # sum_phi = phi (sum_unsigned, sum_init_unsigned); sum_unsigned = a + sum_phi; sum_signed = (short int) sum_unsigned; This transformation should let the vectorizer detect the reduction pattern.
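The proposed transformation can be checked at the source level with a small sketch. The two functions below are hypothetical C renderings of the two GIMPLE shapes above, using `short` as in the GIMPLE; they assume that converting an out-of-range value to `short` wraps modulo 2^16, which is implementation-defined in C but is how GCC behaves.

```c
/* Shape before the transformation: the accumulator carried around the
   loop is signed, so every iteration converts to unsigned and back. */
short sum_signed_acc(const unsigned short *a, int n, short init)
{
  short sum = init;
  for (int i = 0; i < n; i++) {
    unsigned short tmp = (unsigned short)sum;   /* sum_temp     */
    unsigned short u = (unsigned short)(a[i] + tmp); /* sum_unsigned */
    sum = (short)u;                              /* sum_signed   */
  }
  return sum;
}

/* Shape after the transformation: the accumulator stays unsigned and is
   converted to signed only when the value is needed. */
short sum_unsigned_acc(const unsigned short *a, int n, short init)
{
  unsigned short sum = (unsigned short)init;     /* sum_init_unsigned */
  for (int i = 0; i < n; i++)
    sum = (unsigned short)(a[i] + sum);
  return (short)sum;
}
```

Because unsigned arithmetic wraps, both functions should return the same value for any input; for example, with a = {40000, 40000, 3} and init = -5 both yield 14462 under the stated assumption.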
[Bug tree-optimization/59006] [4.9 Regression] internal compiler error: in vect_transform_stmt, at tree-vect-stmts.c:5963
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59006 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #5 from Cong Hou --- Hoisting all vectorized statements may not be the best solution (some loads may not be necessary outside of the loop), but I think it works and can solve the current issues. Richard, are you working on this? If you'd like I could also make a patch with this idea. thanks, Cong
[Bug tree-optimization/56902] Fails to SLP with mismatched +/- and negatable constants
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56902 --- Comment #3 from Cong Hou --- How do you generate the final operations in the vectorized code? I just submitted a patch for this issue. The patch supports non-isomorphic operations with the restriction that all operations on even elements, and all operations on odd elements, are still isomorphic. Please comment on this patch. Thank you! Cong
[Bug c++/58963] Does C++ need flag_complex_method = 2?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58963 --- Comment #3 from Cong Hou --- Suppose there is a third-party complex library that is written in the same way as <complex>. GCC could not recognize its type as a complex type, and would not use builtin calls to calculate multiplication and division. So why should there be a difference between using the third-party complex library and using the standard library? After all, <complex> is all written in source code. <complex> is not the same as _Complex in C99. If we can use _Complex in C++, that is fine. But C does not have <complex>: we will not run into the situation where the same file t.c is built with both gcc and g++ and the g++ build is faster. gcc cannot recognize <complex>.
[Bug tree-optimization/59050] [4.9 Regression] ICE: tree check: expected integer_cst, have nop_expr in tree_int_cst_lt, at tree.c:7083
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050 Cong Hou changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Cong Hou --- (In reply to congh from comment #4) > Author: congh > Date: Mon Nov 11 19:03:39 2013 > New Revision: 204683 > > URL: http://gcc.gnu.org/viewcvs?rev=204683&root=gcc&view=rev > Log: > 2013-11-11 Cong Hou > > PR tree-optimization/59050 > * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix. > > > Modified: > trunk/gcc/ChangeLog > trunk/gcc/tree-vect-data-refs.c
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 58508, which changed state. Bug 58508 Summary: [Missed-Optimization] Redundant vector load of "actual" loop invariant in loop body. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/58508] [Missed-Optimization] Redundant vector load of "actual" loop invariant in loop body.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 Cong Hou changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #9 from Cong Hou --- (In reply to congh from comment #8) > Author: congh > Date: Fri Nov 8 18:44:46 2013 > New Revision: 204590 > > URL: http://gcc.gnu.org/viewcvs?rev=204590&root=gcc&view=rev > Log: > 2013-11-08 Cong Hou > > PR tree-optimization/58508 > * gcc.dg/vect/pr58508.c: Update. > > > Modified: > trunk/gcc/testsuite/ChangeLog > trunk/gcc/testsuite/gcc.dg/vect/pr58508.c
[Bug tree-optimization/56902] Fails to SLP with mismatched +/- and negatable constants
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56902 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #1 from Cong Hou --- I just made a patch which supports limited non-isomorphic operations (operations on even/odd elements are still isomorphic) for SLP. Then the three loops you listed can be vectorized using SLP by using new VEC_ADDSUB_EXPR or VEC_SUBADD_EXPR. For x86, SSE3 provides ADDSUBPD/ADDSUBPS instructions which can do the job, but I also emulated them for SSE (use mask to negate the even/odd elements and then add). I think we will need to support more general non-isomorphic operations, which is more difficult and challenging. But I think the limited support in this patch is also useful at this time. I will send the patch later.
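The mask-based emulation mentioned above can be sketched in scalar C. This is an illustration only, not the patch: `xor_sign` and `addsub_emulated` are hypothetical names, the lane convention follows the ADDSUBPS instruction (subtract in even lanes, add in odd lanes), and the code assumes `float` is a 32-bit IEEE-754 type whose top bit is the sign.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4

/* XORs the bit pattern of x with mask; a mask of 0x80000000 flips the
   sign bit, a mask of 0 leaves x unchanged. */
static float xor_sign(float x, uint32_t mask)
{
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);
  bits ^= mask;
  memcpy(&x, &bits, sizeof x);
  return x;
}

/* Emulates one 4-lane add/sub: negate b in the even lanes with a
   constant mask, then do a plain (isomorphic) add in every lane. */
void addsub_emulated(const float a[LANES], const float b[LANES],
                     float out[LANES])
{
  static const uint32_t sign_mask[LANES] = {
    0x80000000u, 0, 0x80000000u, 0
  };
  for (int i = 0; i < LANES; i++)
    out[i] = a[i] + xor_sign(b[i], sign_mask[i]);
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40} this produces {-9, 22, -27, 44}. In vector code the XOR and the add would each be a single instruction over the whole register, which is why the emulation stays cheap on plain SSE.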
[Bug tree-optimization/56717] Enhance Dot-product pattern recognition to avoid mult widening.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56717 --- Comment #2 from Cong Hou --- I examined the GCC generated code, and found the main problem is that the load of 'scale' (rhs operand of >>) to an xmm register is in the loop body, which could be moved outside. This happened during rtl-reload pass. For the following code, the load to scale is still outside of the loop body. void foo(short* a, short scale, int n) { int i; for (i=0; i> scale; } But for your code here, it is not. I suspect there may exist some issue in that pass. By the way, from my test it turns out that using PMADDWD is no faster than the way used by GCC now.
[Bug tree-optimization/56717] Enhance Dot-product pattern recognition to avoid mult widening.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56717 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #1 from Cong Hou --- The approach ICC uses is not related to dot-product. It just finds a smart way to implement widen-mult (s16 to s32) using PMADDWD. I will try to make a patch for this issue. thanks, Cong
[Bug c++/58963] Does C++ need flag_complex_method = 2?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58963 --- Comment #1 from Cong Hou --- Any comment on this topic? thanks, Cong
[Bug tree-optimization/56764] vect_prune_runtime_alias_test_list not smart enough
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56764 Cong Hou changed: What|Removed |Added CC||congh at google dot com --- Comment #2 from Cong Hou --- I have made a patch on this issue. However, I don't think the example here is proper. Say z1 == &(x[0][4]) (assume VF=4). Then after unrolling the loop for 4 times, there is still no data dependence that prevents vectorization. I think a better example is like the one shown below: __attribute__((noinline, noclone)) void foo (float x[3][32], float y1, float y2, float y3, float *z1, float *z2, float *z3) { int i; for (i = 0; i < 16; i++) { z1[i] = -y1 * x[0][i*2]; z2[i] = -y2 * x[1][i*2]; z3[i] = -y3 * x[2][i*2]; } } Here we have to make sure z1/z2/z3 does not alias with x across the whole range being traversed. Then we could merge the alias checks between z1 and &x[0][0:32]/&x[1][0:32]/&x[2][0:32] into one.
[Bug c++/58963] New: Does C++ need flag_complex_method = 2?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58963

Bug ID: 58963
Summary: Does C++ need flag_complex_method = 2?
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html, a builtin function is used to perform complex multiplication and division. This is done to comply with the C99 standard, but I am wondering if C++ also needs it. There is no complex keyword in C++, and the C++ standard says nothing about the behavior of operations on complex types. The <complex> header file is written entirely in source code, including complex multiplication and division. GCC should not do too much for these operations by using builtin calls by default (we can also set -fcx-limited-range to prevent GCC from doing this), because the calls have a big impact on performance (let alone that there may be vectorization opportunities). So I propose not setting flag_complex_method to 2 for C++. Any comment?

thanks,
Cong
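For reference, the source-level multiplication the report has in mind is the textbook formula with no recovery step for NaN or infinite results. The sketch below uses a hypothetical struct `cplx` and function `cplx_mul` rather than any real library type, and is only meant to show how little work the plain formula needs compared with a library call.

```c
/* Hypothetical stand-in for a complex type written in source code. */
typedef struct {
  double re;
  double im;
} cplx;

/* Textbook product (a+bi)(c+di) = (ac-bd) + (ad+bc)i.  There is no
   check for NaN or infinite results, so this is four multiplies and
   two adds that an optimizer can inline and vectorize freely. */
cplx cplx_mul(cplx x, cplx y)
{
  cplx r;
  r.re = x.re * y.re - x.im * y.im;
  r.im = x.re * y.im + x.im * y.re;
  return r;
}
```

As a usage example, multiplying 1+2i by 3+4i with `cplx_mul` gives -5+10i.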
[Bug tree-optimization/58915] [missed optimization] GCC fails to get the loop bound for some loops.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58915 --- Comment #2 from Cong Hou --- I am afraid that get_range_info () has little use here. The value range we care about may only exist under specific conditions and is hence flow sensitive. For example, we may need the value range of n in the if body: if (n > 0) if (n < 4) /* use of n */ However, n does not have a new name under the condition n>0 && n<4, making it impossible to get the range (0, 4) from the SSA_NAME of n.
[Bug tree-optimization/58915] New: [missed optimization] GCC fails to get the loop bound for some loops.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58915

Bug ID: 58915
Summary: [missed optimization] GCC fails to get the loop bound for some loops.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

Getting the correct loop upper bound is important for some optimizations. GCC tries to get this bound by calling bound_difference() in tree-ssa-loop-niters.c, where GCC finds all control-dependent predicates of the loop and attempts to extract bound information from each predicate. However, GCC fails to get the bound for some loops. Here is such an example:

    unsigned int i;
    if (i > 0) {
      ...
      if (i < 4) {
        do {
          ...
          --i;
        } while (i > 0);
      }
    }

Clearly the upper bound is 3, but GCC cannot derive it for this loop. The reason is that GCC checks i<4 (where i could be zero) and i>0 separately, and the upper bound cannot be calculated from either condition alone. The two conditions cannot always be combined into one, as there may be other statements between them. One possible solution is to let GCC collect all conditions first and then merge them before calculating the upper bound. Any comments?
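The bound claimed above can be confirmed by brute force. In the sketch below `count_iterations` and `max_iterations` are hypothetical helpers; the elided statements from the report are left out, and "bound" is taken to mean the number of times the loop body runs.

```c
/* Runs the loop shape from the report for a given start value and
   returns how many times the body executed. */
unsigned int count_iterations(unsigned int i)
{
  unsigned int count = 0;
  if (i > 0) {
    if (i < 4) {
      do {
        ++count;
        --i;
      } while (i > 0);
    }
  }
  return count;
}

/* Largest body count over all start values in [0, limit). */
unsigned int max_iterations(unsigned int limit)
{
  unsigned int max = 0;
  for (unsigned int i = 0; i < limit; i++) {
    unsigned int c = count_iterations(i);
    if (c > max)
      max = c;
  }
  return max;
}
```

Only start values 1, 2, and 3 reach the loop, so `max_iterations` returns 3 for any limit above 3. That value follows only from the two guards taken together, which is the point of the report.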
[Bug tree-optimization/58508] [Missed-Optimization] Redundant vector load of "actual" loop invariant in loop body.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 --- Comment #7 from Cong Hou ---

OK. I made a new patch to fix this problem. Waiting to be approved.

thanks,
Cong

diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 9d0f4a5..3d9916d 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29 Cong Hou
+
+ * gcc.dg/vect/pr58508.c: Update.
+
 2013-10-15 Cong Hou

 * gcc.dg/vect/pr58508.c: New test.
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
index 6484a65..fff7a04 100644
--- a/gcc/testsuite/gcc.dg/vect/pr58508.c
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -1,3 +1,4 @@
+/* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
 /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */

On Tue, Oct 29, 2013 at 6:50 AM, bernd.edlinger at hotmail dot de wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508
>
> --- Comment #6 from Bernd Edlinger ---
> (In reply to Cong Hou from comment #5)
>> I guess I should add
>>
>> /* { dg-require-effective-target vect_int } */
>>
>> to the test case. Is that right?
>
> Yes.
[Bug tree-optimization/58508] [Missed-Optimization] Redundant vector load of "actual" loop invariant in loop body.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508 --- Comment #5 from Cong Hou --- I guess I should add /* { dg-require-effective-target vect_int } */ to the test case. Is that right?
[Bug tree-optimization/58728] [missed optimization] == or != comparisons may affect range test optimization.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728 --- Comment #1 from Cong Hou --- Any comment on this? thanks, Cong
[Bug tree-optimization/58728] New: [missed optimization] == or != comparisons may affect range test optimization.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728

Bug ID: 58728
Summary: [missed optimization] == or != comparisons may affect range test optimization.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

Created attachment 31002 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31002&action=edit
Patch

Look at the following code:

    int foo(unsigned int n)
    {
      if (n != 0)
      if (n != 1)
      if (n != 2)
      if (n != 3)
      if (n != 4)
        return ++n;
      return n;
    }

Those five comparisons should be mergeable into one during range test optimization, but they are not. The reason is that GCC checks the phi args of n after the branch to make sure that the false edges of two neighboring ifs define the same phi arg at the join node (which guarantees that the merge is free of side effects). However, the "vrp" pass has replaced the phi arg with the identical value deduced from the == or != comparisons, which prevents the range test optimization. The same thing happens in the if-combine pass. I made a patch for this issue, which is attached here.
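The merged form that range test optimization is expected to reach can be written by hand and compared with the original chain. In this sketch `foo_chain` mirrors the function from the report, while `foo_merged` and `agree_up_to` are hypothetical names; the single test `n > 4` is one way to express "n is none of 0..4" for an unsigned value.

```c
/* The chain of != tests from the report. */
int foo_chain(unsigned int n)
{
  if (n != 0)
    if (n != 1)
      if (n != 2)
        if (n != 3)
          if (n != 4)
            return ++n;
  return n;
}

/* Hand-merged form: for unsigned n, excluding 0..4 is one range test. */
int foo_merged(unsigned int n)
{
  if (n > 4)
    return ++n;
  return n;
}

/* Returns 1 if both forms agree for every n in [0, limit). */
int agree_up_to(unsigned int limit)
{
  for (unsigned int n = 0; n < limit; n++)
    if (foo_chain(n) != foo_merged(n))
      return 0;
  return 1;
}
```

Both forms return n unchanged for 0..4 and n + 1 otherwise, so `agree_up_to` returns 1 for any limit small enough that n + 1 still fits in an int.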
[Bug tree-optimization/58686] vect_get_loop_niters() fails for some loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58686 --- Comment #2 from Cong Hou --- I think this issue is more like a missed optimization. If the iteration number can be calculated as a constant value at compile time, then the function assert_loop_rolls_lt() won't be called due to an early exit (specifically in the function number_of_iterations_lt() at the call to number_of_iterations_lt_to_ne()). That is why I could not craft a testcase showing miscompile. A better test case is shown below: #define N 4 void foo(int* a, unsigned int i) { int j = 0; do { a[j++] = 0; i -= 4; } while (i >= N); } Compile it with -O3 and the produced result is using __builtin_memset() as the niter can be calculated. But if the value of N is replaced by others like 3 or 5, GCC won't optimize this loop into __builtin_memset() any more.
[Bug tree-optimization/58686] New: [BUG] vect_get_loop_niters() cound not get the correct result for some loops.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58686 Bug ID: 58686 Summary: [BUG] vect_get_loop_niters() cound not get the correct result for some loops. Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: congh at google dot com Look at the following loop: unsigned int t = ...; do { ... t -= 4; } while (t >= 5); When I tried to get the iteration number of this loop as an expression using vect_get_loop_niters(), it gave me the result "scev_not_known". If I changed the type of t into signed int, then I can get the result as below: t > 4 ? ((unsigned int) t + 4294967291) / 4 : 0 But even when t is unsigned, we should still get the result as: t != 4 ? (t + 4294967291) / 4 : 0 I spent some time on tracking the reason why it failed to do so, and then reached the function assert_loop_rolls_lt(), in which the assumptions are built to make sure we can get the iteration number from the following formula: (iv1->base - iv0->base + step - 1) / step In the example above, iv1->base is t-4, iv0->base is 4 (t>=5 is t>4), and step is 4. This formula works only if -step + 1 <= (iv1->base - iv0->base) <= MAX - step + 1 (MAX is the maximum value of the unsigned variant of type of t, and in this formula we don't have to take care of overflow.) I think when (iv1->base - iv0->base) < -step + 1, then we can assume the number of times the back edge is taken is 0, and that is how niter->may_be_zero is built in this function. And niter->assumptions is built based on (iv1->base - iv0->base) <= MAX - step + 1. Note that we can only get the iteration number of the loop if niter->assumptions is always evaluated as true. However, I found that the build of niter->assumptions does not involve both iv1->base and iv0->base, but only one of them. I think this is possibly a potential bug. 
Further, the reason why we can get the iteration number when t is of signed int type is that the niter->assumptions built here, t-4 < MAX-3, evaluates to true by taking advantage of the fact that overflow on signed int is undefined (so t-4 < MAX-3 can be converted to t < MAX+1, where MAX+1 is assumed not to overflow). But this does not work for unsigned int. One more problem is the way niter->may_be_zero is built. For the loop above, the niter->may_be_zero I got is 4 > t - 4 - (-4 + 1), but we should make sure that t-4 here does not overflow; otherwise niter->may_be_zero is invalid. I think the function assert_loop_rolls_lt() should take more care with unsigned int types. Because of this issue we cannot vectorize this loop, as its iteration number is unknown. Thank you! Cong
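The closed form proposed in this report for unsigned t can be compared with the loop itself. The sketch below is a check under stated assumptions, not GCC code: all three function names are hypothetical, `unsigned int` is assumed to be 32 bits wide, and the iteration number is taken to mean how many times the back edge is taken. Start values below 4 are left out only because the subtraction wraps around and the loop then runs about a billion times.

```c
/* Runs the loop from the report and counts how often the back edge is
   taken, i.e. how often the test t >= 5 succeeds. */
unsigned int back_edges_by_running(unsigned int t)
{
  unsigned int taken = 0;
  for (;;) {
    t -= 4;
    if (!(t >= 5))
      break;
    ++taken;
  }
  return taken;
}

/* The closed form from the report.  With a 32-bit unsigned int,
   adding 4294967291 is the same as subtracting 5 modulo 2^32. */
unsigned int back_edges_by_formula(unsigned int t)
{
  return t != 4 ? (t + 4294967291u) / 4 : 0;
}

/* Returns 1 if both agree for every t in [lo, hi]. */
int formula_matches(unsigned int lo, unsigned int hi)
{
  for (unsigned int t = lo; t <= hi; t++)
    if (back_edges_by_running(t) != back_edges_by_formula(t))
      return 0;
  return 1;
}
```

For example, t = 13 visits 9, 5, and 1, so the back edge is taken twice, and the formula gives (13 - 5) / 4 = 2. The special case t = 4 is needed because t - 5 would wrap to a huge value there.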
[Bug tree-optimization/58513] New: *var and MEM[(const int &)var] (var has int* type) are not treated as the same data ref.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58513

Bug ID: 58513
Summary: *var and MEM[(const int &)var] (var has int* type) are not treated as the same data ref.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

First look at the code below:

int op (const int& x, const int& y) { return x + y; }

void foo(int* a)
{
  for (int i = 0; i < 10; ++i)
    a[i] = op(a[i], 1);
}

GCC will generate the following GIMPLE for this loop after inlining op():

  :
  # i_17 = PHI <0(2), i_23(4)>
  # ivtmp_13 = PHI <10(2), ivtmp_24(4)>
  _12 = (long unsigned int) i_17;
  _2 = _12 * 4;
  _1 = a_6(D) + _2;
  _20 = MEM[(const int &)_1];
  _19 = _20 + 1;
  *_1 = _19;
  i_23 = i_17 + 1;
  ivtmp_24 = ivtmp_13 - 1;
  if (ivtmp_24 != 0)
    goto ;
  else
    goto ;

Here each element of the array a is loaded by MEM[(const int &)_1] and stored by *_1, which are the only two data refs in the loop body. The GCC vectorizer needs to check for possible aliasing between data refs with potential data dependence. Here those two data refs are actually the same one, but GCC does not recognize this fact. As a result, the alias-checking predicate will always return false at runtime (GCC 4.9 can eliminate this generated branch at the end of the vectorization pass).

The reason GCC thinks that MEM[(const int &)_1] and *_1 are two different data refs is a possible defect in the function operand_equal_p(), which is used to compare two data refs. The current implementation uses == to compare the types of the second argument of the MEM_REF operator, which is too strict. Using types_compatible_p() instead fixes the issue above. I have produced a patch to fix it, shown below. Please comment on this patch (bootstrapping and "make check" passed).

thanks,
Cong

Index: gcc/fold-const.c
===
--- gcc/fold-const.c (revision 202662)
+++ gcc/fold-const.c (working copy)
@@ -2693,8 +2693,9 @@ operand_equal_p (const_tree arg0, const_
            && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
                                TYPE_SIZE (TREE_TYPE (arg1)), flags)))
        && types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
-       && (TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1)))
-           == TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
+       && types_compatible_p (
+            TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1))),
+            TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
        && OP_SAME (0) && OP_SAME (1));
     case ARRAY_REF:
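As a source-level illustration of the claim that the load and the store are the same data ref, the sketch below checks that a read through const int& and a write through int* touch the same address for every element. It only mirrors the situation in plain C++; it does not exercise operand_equal_p() or any GCC internals, and address_seen_by_reference and load_and_store_share_address are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Same helper as in the report: both operands are taken by const reference.
inline int op(const int& x, const int& y) { return x + y; }

// Loop from the report: each a[i] is read via const int& and written via int*.
void foo(int* a) {
    for (int i = 0; i < 10; ++i)
        a[i] = op(a[i], 1);
}

// Hypothetical probe: returns the address that a const int& parameter binds to.
const int* address_seen_by_reference(const int& x) { return &x; }

// Returns true when, for every element, the address read through const int&
// equals the address written through int*.
bool load_and_store_share_address(int* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (address_seen_by_reference(a[i]) != &a[i])
            return false;
    return true;
}
```

Since both accesses always resolve to the same address, a runtime alias check between them can never establish independence, which matches the observation in the report.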
[Bug tree-optimization/58508] New: Redundant vector load of "actual" loop invariant in loop body.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

Bug ID: 58508
Summary: Redundant vector load of "actual" loop invariant in loop body.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: congh at google dot com

When GCC vectorizes the loop below, it first does loop versioning with an aliasing check on a and b. Since a and b have different strides (1 and 0), the check guarantees that there is no aliasing between a and b across all iterations. With this precondition, *b becomes a loop invariant, so it can be loaded outside the loop during vectorization (note that this precondition always holds when the loop is being vectorized). This can save us a load and a shuffle instruction in each iteration.

void foo (int* a, int* b, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] += *b;
}

I have a patch handling this case as an optimization. After loop versioning, I detect all zero-strided data references and hoist their loads to the loop header. The patch is shown below.

thanks,
Cong

Index: gcc/tree-vect-loop-manip.c
===
--- gcc/tree-vect-loop-manip.c (revision 202662)
+++ gcc/tree-vect-loop-manip.c (working copy)
@@ -2477,6 +2477,37 @@ vect_loop_versioning (loop_vec_info loop
       adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
     }
 
+  /* Extract load and store statements on pointers with zero-stride
+     accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+    {
+
+      /* In the loop body, we iterate each statement to check if it is a load
+         or store.  Then we check the DR_STEP of the data reference.  If
+         DR_STEP is zero, then we will hoist the load statement to the loop
+         preheader, and move the store statement to the loop exit.  */
+
+      for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
+           !gsi_end_p (si); )
+        {
+          gimple stmt = gsi_stmt (si);
+          stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+          struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+          if (dr && integer_zerop (DR_STEP (dr)))
+            {
+              if (DR_IS_READ (dr))
+                {
+                  basic_block preheader = loop_preheader_edge (loop)->src;
+                  gimple_stmt_iterator si_dst = gsi_last_bb (preheader);
+                  gsi_move_after (&si, &si_dst);
+                }
+            }
+          else
+            gsi_next (&si);
+        }
+    }
+
   /* End loop-exit-fixes after versioning.  */
 
   if (cond_expr_stmt_list)
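The effect the patch aims for can be sketched by hand at source level: version the loop on a runtime alias check, and in the non-aliasing version load *b once before the loop. This is only an illustration under assumptions, not what the vectorizer emits; foo_versioned is a hypothetical name, and the range test (b outside [a, a+n)) is a simplified stand-in for the vectorizer's alias check.

```cpp
#include <cassert>
#include <functional>

// Reference loop from the report: *b is reloaded on every iteration.
void foo(int* a, int* b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += *b;
}

// Hand-versioned variant (hypothetical): hoists the load of *b when the
// runtime check shows that b does not point into a[0..n).
void foo_versioned(int* a, int* b, int n) {
    if (n <= 0)
        return;  // nothing to do, and *b is never dereferenced

    // std::less gives a total order over pointers, even to unrelated objects.
    std::less<const int*> before;
    const bool no_alias = before(b, a) || !before(b, a + n);

    if (no_alias) {
        const int invariant = *b;  // single load, hoisted out of the loop
        for (int i = 0; i < n; ++i)
            a[i] += invariant;
    } else {
        // Fallback version: b may be overwritten by a store to a[i].
        for (int i = 0; i < n; ++i)
            a[i] += *b;
    }
}
```

When b points into a, the fallback loop preserves the original semantics, so foo and foo_versioned agree in both the aliasing and non-aliasing cases.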