[Bug tree-optimization/64365] [4.9 Regression] Predictive commoning after loop vectorization produces incorrect code.

2015-01-15 Thread congh at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64365

--- Comment #9 from Cong Hou congh at google dot com ---
Thanks for the fix, Richard!


[Bug tree-optimization/64365] Predictive commoning after loop vectorization produces incorrect code.

2015-01-13 Thread congh at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64365

--- Comment #1 from Cong Hou congh at google dot com ---
Ping on this bug.


[Bug tree-optimization/64365] New: Predictive commoning after loop vectorization produces incorrect code.

2014-12-19 Thread congh at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64365

Bug ID: 64365
   Summary: Predictive commoning after loop vectorization produces
incorrect code.
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

Compiling the following loop with -O3 on x86-64 produces incorrect code:


void foo(int *in) {
  for (int i = 14; i >= 10; i--) {
    in[i - 8] -= in[i];
    in[i - 5] += in[i] * 2;
    in[i - 4] += in[i];
  }
}


The incorrect code first appears in the pcom pass. Note that after this loop
is vectorized there is a read-after-write data dependence between the second
and third statements in the loop. The correct way to get the vector from
in[i - 4] in the third statement is to read the memory after the write from the
second statement. However, in the pcom pass, that vector is actually preloaded
before the loop. I think pcom ignores the aliasing between the memory addresses
of vector types (in this case MEM[{in[i-3] : in[i-0]}] and
MEM[{in[i-5] : in[i-1]}]).
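The dependence can be modelled by hand. The sketch below is only an illustrative model, not GCC output: foo_vector_model is a hypothetical function that mimics one 4-wide vector iteration covering i = 14..11 followed by a scalar epilogue for i = 10, and its preload flag imitates reading in[7..10] before the second statement has stored in[6..9].

```c
#include <string.h>

/* Scalar reference loop from the report. */
void foo_scalar(int *in) {
  for (int i = 14; i >= 10; i--) {
    in[i - 8] -= in[i];
    in[i - 5] += in[i] * 2;
    in[i - 4] += in[i];
  }
}

/* Hand-written model of the vectorized loop (hypothetical, for illustration).
   One 4-wide step handles i = 11..14, then a scalar epilogue handles i = 10.
   If preload is nonzero, the vector in[7..10] used by statement 3 is read
   before statement 2 stores to in[6..9], which is the suspected pcom effect. */
void foo_vector_model(int *in, int preload) {
  int src[4], v3[4];
  memcpy(src, &in[11], sizeof src);          /* in[11..14] */
  if (preload)
    memcpy(v3, &in[7], sizeof v3);           /* stale copy of in[7..10] */
  for (int k = 0; k < 4; k++)
    in[3 + k] -= src[k];                     /* statement 1 */
  for (int k = 0; k < 4; k++)
    in[6 + k] += src[k] * 2;                 /* statement 2 */
  if (!preload)
    memcpy(v3, &in[7], sizeof v3);           /* load after the store */
  for (int k = 0; k < 4; k++)
    in[7 + k] = v3[k] + src[k];              /* statement 3 */
  /* Scalar epilogue for i = 10. */
  in[2] -= in[10];
  in[5] += in[10] * 2;
  in[6] += in[10];
}
```

With the load placed after the store the model matches the scalar loop; with the preload, the updates that statement 2 made to in[7..9] are overwritten.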


[Bug tree-optimization/63530] GCC generates incorrect aligned store on ARM after the loop is unrolled.

2014-10-15 Thread congh at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63530

--- Comment #2 from Cong Hou congh at google dot com ---
This issue can also be reproduced on x86_64. Compile the following code with
options (assume the file name is t.c): -O2 -ftree-vectorize t.c
-fdump-tree-all-alias


#include <stdlib.h>

typedef struct {
  unsigned char map[256];
  int i;
} A, *AP;

AP foo(int n)
{
  AP b = malloc(sizeof(A));
  int i;
  for (i = n; i < 256; i++)
    b->map[i] = i;
  return b;
}

Then in t.c.116t.vect we can find the following statement:

  # ALIGN = 8, MISALIGN = 0
  vectp_b.15_47 = b_5 + _48;

Here b_5 is obtained from malloc which can be 8 bytes aligned, but _48 is from
input parameter n, and the alignment of vectp_b.15_47 should be unknown instead
of 8 here. I suspect the ptr_info_def object of vectp_b.15_47 is just copied
from that of b_5, which is incorrect.
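To illustrate the point (a sketch only; misalign_of is a hypothetical helper, not something from GCC): once a variable byte offset is added to an 8-byte aligned base, the misalignment of the result is simply the offset modulo 8, so nothing can be claimed about it at compile time.

```c
#include <stdint.h>

/* Returns the misalignment, in bytes, of p relative to an align-byte
   boundary.  Assumes the usual flat mapping from pointers to uintptr_t. */
unsigned misalign_of(const void *p, unsigned align) {
  return (unsigned)((uintptr_t)p % align);
}
```

For an 8-byte aligned buffer buf, misalign_of(buf + n, 8) equals n % 8, which is only zero for some values of n.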


[Bug tree-optimization/63530] New: GCC generates incorrect aligned store on ARM after the loop is unrolled.

2014-10-13 Thread congh at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63530

Bug ID: 63530
   Summary: GCC generates incorrect aligned store on ARM after the
loop is unrolled.
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

Created attachment 33710
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33710&action=edit
assembly

When compiling the code shown below using GCC 5.0 for ARM with the following
options:

-O2 -ftree-vectorize -march=armv7-a -mfpu=neon -funroll-loops
--param=max-completely-peeled-insns=400


// The code:

typedef struct {
  unsigned char map[256];
  int i;
} A, *AP;

void* calloc(int, int);

AP foo(int n)
{
  AP b = calloc(1, sizeof(A));
  int i;
  for (i = n; i < 256; i++)
    b->map[i] = i;
  return b;
}


An instruction

vst1.64 {d0-d1}, [r2:64]

is generated, which is an aligned store with an 8-byte alignment requirement.
However, this requirement cannot be satisfied because the loop is not peeled
for alignment, and the start address in the array is unknown at compile time.

I have attached the generated assembly code here.


[Bug c++/61507] New: GCC does not compile function with parameter pack.

2014-06-13 Thread congh at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61507

Bug ID: 61507
   Summary: GCC does not compile function with parameter pack.
   Product: gcc
   Version: 4.10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

GCC fails to compile the following code:



struct A {
  void foo(const int& );
  void foo(float);
};

template <typename... Args>
void bar(void (A::*memfun)(Args...), Args... args);

void go(const int& i) {
  bar<const int&>(&A::foo, i);
}




The error message is shown below:



t.C:10:30: error: no matching function for call to ‘bar(<unresolved overloaded
function type>, const int&)’
   bar<const int&>(&A::foo, i);
  ^
t.C:7:6: note: candidate: template<class ... Args> void bar(void (A::*)(Args
...), Args ...)
 void bar(void (A::*memfun)(Args...), Args... args);
  ^
t.C:7:6: note:   template argument deduction/substitution failed:
t.C:10:30: note:   inconsistent parameter pack deduction with ‘const int&’ and
‘int’
   bar<const int&>(&A::foo, i);


Since the type is explicitly specified, why does GCC still try to deduce it?

[Bug tree-optimization/60896] [4.10 Regression] ICE: in vect_get_vec_def_for_operand, at tree-vect-stmts.c:1449

2014-04-23 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60896

--- Comment #3 from Cong Hou congh at google dot com ---
Created attachment 32668
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32668&action=edit
The patch to fix PR60896

The cause of this issue is that the statements in PATTERN_DEF_SEQ of a
previously recognized widen-mult pattern are not forwarded to the dot-product
pattern that is recognized later. I have created a patch to fix this.

Another issue is that the def types of the statements in PATTERN_DEF_SEQ are
set to the def type of the pattern statement. This is incorrect for a
reduction pattern statement, in which case all statements in PATTERN_DEF_SEQ
become vect_reduction_def, and none of them will be vectorized later. The
def type of a statement in PATTERN_DEF_SEQ should always be vect_internal_def.

This patch will also be submitted to gcc-patches.


[Bug testsuite/60773] [4.9 Regression] FAIL: gcc.dg/vect/pr60656.c -flto -ffat-lto-objects scan-tree-dump-times vect vectorized 1 loops 1

2014-04-09 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60773

--- Comment #5 from Cong Hou congh at google dot com ---
Hi Jakub

Thank you very much for the commit!


thanks,
Cong


On Wed, Apr 9, 2014 at 4:39 AM, jakub at gcc dot gnu.org
gcc-bugzi...@gcc.gnu.org wrote:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60773

 Jakub Jelinek jakub at gcc dot gnu.org changed:

What|Removed |Added
 
  Status|UNCONFIRMED |RESOLVED
  CC||jakub at gcc dot gnu.org
  Resolution|--- |FIXED

 --- Comment #4 from Jakub Jelinek jakub at gcc dot gnu.org ---
 I went ahead and committed the fix.



[Bug testsuite/60773] [4.9 Regression] FAIL: gcc.dg/vect/pr60656.c -flto -ffat-lto-objects scan-tree-dump-times vect vectorized 1 loops 1

2014-04-07 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60773

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #2 from Cong Hou congh at google dot com ---
This is my fault. I have created a new patch, shown below, to fix this issue. I
have also sent it to gcc-patches.



diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 414a745..ea860e7 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,11 @@
+2014-04-07  Cong Hou  co...@google.com
+
+PR testsuite/60773
+* testsuite/lib/target-supports.exp:
+Add check_effective_target_vect_widen_mult_si_to_di_pattern.
+* gcc.dg/vect/pr60656.c: Update the test by checking if the targets
+vect_widen_mult_si_to_di_pattern and vect_long are supported.
+
 2014-03-28  Cong Hou  co...@google.com

 PR tree-optimization/60656
diff --git a/gcc/testsuite/gcc.dg/vect/pr60656.c b/gcc/testsuite/gcc.dg/vect/pr60656.c
index ebaab62..b80e008 100644
--- a/gcc/testsuite/gcc.dg/vect/pr60656.c
+++ b/gcc/testsuite/gcc.dg/vect/pr60656.c
@@ -1,5 +1,7 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_long } */

+#include <stdarg.h>
 #include "tree-vect.h"

 __attribute__ ((noinline)) long
@@ -12,7 +14,7 @@ foo ()
   for(i = 0; i < 4; ++i)
 {
   long P = v[i];
-  s += P*P*P;
+  s += P * P * P;
 }
   return s;
 }
@@ -27,7 +29,7 @@ bar ()
   for(i = 0; i < 4; ++i)
 {
   long P = v[i];
-  s += P*P*P;
+  s += P * P * P;
   __asm__ volatile ("");
 }
   return s;
@@ -35,11 +37,12 @@ bar ()

 int main()
 {
+  check_vect ();
+
   if (foo () != bar ())
 abort ();
   return 0;
 }

-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_si_to_di_pattern } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
-
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index bee8471..6d9d689 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3732,6 +3732,27 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
 }

 # Return 1 if the target plus current options supports a vector
+# widening multiplication of *int* args into *long* result, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_widen_mult_si_to_di_pattern { } {
+    global et_vect_widen_mult_si_to_di_pattern
+
+    if [info exists et_vect_widen_mult_si_to_di_pattern_saved] {
+        verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: using cached result" 2
+    } else {
+        if {[istarget ia64-*-*]
+              || [istarget i?86-*-*]
+              || [istarget x86_64-*-*] } {
+            set et_vect_widen_mult_si_to_di_pattern_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_widen_mult_si_to_di_pattern: returning $et_vect_widen_mult_si_to_di_pattern_saved" 2
+    return $et_vect_widen_mult_si_to_di_pattern_saved
+}
+
+# Return 1 if the target plus current options supports a vector
 # widening shift, 0 otherwise.
 #
 # This won't change for different subtargets so cache the result.


[Bug tree-optimization/60656] [4.8/4.9 regression] x86 vectorization produces wrong code

2014-03-28 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60656

--- Comment #7 from Cong Hou congh at google dot com ---
Yes, will do it. Thank you a lot!


[Bug tree-optimization/60656] [4.8/4.9 regression] x86 vectorization produces wrong code

2014-03-25 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60656

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #2 from Cong Hou congh at google dot com ---
This bug is caused by an optimization in GCC vectorizer that is not
implemented properly. When a reduction operation is vectorized, the
order of elements in vectors directly used in reduction does not
matter. In some cases the vectorizer may generate less code based on
this fact. GCC assigns a property named vect_used_by_reduction to
all vectors participating in reductions. However, vectors that are
indirectly used in reduction also have this property. For example,
consider the following three statements (all operands are vectors):

a = b op1 c;
d = a op2 e;
s1 = s0 op3 d;

Here assume the last statement is a reduction one, then a,b,c,d,e all
have the property vect_used_by_reduction. However, if op2 is
different from op3, then a's element order can affect the final
result. GCC does not check this.
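A small scalar illustration of why the element order of a matters when op2 differs from op3 (a sketch with hypothetical function names, taking op2 as '*' or '+' and op3 as '+'):

```c
/* s = sum of a[i] * e[i]: op2 is '*', op3 is '+'. */
long mul_then_sum(const long *a, const long *e, int n) {
  long s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] * e[i];
  return s;
}

/* s = sum of a[i] + e[i]: op2 and op3 are both '+'. */
long add_then_sum(const long *a, const long *e, int n) {
  long s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] + e[i];
  return s;
}
```

With a = {1, 2} and e = {3, 4}, swapping the two elements of a changes mul_then_sum from 11 to 10, while add_then_sum stays at 10 either way.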


[Bug tree-optimization/60656] [4.8/4.9 regression] x86 vectorization produces wrong code

2014-03-25 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60656

--- Comment #4 from Cong Hou congh at google dot com ---
Yes, there is a quick fix: we can check if the def with vect_used_by_reduction
is immediately used by a reduction stmt. After all, it seems that
supportable_widening_operation() is the only place that takes advantage of this
"the element order doesn't matter" feature.


diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 70fb411..7442d0c 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -7827,7 +7827,16 @@ supportable_widening_operation (enum tree_code code, gimple stmt,
 stmt, vectype_out, vectype_in,
 code1, code2, multi_step_cvt,
 interm_types))
-   return true;
+{
+  tree lhs = gimple_assign_lhs (stmt);
+  use_operand_p dummy;
+  gimple use_stmt;
+  stmt_vec_info use_stmt_info = NULL;
+  if (single_imm_use (lhs, &dummy, &use_stmt)
+      && (use_stmt_info = vinfo_for_stmt (use_stmt))
+      && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def)
+return true;
+}
   c1 = VEC_WIDEN_MULT_LO_EXPR;
   c2 = VEC_WIDEN_MULT_HI_EXPR;
   break;


[Bug tree-optimization/60505] New: Warning caused by GCC vectorizer.

2014-03-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60505

Bug ID: 60505
   Summary: Warning caused by GCC vectorizer.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

Compiling the code below fails with the options -Wall -Werror -O2
-ftree-loop-vectorize. The reason is that the epilogue generated by the
vectorizer tries to access memory outside of ovec[16], and the vrp pass
emits the warning "array subscript is above array bounds" for the access to
ovec[i]. The vectorizer should not generate the epilogue for this loop.



void foo(char *in, char *out, int num)
{
 int i;
 unsigned char ovec[16] = {0};

 for(i=0; i < num ; ++i)
   out[i] = (ovec[i] = in[i]);
 out[num] = ovec[num/2];
}


[Bug tree-optimization/60505] Warning caused by GCC vectorizer.

2014-03-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60505

--- Comment #1 from Cong Hou congh at google dot com ---
Google ref: b/13403465


[Bug target/58762] [missed optimization] Vectorizing abs(int).

2014-03-05 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58762

Cong Hou congh at google dot com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Cong Hou congh at google dot com ---
(In reply to Cong Hou from comment #3)
 Author: congh
 Date: Thu Oct 31 00:50:47 2013
 New Revision: 204241
 
 URL: http://gcc.gnu.org/viewcvs?rev=204241&root=gcc&view=rev
 Log:
 2013-10-30  Cong Hou  co...@google.com
 
 Backport from mainline:
 2013-10-30  Cong Hou  co...@google.com
 
 PR target/58762
 * config/i386/i386-protos.h (ix86_expand_sse2_abs): New function.
 * config/i386/i386.c (ix86_expand_sse2_abs): New function.
 * config/i386/sse.md: Add SSE2 support to abs (8/16/32-bit-int).
 
 
 Modified:
 branches/google/gcc-4_8/gcc/ChangeLog
 branches/google/gcc-4_8/gcc/config/i386/i386-protos.h
 branches/google/gcc-4_8/gcc/config/i386/i386.c
 branches/google/gcc-4_8/gcc/config/i386/sse.md


[Bug tree-optimization/57512] Vectorizer: cannot handle accumulation loop of signed char type

2013-12-19 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57512

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #2 from Cong Hou congh at google dot com ---
Together with the phi function, consider the following gimple code:


loop:
  # sum_phi = phi (sum_signed, sum_init_signed);
  sum_temp = (short unsigned int) sum_phi;
  sum_unsigned = a + sum_temp;
  sum_signed = (short int) sum_unsigned;


Can we transform the above code to the following one?


sum_init_unsigned = (unsigned short int) sum_init_signed;
loop:
  # sum_phi = phi (sum_unsigned, sum_init_unsigned);
  sum_unsigned = a + sum_phi;

sum_signed = (short int) sum_unsigned;


This transformation should let the vectorizer detect the reduction pattern.
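A scalar sketch of the two forms, to check that they agree. The function names are hypothetical, and the element type of a is assumed to be unsigned short since the gimple above does not show it. The final conversion to short is implementation-defined for out-of-range values; GCC wraps modulo 2^16.

```c
/* Original form: the accumulator is a signed short, but each addition
   is carried out in unsigned short and converted back. */
short sum_signed_acc(const unsigned short *a, int n, short init) {
  short sum = init;
  for (int i = 0; i < n; i++) {
    unsigned short tmp = (unsigned short)sum;
    unsigned short u = (unsigned short)(a[i] + tmp);
    sum = (short)u;
  }
  return sum;
}

/* Transformed form: accumulate in unsigned short throughout and convert
   to signed short once, after the loop. */
short sum_unsigned_acc(const unsigned short *a, int n, short init) {
  unsigned short sum = (unsigned short)init;
  for (int i = 0; i < n; i++)
    sum = (unsigned short)(sum + a[i]);
  return (short)sum;
}
```

Both functions should return the same value for any input, because the round trip through signed short inside the loop does not change the low 16 bits.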


[Bug tree-optimization/59006] [4.9 Regression] internal compiler error: in vect_transform_stmt, at tree-vect-stmts.c:5963

2013-11-19 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59006

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #5 from Cong Hou congh at google dot com ---
Hoisting all vectorized statements may not be the best solution (some loads may
not be necessary outside of the loop), but I think it works and can solve the
current issues. Richard, are you working on this? If you'd like I could also
make a patch with this idea.


thanks,
Cong


[Bug c++/58963] Does C++ need flag_complex_method = 2?

2013-11-14 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58963

--- Comment #3 from Cong Hou congh at google dot com ---
Suppose there is a third-party complex library, which is written in the same
way as <complex>. Then GCC could not recognize it as a complex type, and will
not use builtin calls to calculate multiplication and division.

So why should there be a difference between using the third-party complex
library and the standard library one? After all, <complex> is all written in
source code. <complex> is not the same as _Complex in C99.

If we could use _Complex in C++, that would be fine. But C does not have
<complex>, so we will not run into the situation where the same file t.c is
built with both gcc and g++ and g++ is faster: gcc cannot recognize <complex>.


[Bug tree-optimization/56902] Fails to SLP with mismatched +/- and negatable constants

2013-11-14 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56902

--- Comment #3 from Cong Hou congh at google dot com ---
How do you generate the final operations in vectorized code?

I just submitted a patch for this issue. The patch supports non-isomorphic
operations with the restriction that all operations on even/odd elements are
still isomorphic. Please give me your comments on this patch.

Thank you!


Cong


[Bug tree-optimization/56902] Fails to SLP with mismatched +/- and negatable constants

2013-11-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56902

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #1 from Cong Hou congh at google dot com ---
I just made a patch which supports limited non-isomorphic operations
(operations on even/odd elements are still isomorphic) for SLP. Then the three
loops you listed can be vectorized using SLP by using new VEC_ADDSUB_EXPR or
VEC_SUBADD_EXPR. For x86, SSE3 provides ADDSUBPD/ADDSUBPS instructions which
can do the job, but I also emulated them for SSE (use mask to negate the
even/odd elements and then add).
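The mask-based emulation can be sketched in scalar C as follows. This is only an illustration with hypothetical function names, not the code from the patch; it assumes float is a 32-bit IEEE value and follows the ADDSUBPS lane convention (subtract on even lanes, add on odd lanes).

```c
#include <stdint.h>
#include <string.h>

/* Reference: subtract on even lanes, add on odd lanes. */
void addsub_ref(const float *a, const float *b, float *out, int n) {
  for (int i = 0; i < n; i++)
    out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Emulation: flip the sign bit of b on even lanes with an XOR mask,
   then perform a plain add on every lane. */
void addsub_masked(const float *a, const float *b, float *out, int n) {
  for (int i = 0; i < n; i++) {
    uint32_t bits;
    float negated;
    memcpy(&bits, &b[i], sizeof bits);
    bits ^= (i % 2 == 0) ? 0x80000000u : 0u;
    memcpy(&negated, &bits, sizeof negated);
    out[i] = a[i] + negated;
  }
}
```

In a vector implementation the per-lane mask would be a single constant vector, so the emulation costs one XOR and one add.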

I think we will need to support more general non-isomorphic operations, which
is more difficult and challenging. But I think the limited support in this
patch is also useful at this time.

I will send the patch later.


[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2013-11-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947

Bug 53947 depends on bug 58508, which changed state.

Bug 58508 Summary: [Missed-Optimization] Redundant vector load of actual loop invariant in loop body.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED


[Bug tree-optimization/58508] [Missed-Optimization] Redundant vector load of actual loop invariant in loop body.

2013-11-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

Cong Hou congh at google dot com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Cong Hou congh at google dot com ---
(In reply to congh from comment #8)
 Author: congh
 Date: Fri Nov  8 18:44:46 2013
 New Revision: 204590
 
 URL: http://gcc.gnu.org/viewcvs?rev=204590&root=gcc&view=rev
 Log:
 2013-11-08  Cong Hou  co...@google.com
 
   PR tree-optimization/58508
   * gcc.dg/vect/pr58508.c: Update.
 
 
 Modified:
 trunk/gcc/testsuite/ChangeLog
 trunk/gcc/testsuite/gcc.dg/vect/pr58508.c

[Bug tree-optimization/59050] [4.9 Regression] ICE: tree check: expected integer_cst, have nop_expr in tree_int_cst_lt, at tree.c:7083

2013-11-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59050

Cong Hou congh at google dot com changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Cong Hou congh at google dot com ---
(In reply to congh from comment #4)
 Author: congh
 Date: Mon Nov 11 19:03:39 2013
 New Revision: 204683
 
 URL: http://gcc.gnu.org/viewcvs?rev=204683&root=gcc&view=rev
 Log:
 2013-11-11  Cong Hou  co...@google.com
 
 PR tree-optimization/59050
 * tree-vect-data-refs.c (comp_dr_addr_with_seg_len_pair): Bug fix.
 
 
 Modified:
 trunk/gcc/ChangeLog
 trunk/gcc/tree-vect-data-refs.c


[Bug c++/58963] Does C++ need flag_complex_method = 2?

2013-11-08 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58963

--- Comment #1 from Cong Hou congh at google dot com ---
Any comment on this topic?


thanks,
Cong


[Bug tree-optimization/56717] Enhance Dot-product pattern recognition to avoid mult widening.

2013-11-08 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56717

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #1 from Cong Hou congh at google dot com ---
The approach ICC uses is not related to dot-product. It just finds a smart way
to implement widen-mult (s16 to s32) using PMADDWD.

I will try to make a patch for this issue.


thanks,
Cong


[Bug tree-optimization/56717] Enhance Dot-product pattern recognition to avoid mult widening.

2013-11-08 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56717

--- Comment #2 from Cong Hou congh at google dot com ---
I examined the GCC generated code, and found the main problem is that the load
of 'scale' (rhs operand of ) to an xmm register is in the loop body, which
could be moved outside.

This happened during rtl-reload pass. For the following code, the load to scale
is still outside of the loop body.


void foo(short* a, short scale, int n) {
  int i;
  for (i=0; i<n; i++)
    a[i] = a[i]  scale;
}


But for your code here, it is not. I suspect there may exist some issue in that
pass.

By the way, from my test it turns out that using PMADDWD is no faster than the
way used by GCC now.


[Bug tree-optimization/56764] vect_prune_runtime_alias_test_list not smart enough

2013-11-07 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56764

Cong Hou congh at google dot com changed:

   What|Removed |Added

 CC||congh at google dot com

--- Comment #2 from Cong Hou congh at google dot com ---
I have made a patch for this issue. However, I don't think the example here is
appropriate. Say z1 == &(x[0][4]) (assume VF=4). Then after unrolling the loop
4 times, there is still no data dependence that prevents vectorization.

I think a better example is like the one shown below:



__attribute__((noinline, noclone)) void
foo (float x[3][32], float y1, float y2, float y3, float *z1, float *z2,
     float *z3)
{
  int i;
  for (i = 0; i < 16; i++)
    {
      z1[i] = -y1 * x[0][i*2];
      z2[i] = -y2 * x[1][i*2];
      z3[i] = -y3 * x[2][i*2];
    }
}


Here we have to make sure z1/z2/z3 does not alias with x across the whole range
being traversed. Then we could merge the alias checks between z1 and
x[0][0:32]/x[1][0:32]/x[2][0:32] into one.
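As a rough sketch of what a merged check looks like at run time (hypothetical helper names; this is not the vectorizer's generated code), the three row-wise overlap tests against x[0], x[1] and x[2] collapse into a single test against the whole of x, because the rows are contiguous in memory:

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the byte ranges [p, p + plen) and [q, q + qlen) overlap. */
int ranges_overlap(const void *p, size_t plen, const void *q, size_t qlen) {
  uintptr_t ps = (uintptr_t)p, qs = (uintptr_t)q;
  return ps < qs + qlen && qs < ps + plen;
}

/* One merged check of z[0:16] against x[0][0] .. x[2][31], in place of
   three separate checks against the individual rows. */
int z_may_alias_x(const float *z, float x[3][32]) {
  return ranges_overlap(z, 16 * sizeof(float), x, 3 * 32 * sizeof(float));
}
```

The same merged check would be repeated for z2 and z3, giving three checks in total where nine would otherwise be needed.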


[Bug c++/58963] New: Does C++ need flag_complex_method = 2?

2013-11-01 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58963

Bug ID: 58963
   Summary: Does C++ need flag_complex_method = 2?
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

In the patch http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00560.html, the
builtin function is used to perform complex multiplication and division. This
is to comply with the C99 standard, but I am wondering if C++ also needs this.

There is no complex keyword in C++, and the C++ standard says nothing about the
behavior of operations on complex types. The <complex> header file is all
written in source code, including complex multiplication and division. GCC
should not do too much for them by using builtin calls by default (although we
can set -fcx-limited-range to prevent GCC from doing this), which has a big
impact on performance (not to mention possible vectorization opportunities).

So I propose not setting flag_complex_method to 2 for C++. Any comments?


thanks,
Cong


[Bug tree-optimization/58915] [missed optimization] GCC fails to get the loop bound for some loops.

2013-10-30 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58915

--- Comment #2 from Cong Hou congh at google dot com ---
I am afraid that get_range_info () is of little use here. The value range we
care about may only exist under specific conditions and is hence flow
sensitive. For example, we may need the value range of n in the if body:

if (n > 0)
  if (n < 4)
    /* use of n */

However, n does not have a new name under the condition n > 0 && n < 4, making
it impossible to get the range (0, 4) from the SSA_NAME of n.


[Bug tree-optimization/58508] [Missed-Optimization] Redundant vector load of actual loop invariant in loop body.

2013-10-29 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

--- Comment #7 from Cong Hou congh at google dot com ---
OK. I made a new patch to fix this problem. Waiting to be approved.


thanks,
Cong



diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 9d0f4a5..3d9916d 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29  Cong Hou  co...@google.com
+
+   * gcc.dg/vect/pr58508.c: Update.
+
 2013-10-15  Cong Hou  co...@google.com

* gcc.dg/vect/pr58508.c: New test.
diff --git a/gcc/testsuite/gcc.dg/vect/pr58508.c b/gcc/testsuite/gcc.dg/vect/pr58508.c
index 6484a65..fff7a04 100644
--- a/gcc/testsuite/gcc.dg/vect/pr58508.c
+++ b/gcc/testsuite/gcc.dg/vect/pr58508.c
@@ -1,3 +1,4 @@
+/* { dg-require-effective-target vect_int } */
 /* { dg-do compile } */
 /* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */





On Tue, Oct 29, 2013 at 6:50 AM, bernd.edlinger at hotmail dot de
gcc-bugzi...@gcc.gnu.org wrote:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

 --- Comment #6 from Bernd Edlinger bernd.edlinger at hotmail dot de ---
 (In reply to Cong Hou from comment #5)
 I guess I should add

 /* { dg-require-effective-target vect_int } */

 to the test case. Is that right?

 Yes.



[Bug tree-optimization/58915] New: [missed optimization] GCC fails to get the loop bound for some loops.

2013-10-29 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58915

Bug ID: 58915
   Summary: [missed optimization] GCC fails to get the loop bound
for some loops.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

Getting the correct loop upper bound is important for some optimizations. GCC
tries to get this bound by calling bound_difference() in
tree-ssa-loop-niters.c, where GCC finds all control-dependent predicates of the
loop and attempts to extract bound information from each predicate.

However, GCC fails to get the bound for some loops. The following is such an
example:

unsigned int i;
if (i > 0) {
   ...
   if (i < 4) {
      do {
         ...
         --i;
      } while (i > 0);
   }
}

Clearly the upper bound is 3. But GCC could not get it for this loop. The
reason is that GCC checks i < 4 (i could be zero) and i > 0 separately, and the
upper bound cannot be calculated from either condition alone. Those two
conditions may not be combined into one, as there may be other statements
between them.

One possible solution is to let GCC collect all conditions first and then merge
them before calculating the upper bound.
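A small sketch that makes the bound concrete (count_iterations is a hypothetical helper that wraps the loop above and counts how often its body runs):

```c
/* Runs the guarded do-while loop from the example and returns how many
   times the loop body executed.  Returns 0 when a guard rejects i. */
unsigned count_iterations(unsigned int i) {
  unsigned n = 0;
  if (i > 0) {
    if (i < 4) {
      do {
        ++n;
        --i;
      } while (i > 0);
    }
  }
  return n;
}
```

Only i = 1, 2, 3 reach the loop, so the body runs at most 3 times. Neither guard alone gives this: i < 4 admits i = 0, and i > 0 has no upper limit.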


Any comments?


[Bug tree-optimization/58508] [Missed-Optimization] Redundant vector load of actual loop invariant in loop body.

2013-10-28 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

--- Comment #5 from Cong Hou congh at google dot com ---
I guess I should add 

/* { dg-require-effective-target vect_int } */

to the test case. Is that right?


[Bug tree-optimization/58728] [missed optimization] == or != comparisons may affect range test optimization.

2013-10-21 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728

--- Comment #1 from Cong Hou congh at google dot com ---
Any comment on this?


thanks,
Cong


[Bug tree-optimization/58728] New: [missed optimization] == or != comparisons may affect range test optimization.

2013-10-14 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58728

Bug ID: 58728
   Summary: [missed optimization] == or != comparisons may affect
range test optimization.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

Created attachment 31002
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31002&action=edit
Patch

Look at the following code:


int foo(unsigned int n)
{
  if (n != 0)
  if (n != 1)
  if (n != 2)
  if (n != 3)
  if (n != 4)
return ++n;
  return n;
}


These five comparisons should be merged into one during range test
optimization, but they are not. The reason is that GCC checks the phi args of n
after the branch to make sure the false edges of two neighboring ifs define the
same phi arg at the join node (which guarantees the absence of side effects).
However, the vrp pass replaced the phi arg with the identical value of the
original phi arg deduced from the == or != comparisons, hence preventing the
range test optimization.
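For reference, the merged form one would hope for is a single unsigned range test. In the sketch below foo_chain restates the function above and foo_merged is a hand-written equivalent (both names are hypothetical):

```c
/* The original chain of five != comparisons. */
int foo_chain(unsigned int n) {
  if (n != 0)
    if (n != 1)
      if (n != 2)
        if (n != 3)
          if (n != 4)
            return ++n;
  return n;
}

/* Hand-merged form: n is outside {0, 1, 2, 3, 4} exactly when n > 4. */
int foo_merged(unsigned int n) {
  if (n > 4)
    return ++n;
  return n;
}
```

The two functions return the same value for every n, since for an unsigned n the five inequalities together are equivalent to n > 4.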

The same problem occurs in the if-combine pass.

I made a patch for this issue which is attached here.


[Bug tree-optimization/58686] vect_get_loop_niters() fails for some loops

2013-10-11 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58686

--- Comment #2 from Cong Hou congh at google dot com ---
I think this issue is more like a missed optimization.

If the iteration number can be calculated as a constant value at compile time,
then the function assert_loop_rolls_lt() won't be called due to an early exit
(specifically in the function number_of_iterations_lt() at the call to
number_of_iterations_lt_to_ne()). That is why I could not craft a test case
showing a miscompile.

A better test case is shown below:


#define N 4
void foo(int* a, unsigned int i)
{
  int j = 0;
  do
  {
    a[j++] = 0;
    i -= 4;
  }
  while (i >= N);
}


Compile it with -O3 and the result uses __builtin_memset(), as the niter can be
calculated. But if the value of N is replaced by another value such as 3 or 5,
GCC no longer optimizes this loop into __builtin_memset().


[Bug tree-optimization/58686] New: [BUG] vect_get_loop_niters() could not get the correct result for some loops.

2013-10-10 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58686

Bug ID: 58686
   Summary: [BUG] vect_get_loop_niters() could not get the correct
result for some loops.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

Look at the following loop:


  unsigned int t = ...;
  do {
    ...
    t -= 4;
  } while (t >= 5);


When I tried to get the iteration number of this loop as an expression using
vect_get_loop_niters(), it gave me the result scev_not_known. If I change
the type of t to signed int, I get the result below:


t > 4 ? ((unsigned int) t + 4294967291) / 4 : 0


But even when t is unsigned, we should still get the result as:


t != 4 ? (t + 4294967291) / 4 : 0
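
As a sanity check of the unsigned expression above, the sketch below counts
how often the back edge of the loop is taken by actually running it, and
compares that with the closed form. The function names are hypothetical and a
32-bit unsigned int is assumed (4294967291 is 2^32 - 5).

```c
#include <assert.h>

/* Count how many times the back edge of
     do { t -= 4; } while (t >= 5);
   is taken, by running the loop.  */
static unsigned int
latch_count_simulated (unsigned int t)
{
  unsigned int n = 0;
  for (;;)
    {
      t -= 4;              /* wraps modulo 2^32 when t < 4 */
      if (!(t >= 5))
        break;
      ++n;
    }
  return n;
}

/* Closed form from the report for unsigned t (32-bit unsigned int
   is an assumption of this sketch).  */
static unsigned int
latch_count_closed_form (unsigned int t)
{
  return t != 4 ? (t + 4294967291u) / 4 : 0;
}

/* Hypothetical self-check over inputs that do not wrap.  */
static void
check_latch_count (void)
{
  for (unsigned int t = 4; t <= 1000; ++t)
    assert (latch_count_simulated (t) == latch_count_closed_form (t));
}
```

For t < 4 the first subtraction wraps around, so the simulated loop runs for
roughly 2^30 iterations before t drops below 5; the self-check skips those
inputs only to stay fast.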


I spent some time tracking down why it fails to do so, and ended up in the
function assert_loop_rolls_lt(), in which the assumptions are built to make
sure we can get the iteration number from the following formula:


(iv1->base - iv0->base + step - 1) / step


In the example above, iv1->base is t-4, iv0->base is 4 (t>=5 is t>4), and step
is 4. This formula works only if


-step + 1 <= (iv1->base - iv0->base) <= MAX - step + 1

(MAX is the maximum value of the unsigned variant of type of t, and in this
formula we don't have to take care of overflow.)


I think when (iv1->base - iv0->base) < -step + 1, we can assume that the back
edge is taken 0 times, and that is how niter->may_be_zero is built in this
function. niter->assumptions is built based on (iv1->base - iv0->base) <=
MAX - step + 1. Note that we can only get the iteration number of the loop if
niter->assumptions always evaluates to true.

However, I found that the construction of niter->assumptions does not involve
both iv1->base and iv0->base, but only one of them. I think this may be a bug.

Further, the reason why we can get the iteration number if t is of signed int
type is that the niter->assumptions built here, t-4 < MAX-3, is evaluated to
true by taking advantage of the fact that overflow on signed int is undefined
(so t-4 < MAX-3 can be converted to t < MAX+1, where MAX+1 is assumed not to
overflow). But this does not work for unsigned int.
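
To make the unsigned case concrete, the sketch below evaluates the same
assumption in unsigned arithmetic, where t-4 wraps around for small t instead
of being undefined. The function names are hypothetical, and UINT_MAX stands
in for MAX.

```c
#include <assert.h>
#include <limits.h>

/* The assumption t - 4 < MAX - 3, evaluated with wrapping unsigned
   arithmetic (MAX is taken to be UINT_MAX in this sketch).  */
static int
assumption_holds_unsigned (unsigned int t)
{
  return t - 4u < UINT_MAX - 3u;
}

/* Hypothetical self-check showing that the assumption is not always
   true for unsigned t, so it cannot be folded to a constant.  */
static void
check_assumption (void)
{
  assert (assumption_holds_unsigned (100u));  /* no wrap: 96 < MAX - 3 */
  assert (assumption_holds_unsigned (4u));    /* 0 < MAX - 3 */
  assert (!assumption_holds_unsigned (2u));   /* 2 - 4 wraps to MAX - 1 */
  assert (!assumption_holds_unsigned (0u));   /* 0 - 4 wraps to MAX - 3 */
}
```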

One more problem is how niter->may_be_zero is built. For the loop above, the
niter->may_be_zero I got is 4 > t - 4 - (-4 + 1), but we should make sure that
t-4 here does not overflow. Otherwise niter->may_be_zero is invalid. I think
the function assert_loop_rolls_lt() should take more care with unsigned int
types.

With this issue we cannot vectorize this loop as its iteration number is
unknown.


Thank you!

Cong


[Bug tree-optimization/58508] New: Redundant vector load of actual loop invariant in loop body.

2013-09-23 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58508

Bug ID: 58508
   Summary: Redundant vector load of actual loop invariant in
loop body.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

When GCC vectorizes the loop below, it first does loop versioning with an
aliasing check on a and b. Since a and b have different strides (1 and 0), the
check guarantees that there is no aliasing between a and b across all
iterations. With this precondition, *b becomes a loop invariant, so it can be
loaded outside the loop during vectorization (note that this precondition
always holds whenever the vectorized version of the loop runs). This saves a
load and a shuffle instruction in each iteration.


void foo (int* a, int* b, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] += *b;
}
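
The effect of versioning plus hoisting can be sketched by hand at the source
level, as below. The function names and the byte-range overlap test are
assumptions made for illustration; this is not the code that GCC emits, only a
scalar model of "check for aliasing once, then load *b once".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Original loop: *b is reloaded on every iteration.  */
static void
foo_reload (int *a, int *b, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] += *b;
}

/* Hand-versioned loop: if *b cannot overlap a[0..n-1], load it once
   before the loop; otherwise fall back to the original loop.  */
static void
foo_versioned (int *a, int *b, int n)
{
  if (n <= 0)
    return;

  uintptr_t a_lo = (uintptr_t) a;
  uintptr_t a_hi = (uintptr_t) (a + n);
  uintptr_t b_lo = (uintptr_t) b;
  uintptr_t b_hi = b_lo + sizeof (int);

  if (b_hi <= a_lo || a_hi <= b_lo)
    {
      int invariant = *b;       /* hoisted load */
      for (int i = 0; i < n; ++i)
        a[i] += invariant;
    }
  else
    {
      for (int i = 0; i < n; ++i)
        a[i] += *b;
    }
}

/* Hypothetical self-check: both versions agree with and without
   aliasing between a and b.  */
static void
check_versioning (void)
{
  int x[5] = { 1, 2, 3, 4, 5 }, y[5] = { 1, 2, 3, 4, 5 };
  int k = 10;
  foo_reload (x, &k, 5);
  foo_versioned (y, &k, 5);
  assert (memcmp (x, y, sizeof x) == 0);

  int p[5] = { 1, 2, 3, 4, 5 }, q[5] = { 1, 2, 3, 4, 5 };
  foo_reload (p, &p[2], 5);     /* b points into a */
  foo_versioned (q, &q[2], 5);
  assert (memcmp (p, q, sizeof p) == 0);
}
```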


I have a patch handling this case as an optimization. After loop versioning, I
detect all zero-strided data references and hoist their loads to the loop
preheader. The patch is shown below.


thanks,
Cong



Index: gcc/tree-vect-loop-manip.c
===
--- gcc/tree-vect-loop-manip.c (revision 202662)
+++ gcc/tree-vect-loop-manip.c (working copy)
@@ -2477,6 +2477,37 @@ vect_loop_versioning (loop_vec_info loop
   adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
 }

+  /* Extract load and store statements on pointers with zero-stride
+     accesses.  */
+  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
+    {
+
+      /* In the loop body, we iterate each statement to check if it is a load
+         or store. Then we check the DR_STEP of the data reference.  If
+         DR_STEP is zero, then we will hoist the load statement to the loop
+         preheader, and move the store statement to the loop exit.  */
+
+      for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
+           !gsi_end_p (si); )
+        {
+          gimple stmt = gsi_stmt (si);
+          stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+          struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+
+          if (dr && integer_zerop (DR_STEP (dr)))
+            {
+              if (DR_IS_READ (dr))
+                {
+                  basic_block preheader = loop_preheader_edge (loop)->src;
+                  gimple_stmt_iterator si_dst = gsi_last_bb (preheader);
+                  gsi_move_after (&si, &si_dst);
+                }
+            }
+          else
+            gsi_next (&si);
+        }
+    }
+
   /* End loop-exit-fixes after versioning.  */

   if (cond_expr_stmt_list)


[Bug tree-optimization/58513] New: *var and MEM[(const int &)var] (var has int* type) are not treated as the same data ref.

2013-09-23 Thread congh at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58513

Bug ID: 58513
   Summary: *var and MEM[(const int &)var]  (var has int* type)
are not treated as the same data ref.
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: congh at google dot com

First look at the code below:


int op (const int& x, const int& y) { return x + y; }

void foo(int* a)
{
  for (int i = 0; i < 10; ++i)
    a[i] = op(a[i], 1);
}


GCC will generate the following GIMPLE for this loop after inlining op():


  <bb 3>:
  # i_17 = PHI <0(2), i_23(4)>
  # ivtmp_13 = PHI <10(2), ivtmp_24(4)>
  _12 = (long unsigned int) i_17;
  _2 = _12 * 4;
  _1 = a_6(D) + _2;
  _20 = MEM[(const int &)_1];
  _19 = _20 + 1;
  *_1 = _19;
  i_23 = i_17 + 1;
  ivtmp_24 = ivtmp_13 - 1;
  if (ivtmp_24 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;


Here each element of the array a is loaded by MEM[(const int &)_1] and stored
by *_1, which are the only two data refs in the loop body. The GCC vectorizer
needs to check for possible aliasing between data refs with a potential data
dependence. Here the two data refs are actually the same one, but GCC does not
recognize this. As a result, the alias-checking predicate will always return
false at runtime (GCC 4.9 can eliminate this generated branch at the end of the
vectorization pass).

The reason why GCC thinks that MEM[(const int &)_1] and *_1 are two different
data refs is a possible defect in the function operand_equal_p(), which is used
to compare two data refs. The current implementation uses == to compare the
types of the second operand of MEM_REF, which is too strict. Using
types_compatible_p() instead fixes the issue above. I have produced a patch to
fix it, shown below. Please comment on this patch (bootstrapping and make check
passed).
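
The difference between an identity test and a compatibility test on type nodes
can be sketched with a toy model. Everything below (the toy_type structure, the
kinds, and the compatibility rule that ignores qualifiers and treats any two
pointer-like types to the same target as interchangeable) is a hypothetical
simplification for illustration and does not reproduce GCC's real data
structures or rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for compiler type nodes; NOT GCC's real structures.  */
enum toy_kind { TOY_INT, TOY_POINTER, TOY_REFERENCE };

struct toy_type
{
  enum toy_kind kind;
  const struct toy_type *target;   /* pointee, or NULL for TOY_INT */
};

/* Identity comparison, analogous in spirit to using == on type nodes.  */
static bool
toy_same_node (const struct toy_type *a, const struct toy_type *b)
{
  return a == b;
}

/* Looser check (an assumption of this toy model): two pointer-like
   types with the same target are treated as interchangeable.  */
static bool
toy_compatible (const struct toy_type *a, const struct toy_type *b)
{
  if (a == b)
    return true;
  bool a_ptr = a->kind != TOY_INT;
  bool b_ptr = b->kind != TOY_INT;
  return a_ptr && b_ptr && a->target == b->target;
}

/* Hypothetical self-check: a pointer and a reference to int are
   different nodes but compatible under the looser rule.  */
static void
check_toy_types (void)
{
  static const struct toy_type int_type = { TOY_INT, NULL };
  static const struct toy_type int_ptr = { TOY_POINTER, &int_type };
  static const struct toy_type int_ref = { TOY_REFERENCE, &int_type };

  assert (!toy_same_node (&int_ptr, &int_ref));
  assert (toy_compatible (&int_ptr, &int_ref));
  assert (!toy_compatible (&int_type, &int_ptr));
}
```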


thanks,
Cong



Index: gcc/fold-const.c
===
--- gcc/fold-const.c (revision 202662)
+++ gcc/fold-const.c (working copy)
@@ -2693,8 +2693,9 @@ operand_equal_p (const_tree arg0, const_
                   && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
                                       TYPE_SIZE (TREE_TYPE (arg1)), flags)))
              && types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
-             && (TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1)))
-                 == TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
+             && types_compatible_p (
+                  TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg0, 1))),
+                  TYPE_MAIN_VARIANT (TREE_TYPE (TREE_OPERAND (arg1, 1))))
              && OP_SAME (0) && OP_SAME (1));

 case ARRAY_REF: