An issue on loop optimization/vectorization

2018-07-11 Thread Jiangning Liu
For the case below, the code generated by “gcc -O3” is very ugly.



char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)

{

char *d = g_d, *s1 = g_s1, *s2 = g_s2;



for( int y = 0; y < 128; y++ )

{

for( int x = 0; x < 16; x++ )

d[x] = s1[x] + s2[x];

d += 16;

}

}



If we change “for( int x = 0; x < 16; x++ )” to “for( int x = 0; x < 32; x++ )”,
very beautiful vectorized code is generated:



test_loop:

.LFB0:

.cfi_startproc

adrp x2, g_s1

adrp x3, g_s2

add x2, x2, :lo12:g_s1

add x3, x3, :lo12:g_s2

adrp x0, g_d

adrp x1, g_d+2048

add x0, x0, :lo12:g_d

add x1, x1, :lo12:g_d+2048

ldp q1, q2, [x2]

ldp q3, q0, [x3]

add v1.16b, v1.16b, v3.16b

add v0.16b, v0.16b, v2.16b

.p2align 3,,7

.L2:

str q1, [x0]

str q0, [x0, 16]!

cmp x0, x1

bne .L2

ret


The code generated for " for( int x = 0; x < 8; x++ )" is also very ugly.
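Since s1 and s2 never advance in the outer loop, the 16-byte sum is
loop-invariant, and hoisting it by hand sidesteps the problem; the remaining
store loop then vectorizes trivially. A workaround sketch (not compiler
output; dst is a hypothetical 2048-byte destination, sized for the 128
stores of 16 bytes):

static char dst[2048];

void test_loop_hoisted(void)
{
  char t[16];
  /* Compute the invariant 16-byte sum once.  */
  for (int x = 0; x < 16; x++)
    t[x] = g_s1[x] + g_s2[x];
  /* Then just replicate it; this loop is a trivial vector copy.  */
  for (int y = 0; y < 128; y++)
    __builtin_memcpy (dst + 16 * y, t, 16);
}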


It looks like GCC has potential bugs in loop vectorization. Any ideas?



Thanks,

-Jiangning


Re: [RFC, ivopts] fix bugs in ivopts address cost computation

2012-07-05 Thread Jiangning Liu
Hi,

For the following code change,

@@ -4212,11 +4064,6 @@ get_computation_cost_at (struct ivopts_d
 cost.cost += adjust_setup_cost (data,
                                 add_cost (TYPE_MODE (ctype), speed));

-  /* Having offset does not affect runtime cost in case it is added to
- symbol, but it increases complexity.  */
-  if (offset)
-cost.complexity++;
-
   cost.cost += add_cost (TYPE_MODE (ctype), speed);

   aratio = ratio > 0 ? ratio : -ratio;

I think this shouldn't be removed. The offset may be affected by the
position where the reduction variable accumulation statement is
inserted; the cases before and after the accumulation are different, so
the cost of replacing the use point with the reduction variable should
differ accordingly.

BTW, I personally think the current ivopts cost modelling basically
works fine, although some tuning may be needed. The most difficult part
is that the choice of reduction variable candidates interacts with
register pressure cost, while the register cost estimate is not accurate
enough at this stage because we don't have the back-end live-range
interference graph at all. We can always find holes in particular cases
or benchmarks, but we can't tune only for an optimal result on those;
the tuning needs to be backed by more comprehensive results.

Thanks,
-Jiangning

2012/6/6 Sandra Loosemore san...@codesourcery.com:
 My colleagues and I have been working on the GCC port for the Qualcomm
 Hexagon.  Along the way I noticed that we were getting poor results
 from the ivopts pass no matter how we adjusted the target-specific RTX
 costs.  In many cases ivopts was coming up with candidate use costs
 that seemed completely inconsistent with the target cost model.  On
 further inspection, I found what appears to be a whole bunch of bugs
 in the way ivopts is computing address costs:

 (1) While the address cost computation is assuming in some situations
 that pre/post increment/decrement addressing will be used if
 supported by the target, it isn't actually using the target's address
 cost for such forms -- instead, just the cost of the form that would
 be used if autoinc weren't available/applicable.
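A sketch of the direction point (1) suggests (hypothetical and simplified;
reg, mem_mode, address_mode, as, speed, msize and ainc_step stand in for the
surrounding ivopts state): when assuming a post-increment store will be
used, query the target's cost for that exact address form.

  if (USE_STORE_POST_INCREMENT (mem_mode) && msize == ainc_step)
    /* Cost the autoinc form itself, not the plain (reg) fallback.  */
    acost = address_cost (gen_rtx_POST_INC (address_mode, reg),
                          mem_mode, as, speed);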

 (2) The computation to determine which multiplier values are supported
 by target addressing modes is constructing an address rtx of the form
 (reg * ratio) to do the tests.  This isn't a valid address RTX on
 Hexagon, although both (reg + reg * ratio) and (sym + reg * ratio)
 are.  Because it's choosing the wrong address form to probe with, it
 thinks that the target doesn't support multipliers at all and is
 incorrectly tacking on an extra cost for them.  I also note that it's
 assuming that the same set of ratios are supported by all three
 address forms that can potentially include them, and that all valid
 ratios have the same cost.
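For point (2), a sketch of probing a base-plus-scaled-index form instead of
a bare (mult reg ratio), patterned loosely on the shape of
multiplier_allowed_in_address_p (variable names illustrative):

  rtx reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1);
  rtx reg2 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 2);
  rtx addr = gen_rtx_PLUS (address_mode, reg1,
                           gen_rtx_MULT (address_mode, reg2,
                                         GEN_INT (ratio)));
  bool ok = memory_address_addr_space_p (mem_mode, addr, as);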

 (3) The computation to determine the range of valid constant offsets
 for address forms that can include them is probing the upper end of
 the range using constants of the form ((1n) - 1).  On Hexagon, the
 offsets have to be aligned appropriately for the mode, so it's
 incorrectly rejecting all positive offsets for non-char modes.  And
 again, it's assuming that the same range of offsets are supported by
 all address forms that can legitimately include them, and that all
 valid offsets have the same cost.  The latter isn't true on Hexagon.

 (4) The cost adjustment for converting the symbol_present address to a
 var_present address seems overly optimistic in assuming that the
 symbol load will be hoisted outside the loop.  I looked at a lot of
 code where this was not happening no matter how expensive I made the
 absolute addressing forms in the target-specific costs.  Also, if
 subsequent passes actually do hoist the symbol load, this requires an
 additional register to be available over the entire loop, which ivopts
 isn't accounting for in any way.  It seems to me that this adjustment
 shouldn't be attempted when the symbol_present form is a legitimate
 address (because subsequent passes are not doing the anticipated
 optimization in that case).  There's also a bug present in the cost
 accounting: it's adding the full cost of the symbol + var addition,
 whereas it should be pro-rated across iterations (see the way this is
 handled for the non-address cases in get_computation_cost_at).
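The pro-rating point (4) asks for is, in spirit, a division of the one-time
setup cost by the expected iteration count, as is done for the non-address
cases. A simplified, illustrative helper (not the actual ivopts code):

  static unsigned
  prorate_setup_cost (unsigned setup_cost, unsigned avg_loop_niter)
  {
    /* A cost paid once outside the loop counts 1/niter per iteration.  */
    return setup_cost / (avg_loop_niter ? avg_loop_niter : 1);
  }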

 (5) The documentation of TARGET_ADDRESS_COST says that when two
 (legitimate) address expressions have the same lowest cost, the one
 with the higher complexity is used.  But, the ivopts code is doing the
 opposite of this and choosing the lower complexity as the tie-breaker.
 (Actually, it's also computing the complexity without regard to
 whether the address rtx is even legitimate on the target.)

 (6) The way get_address_costs is precomputing and caching a complete
 set of cost data for each addressing mode seems incorrect to me, in
 the general case.  For example, consider MIPS where MIPS16 

RE: A question about loop ivopt

2012-05-22 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, May 15, 2012 10:17 PM
 To: Zdenek Dvorak
 Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu
 Subject: Re: A question about loop ivopt
 
 On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak
 rakd...@iuuk.mff.cuni.cz wrote:
  Hi,
 
  [original question trimmed; it is quoted in full further below in this
  digest]
   
I suppose Zdenek may remember.
  
    computation_cost actually expands the expression to RTL.  Since cost
    estimates are computed a lot in ivopts, using it directly would
    require a huge amount of memory,
 
  Zdenek,
 
  Do you know how huge it is? Any data on this? For GCC, is this huge
  memory consumption critical enough, and isn't there anything else
  consuming even more memory?
 
  no, not really (I haven't worked on this for a few years now),
 although
  of course I did some measurements when ivopts were created.  Feel
 free
  to experiment with that, if it turned out that the memory consumption
  and extra time spent by it is negligible, it would be a nice cleanup.
 
 Well, I don't think we should feed arbitrary expressions to expand at
 IVOPTs time.  What probably matters most is address costs, no?
 At least that is where expand probably makes the most difference.
 So why not improve force_expr_to_var_cost instead?

OK, yes, the thing that matters most is just address cost, so I can
improve force_expr_to_var_cost.

Would it sound OK if I exposed MODES_TIEABLE_P to the middle end by
defining a new target hook? I need this function to strip some operations
and make the cost estimate more accurate. If I don't expand to RTL, I
need some way to check mode conversions in the middle end anyway. Any
ideas?
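A sketch of what such a hook could look like in target.def (the hook name,
docstring, and default below are invented for illustration; this is the
proposal under discussion, not existing code):

DEFHOOK
(modes_tieable_p,
 "Return true if a value of mode MODE1 is accessible in mode MODE2\n\
without copying.",
 bool, (enum machine_mode mode1, enum machine_mode mode2),
 hook_bool_mode_mode_true)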

Thanks,
-Jiangning

 
 Richard.
 
 
  Zdenek






RE: A question about loop ivopt

2012-05-22 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, May 22, 2012 6:36 PM
 To: Jiangning Liu
 Cc: Zdenek Dvorak; Jiangning Liu; gcc@gcc.gnu.org
 Subject: Re: A question about loop ivopt
 
 On Tue, May 22, 2012 at 11:19 AM, Jiangning Liu jiangning@arm.com
 wrote:
 
 
   [quoted exchange trimmed; see the previous message in this thread,
   above]
 
 You are already in the middle-end and thus can use MODES_TIEABLE_P
 directly.  Modes are also available on gimple variables via
 DECL/TYPE_MODE.

Richard,

But MODES_TIEABLE_P is a macro hook and isn't exposed at the tree level,
so I would have to modify xxx-protos.h for all back ends.

An alternative is to define a new function hook. That way I wouldn't
need to change all back ends, only add support to the back ends that
require it first.

Which solution is usually preferred?

Thanks,
-Jiangning

 
 Richard.
 
  Thanks,
  -Jiangning
 
 
  Richard.
 
 
   Zdenek
 
 
 
 






A question about loop ivopt

2012-05-14 Thread Jiangning Liu
Hi,

Why can't we replace function force_expr_to_var_cost directly with function
computation_cost in tree-ssa-loop-ivopt.c?

Actually I think the current recursive algorithm in
force_expr_to_var_cost is inaccurate for estimating expr cost.
computation_cost, instead, can take some back-end factors into account
and make the estimate more accurate.

For example, using computation_cost we can check whether back ends can
tie mode transformations expressed by SUBREG or not. With
force_expr_to_var_cost, the extra computations around type
promotion/demotion would increase the estimated cost (see the small
example below).

Looking at the algorithm in force_expr_to_var_cost, it just analyzes the
operators in the expression and gives an estimate. Should that be the
same as expanding the expr to RTX and estimating, as computation_cost
does?

Any thoughts?
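One concrete instance of the SUBREG point (an illustrative source example,
not taken from the thread): on many 64-bit targets a DImode-to-SImode
truncation is a free lowpart subreg when the two modes are tieable, yet a
purely tree-level cost recursion would charge it as a real conversion.

int truncate_it (long v)
{
  return (int) v;   /* often just (subreg:SI (reg:DI ...) 0) at RTL level */
}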

Thanks,
-Jiangning





Why can't copy renaming capture this assignment?

2012-03-30 Thread Jiangning Liu
Hi,

For this small case, 

char garr[100];
void f(void)
{
unsigned short h, s;

s = 20;

for (h = 1; h < (s-1); h++)
{
garr[h] = 0;
}
}

After copyrename3, we have the following dump,

f ()
{
  short unsigned int h;
  int D.4066;

<bb 2>:
  D.4066_14 = 1;
  if (D.4066_14 <= 18)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:
  # h_15 = PHI <h_8(3), 1(2)>
  # D.4066_16 = PHI <D.4066_4(3), D.4066_14(2)>
  garr[D.4066_16] = 0;
  h_8 = h_15 + 1;
  D.4066_4 = (int) h_8;
  if (D.4066_4 <= 18)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 4>:
  return;

}

Copy renaming fails to capture the assignment statement "D.4066_4 =
(int) h_8;" to trigger rename-partition coalescing.

I find that gimple_assign_single_p, invoked by
gimple_assign_ssa_name_copy_p, always returns false, because for this
statement gs->gsbase.subcode is NOP_EXPR rather than GIMPLE_SINGLE_RHS.

Should subcode be initialized correctly somewhere to fix this problem?
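For reference, the predicate in question is roughly this (from gimple.c
of that era, slightly simplified); gimple_assign_single_p is the part
that inspects subcode:

bool
gimple_assign_ssa_name_copy_p (gimple gs)
{
  return (gimple_assign_single_p (gs)  /* requires subcode GIMPLE_SINGLE_RHS */
          && TREE_CODE (gimple_assign_lhs (gs)) == SSA_NAME
          && TREE_CODE (gimple_assign_rhs1 (gs)) == SSA_NAME);
}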

BTW, my expectation after copy renaming is like below, 

f ()
{
  int D.4679;

<bb 2>:
  D.4679_7 = 1;
  if (D.4679_7 != 19)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:
  # D.4679_15 = PHI <D.4679_4(3), D.4679_7(2)>
  # D.4679_17 = PHI <D.4679_14(3), 1(2)>
  garr[D.4679_15] = 0;
  D.4679_14 = D.4679_17 + 1;
  D.4679_4 = D.4679_14;
  if (D.4679_4 != 19)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 4>:
  return;

}

and then PRE can finally remove that redundancy for symbol D.4679.

Thanks,
-Jiangning







[PATCH][Testsuite] XFAIL scev-3/4.c and add scev-5.c

2012-03-21 Thread Jiangning Liu
Hi,

This patch is to XFAIL scev-3.c and scev-4.c, and to add scev-5.c.

The bug is going to be fixed after Richard Guenther fixes a series of
problems related to POINTER_PLUS_EXPR and sizetype precision.

Thanks,
-Jiangning 

ChangeLog for testsuite:

2012-03-21  Jiangning Liu  jiangning@arm.com

PR tree-optimization/52563
* gcc.dg/tree-ssa/scev-3.c: XFAIL on lp64.
* gcc.dg/tree-ssa/scev-4.c: XFAIL on lp64.
* gcc.dg/tree-ssa/scev-5.c: New.

Thanks,
-Jiangning

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
index 28d5c93..ed63a18 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -14,5 +14,5 @@ f(int k)
 }
 }
 
-/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" { xfail lp64 }
+} } */
 /* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
index 6c1e530..a538c32 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -19,5 +19,5 @@ f(int k)
 }
 }
 
-/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" { xfail lp64 }
+} } */
 /* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-5.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-5.c
new file mode 100644
index 000..b9de36a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-5.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+long long i;
+
+for (i=k; i<1000; i+=k) {
+a_p = &a[i];
+*a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */





RE: [PATCH] Improve SCEV for array element

2012-03-07 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, March 06, 2012 9:12 PM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Improve SCEV for array element
 
 On Fri, Jan 20, 2012 at 10:06 AM, Jiangning Liu jiangning@arm.com
 wrote:
  It's definitely not ok at this stage but at most for next stage1.
 
  OK. I may wait until next stage1.
 
  This is a very narrow pattern-match.  It doesn't allow for &a[i].x for
  example,
  even if a[i] is a one-element structure.  I think the canonical way
 of
  handling
  ADDR_EXPR is to use sth like
 
  base = get_inner_reference (TREE_OPERAND (rhs1, 0), ...,
 offset,  ...);
  base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
          chrec1 = analyze_scalar_evolution (loop, base);
          chrec2 = analyze_scalar_evolution (loop, offset);
          chrec1 = chrec_convert (type, chrec1, at_stmt);
          chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
          res = chrec_fold_plus (type, chrec1, chrec2);
 
  where you probably need to handle scev_not_known when analyzing
 offset
  (which might be NULL).  You also need to add bitpos to the base
 address
  (in bytes, of course).  Note that the MEM_REF case would naturally
  work
  with this as well.
 
  OK. New patch is like below, and bootstrapped on x86-32.
 
 You want instead of
 
 +  if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF
 +  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
 +  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == COMPONENT_REF)
 +{
 
 if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
 || handled_component_p (TREE_OPERAND (rhs1, 0)))
   {
 
 + base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
 + chrec1 = analyze_scalar_evolution (loop, base);
 
 can you please add a wrapper
 
 tree
 analyze_scalar_evolution_for_address_of (struct loop *loop, tree var)
 {
   return analyze_scalar_evolution (loop, build_fold_addr_expr (var));
 }
 
 and call that instead of building the ADDR_EXPR there?  We want
 to avoid building that tree node, but even such a simple wrapper would
 be prefered.
 
 + if (bitpos)
 
 if (bitpos != 0)
 
 + chrec3 = build_int_cst (integer_type_node,
 + bitpos / BITS_PER_UNIT);
 
 please use size_int (bitpos / BITS_PER_UNIT) instead.  Using
 integer_type_node is definitely wrong.
 
 Ok with those changes.
 

Richard,

Modified as you suggested; the new code is below. Bootstrapped on
x86-32. OK for trunk now?

Thanks,
-Jiangning

ChangeLog:

2012-03-08  Jiangning Liu  jiangning@arm.com

* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
array reference and component reference.
(analyze_scalar_evolution_for_address_of): New.

ChangeLog for testsuite:

2012-03-08  Jiangning Liu  jiangning@arm.com

* gcc.dg/tree-ssa/scev-3.c: New.
* gcc.dg/tree-ssa/scev-4.c: New.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+   int i;
+
+   for (i=k; i<1000; i+=k) {
+   a_p = &a[i];
+   *a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
new file mode 100644
index 000..6c1e530
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options -O2 -fdump-tree-optimized } */
+
+typedef struct {
+   int x;
+   int y;
+} S;
+
+int *a_p;
+S a[1000];
+
+f(int k)
+{
+   int i;
+
+   for (i=k; i<1000; i+=k) {
+   a_p = &a[i].y;
+   *a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..c719984
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -266,6 +266,8 @@ along with GCC; see the file COPYING3.  If not see
 #include params.h
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree, tree);
+static tree analyze_scalar_evolution_for_address_of (struct loop *loop,
+tree var);
 
 /* The cached information about an SSA name VAR, claiming that below
basic block INSTANTIATED_BELOW, the value of VAR can be expressed
@@ -1712,16 +1714,59 @@ interpret_rhs_expr (struct loop *loop, gimple
at_stmt,
   switch (code)
 {
 case ADDR_EXPR:
-  /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR

RE: A question about redundant PHI expression stmt

2012-02-28 Thread Jiangning Liu


 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 Jiangning Liu
 Sent: Tuesday, February 28, 2012 11:19 AM
 To: 'William J. Schmidt'
 Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
 Subject: RE: A question about redundant PHI expression stmt
 
 
 
  [earlier exchange trimmed; it is quoted in full later in this digest]

With dump enabled, I do see a PHI copy

RE: A question about redundant PHI expression stmt

2012-02-28 Thread Jiangning Liu


 -Original Message-
 From: Jiangning Liu
 Sent: Tuesday, February 28, 2012 4:07 PM
 To: Jiangning Liu; 'William J. Schmidt'
 Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
 Subject: RE: A question about redundant PHI expression stmt
 
 
 
   [earlier exchange trimmed; it is quoted in full later in this digest]

RE: A question about redundant PHI expression stmt

2012-02-27 Thread Jiangning Liu


 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 William J. Schmidt
 Sent: Monday, February 27, 2012 11:32 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
 Subject: Re: A question about redundant PHI expression stmt
 
 On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:
   [original test case and analysis trimmed; the full message appears
   below in this digest]
 
 
 Hi, sorry not to have responded sooner -- I just now got some time to
 look at this.
 
 I compiled your code with -O3 for revisions 182139 and 182140 of GCC
 trunk, and found no difference between the code produced by the middle
 end for the two versions.  So something else has apparently come along
 since then that helped produce the problematic code generation for you.
 Either that or I need to use different compile flags; you didn't
 specify
 what you used.
 
 The fix in r182140 does just what you saw in dom2:  identifies
 duplicate
 PHIs in the same block and ensures only one of them is used.  This
 actually avoids inserting extra blocks during expand in certain loop
 cases.  I am not sure why you are getting redundant copies as a result,
 but it sounds from your comments like IRA didn't coalesce a register
 copy or something like that.  You may want to bisect revisions on the
 trunk to see where your bad code generation started to occur to get a
 better handle on what happened.
 
 As Richard said, the dom pass is likely to be removed someday, whenever
 someone can get around to it.  My redundant-phi band-aid in dom would
 go
 away then as well, but presumably an extra pass of PRE would replace it,
 and redundant PHIs would still be removed.
 

Bill,

Thanks a lot for your explanation!

The bug could be exposed with -O2 if you apply the check-in r184435 made by 
Richard.

BTW, at present is there anybody working on the NEW pass replacing dom? If no, 
in short term, it would be good to still get it fixed just because it is 
hurting benchmark. If Yes, may I know what that NEW pass will be focusing on?

Thanks,
-Jiangning

 Thanks,
 Bill
 
  Thanks,
  -Jiangning
 
 
 
 






A question about redundant PHI expression stmt

2012-02-24 Thread Jiangning Liu
Hi,

For the small case below, some redundant PHI expression stmts are
generated, and they finally cause the back end to generate redundant
copy instructions due to some reasons around IRA.

int *l, *r, *g;
void test_func(int n)
{
int i;
static int j;
static int pos, direction, direction_pre;

pos = 0;
direction = 1;

for ( i = 0; i < n; i++ )
{
direction_pre = direction;

for ( j = 0; j <= 400; j++ )
{

if ( g[pos] == 200 )
break;

if ( direction == 0 )
pos = l[pos];
else
pos = r[pos];

if ( pos == -1 )
{
if ( direction_pre != direction )
break;

pos = 0;
direction = !direction;
}

}

f(g[pos]);
}
}

I know PR39976 has something to do with this case, and check-in r182140
caused a big degradation on the real benchmark, but I'm still confused
about this issue.

The procedure is like this,

Loop store motion generates code below,

<bb 6>:
  D.4084_17 = l.4_13 + D.4077_70;
  pos.5_18 = *D.4084_17;
  pos_lsm.34_103 = pos.5_18;
  goto <bb 8>;

<bb 7>:
  D.4088_23 = r.6_19 + D.4077_70;
  pos.7_24 = *D.4088_23;
  pos_lsm.34_104 = pos.7_24;

<bb 8>:
  # prephitmp.29_89 = PHI <pos.5_18(6), pos.7_24(7)>
  # pos_lsm.34_53 = PHI <pos_lsm.34_103(6), pos_lsm.34_104(7)>
  pos.2_25 = prephitmp.29_89;
  if (pos.2_25 == -1)
    goto <bb 9>;
  else
    goto <bb 20>;

And then, copy propagation transforms block 8 into 

<bb 8>:
  # prephitmp.29_89 = PHI <pos.5_18(11), pos.7_24(12)>
  # pos_lsm.34_53 = PHI <pos.5_18(11), pos.7_24(12)>
  ...

And then, these two duplicated PHI stmts are kept until the expand pass. 

In dom2, a stmt like below

  # pos_lsm.34_54 = PHI <pos_lsm.34_53(13), 0(16)>

is transformed into

  # pos_lsm.34_54 = PHI <prephitmp.29_89(13), 0(16)>

just because they are equal.

Unfortunately, this transformation changed the back-end behavior to
generate redundant copy instructions and hurt performance. This case is
from a real benchmark and the regression is significant.

Does the fix in r182140 intend to totally clean up this kind of redundancy?
Where should we get it fixed?

Thanks,
-Jiangning





RE: A problem about loop store motion

2012-02-21 Thread Jiangning Liu
 The MEM form is more canonical, so the loop SM machinery to detect
 equality should be adjusted accordingly.  Alternatively you can teach
 PRE insertion to strip off the MEM if possible (though
 fold_stmt_inplace should
 already do this if possible).

Richard,

Thank you! You are right. I noticed on latest trunk the problem in PRE was
already fixed by invoking fold_stmt_inplace.

Unfortunately, for this small case the latest trunk code still can't do
SM for variable pos, because refs_may_alias_p (*D.4074_10, pos) is true;
that is, pos aliases l[pos].

I think alias analysis should be able to know they don't alias each
other, unless there is an assignment statement like l = &pos;. 

Can alias analysis fix the problem?

Thanks,
-Jiangning

 
 Richard.
 





RE: A problem about loop store motion

2012-02-21 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, February 21, 2012 5:40 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: Re: A problem about loop store motion
 
 On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu jiangning@arm.com
 wrote:
   [previous exchange trimmed; it appears in full above in this digest]
 
 The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think
 its address got taken.  'l' points to any global possibly address-taken
 variable.  It gets the TREE_ADDRESSABLE flag via PRE insertion which
 re-gimplifies the tree it creates which marks the variable as
 addressable.
 
 I'll have a look.

Terrific! :-) Could you please let me know when you have a patch to fix
this, because I want to double check the big case, and there might be other
hidden problems?

Thanks,
-Jiangning

 
 Richard.
 
 
 
  Thanks,
  -Jiangning
 
 
  Richard.
 
 
 
 






RE: A problem about loop store motion

2012-02-21 Thread Jiangning Liu


 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 Richard Guenther
 Sent: Tuesday, February 21, 2012 6:19 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: Re: A problem about loop store motion
 
 On Tue, Feb 21, 2012 at 10:57 AM, Jiangning Liu jiangning@arm.com
 wrote:
 
 
  -Original Message-
  From: Richard Guenther [mailto:richard.guent...@gmail.com]
  Sent: Tuesday, February 21, 2012 5:40 PM
  To: Jiangning Liu
  Cc: gcc@gcc.gnu.org
  Subject: Re: A problem about loop store motion
 
  On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu
 jiangning@arm.com
  wrote:
    [earlier exchange trimmed; it appears in full above in this digest]
 
  The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think
  its address got taken.  'l' points to any global possibly address-
 taken
  variable.  It gets the TREE_ADDRESSABLE flag via PRE insertion
 which
  re-gimplifies the tree it creates which marks the variable as
  addressable.
 
  I'll have a look.
 
  Terrific! :-) Could you please let me know when you have a patch to
 fix
  this, because I want to double check the big case, and there might be
 other
  hidden problems?
 
 PR52324, I am testing the following patch (in reality the gimplifier
 should not
 mark things addressable unless it itself makes them so, but frontends
 are
 broken, so we do that ... ugh).
 

Richard,

Now trunk works for the big case as well. Thanks a lot for your quick fix.

BTW, why can't we fix the front end directly?

Thanks,
-Jiangning

 Index: gcc/gimplify.c
 ===
 --- gcc/gimplify.c  (revision 184428)
 +++ gcc/gimplify.c  (working copy)
 @@ -7061,15 +7061,20 @@ gimplify_expr (tree *expr_p, gimple_seq
   ret = GS_OK;
   break;
 }
 - ret = gimplify_expr (TREE_OPERAND (*expr_p, 0), pre_p,
 post_p,
 -  is_gimple_mem_ref_addr, fb_rvalue);
 - if (ret == GS_ERROR)
 -   break;
 + /* Avoid re-gimplifying the address operand if it is already
 +in suitable form.  */
 + if (!is_gimple_mem_ref_addr (TREE_OPERAND (*expr_p, 0)))
 +   {
 + ret = gimplify_expr (TREE_OPERAND (*expr_p, 0), pre_p,
 post_p,
 +  is_gimple_mem_ref_addr, fb_rvalue);
 + if (ret == GS_ERROR)
 +   break;
 +   }
   recalculate_side_effects (*expr_p);
   ret = GS_ALL_DONE;
   break;
 
 - /* Constants need not be gimplified.  */
 +   /* Constants need not be gimplified.  */
 case INTEGER_CST:
 case REAL_CST:
 case FIXED_CST:






RE: A problem about loop store motion

2012-02-20 Thread Jiangning Liu


 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 Jiangning Liu
 Sent: Friday, February 17, 2012 5:53 PM
 To: gcc@gcc.gnu.org
 Subject: A problem about loop store motion
 
 Hi,
 
 For this small test case,
 
  [test case trimmed; it appears in full below in this digest]
 
 In the middle end, I don't see pos sunk out of the loop by loop store
 motion. Any idea?
 
 The dump after lim is below; I expect an SSA symbol xxx_lsm to be
 created by this pass.
 
 ;; Function test_func (test_func, funcdef_no=0, decl_uid=4057,
 cgraph_uid=0)
 
 
 Symbols to be put in SSA form
 { .MEM }
 Incremental SSA update started at block: 0
 Number of blocks in CFG: 12
 Number of blocks to update: 11 ( 92%)
 
 
  test_func ()
  {
    int pretmp.14;
    unsigned int pretmp.13;
    int prephitmp.12;
    int pretmp.11;
    unsigned int pretmp.10;
    int pretmp.9;
    int D.4088;
    static int pos;
    int direction;
    int i;
    _Bool D.4082;
    int pos.5;
    int * D.4078;
    int * r.4;
    int pos.3;
    int * D.4074;
    unsigned int D.4073;
    unsigned int pos.2;
    int pos.1;
    int * l.0;

  <bb 2>:
    pos = 0;
    l.0_6 = l;
    r.4_12 = r;

  <bb 3>:
    # i_32 = PHI <i_21(11), 0(2)>
    # direction_37 = PHI <direction_2(11), 1(2)>
    # prephitmp.12_35 = PHI <pretmp.11_1(11), 0(2)>
    if (direction_37 == 0)
      goto <bb 4>;
    else
      goto <bb 5>;

  <bb 4>:
    pos.1_7 = prephitmp.12_35;
    pos.2_8 = (unsigned int) pos.1_7;
    D.4073_9 = pos.2_8 * 4;
    D.4074_10 = l.0_6 + D.4073_9;
    pos.3_11 = *D.4074_10;
    pos = pos.3_11;
    goto <bb 6>;

  <bb 5>:
    pos.1_13 = prephitmp.12_35;
    pos.2_14 = (unsigned int) pos.1_13;
    D.4073_15 = pos.2_14 * 4;
    D.4078_16 = r.4_12 + D.4073_15;
    pos.5_17 = *D.4078_16;
    pos = pos.5_17;

  <bb 6>:
    # prephitmp.12_31 = PHI <pos.3_11(4), pos.5_17(5)>
    pos.1_18 = prephitmp.12_31;
    if (pos.1_18 == -1)
      goto <bb 7>;
    else
      goto <bb 10>;

  <bb 10>:
    goto <bb 8>;

  <bb 7>:
    pos = 0;
    D.4088_36 = direction_37 ^ 1;
    direction_20 = D.4088_36 & 1;

  <bb 8>:
    # direction_2 = PHI <direction_37(10), direction_20(7)>
    i_21 = i_32 + 1;
    if (i_21 != 401)
      goto <bb 11>;
    else
      goto <bb 9>;

  <bb 11>:
    pretmp.11_1 = pos;

Hi,

pos in the RHS seems to be transformed to MEM[&pos] by the code in
tree-ssa-sccvn.c like below,

case VAR_DECL:
  if (DECL_HARD_REGISTER (ref))
{
  temp.op0 = ref;
  break;
}
  /* Fallthru.  */
case PARM_DECL:
case CONST_DECL:
case RESULT_DECL:
  /* Canonicalize decls to MEM[&decl] which is what we end up with
 when valueizing MEM[ptr] with ptr = &decl.  */
  temp.opcode = MEM_REF;
  temp.op0 = build_int_cst (build_pointer_type (TREE_TYPE (ref)),
0);
  temp.off = 0;
  VEC_safe_push (vn_reference_op_s, heap, *result, temp);
  temp.opcode = ADDR_EXPR;
  temp.op0 = build_fold_addr_expr (ref);
  temp.type = TREE_TYPE (temp.op0);
  temp.off = -1;
  break;

But the LHS doesn't get this kind of transformation on gimple, so the
code below in gather_mem_refs_stmt of tree-ssa-loop-im.c can't see that
MEM[&pos] is just pos:

  hash = iterative_hash_expr (*mem, 0);
  slot = htab_find_slot_with_hash (memory_accesses.refs, *mem, hash,
INSERT);

Finally, pos is marked as dependent on pretmp.11_1 by the code below in
ref_indep_loop_p_1, and then loop store motion fails to handle pos.

  EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
{
  aref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
  if (!refs_independent_p (ref, aref))
{
  ret = false;
  record_indep_loop (loop, aref, false);
  break;
}
}
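An illustration of the mismatch (a hypothetical snippet; pos_decl stands
for the VAR_DECL of pos): the two trees below denote the same storage,
yet iterative_hash_expr hashes them differently, so the hash-table
lookup treats them as two distinct memory references.

  tree t1 = pos_decl;                          /* plain "pos"   */
  tree t2 = build2 (MEM_REF, TREE_TYPE (pos_decl),
                    build_fold_addr_expr (pos_decl),
                    build_int_cst (build_pointer_type
                                     (TREE_TYPE (pos_decl)), 0));
                                               /* "MEM[&pos]"   */
  /* iterative_hash_expr (t1, 0) != iterative_hash_expr (t2, 0), so the
     two references never share a slot.  */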

Should we fix this problem by enhancing iterative_hash_expr, or should
we avoid using MEM[&pos] to describe pos at all?

Any hints? I need to understand why we have to use MEM[&pos] to describe
pos; this looks strange to me.

BTW, this bug caused a significant regression for an important benchmark.

Thanks,
-Jiangning

    goto <bb 3>;

  <bb 9>:
    return 401;

  }
 
 Thanks,
 -Jiangning
 
 
 






A problem about loop store motion

2012-02-17 Thread Jiangning Liu
Hi,

For this small test case,

int *l, *r;
int test_func(void)
{
int i;
int direction;
static int pos;

pos = 0;
direction = 1;

for ( i = 0; i <= 400; i++ )
{
if ( direction == 0 )
pos = l[pos];
else
pos = r[pos];

if ( pos == -1 )
{
pos = 0;
direction = !direction;
}
}
return i;
}

In the middle end, I don't see pos sunk out of the loop by loop store
motion. Any idea?

The dump after lim is below; I expect an SSA symbol xxx_lsm to be created
by this pass.

;; Function test_func (test_func, funcdef_no=0, decl_uid=4057, cgraph_uid=0)


Symbols to be put in SSA form
{ .MEM }
Incremental SSA update started at block: 0
Number of blocks in CFG: 12
Number of blocks to update: 11 ( 92%)


test_func ()
{
  int pretmp.14;
  unsigned int pretmp.13;
  int prephitmp.12;
  int pretmp.11;
  unsigned int pretmp.10;
  int pretmp.9;
  int D.4088;
  static int pos;
  int direction;
  int i;
  _Bool D.4082;
  int pos.5;
  int * D.4078;
  int * r.4;
  int pos.3;
  int * D.4074;
  unsigned int D.4073;
  unsigned int pos.2;
  int pos.1;
  int * l.0;

<bb 2>:
  pos = 0;
  l.0_6 = l;
  r.4_12 = r;

<bb 3>:
  # i_32 = PHI <i_21(11), 0(2)>
  # direction_37 = PHI <direction_2(11), 1(2)>
  # prephitmp.12_35 = PHI <pretmp.11_1(11), 0(2)>
  if (direction_37 == 0)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  pos.1_7 = prephitmp.12_35;
  pos.2_8 = (unsigned int) pos.1_7;
  D.4073_9 = pos.2_8 * 4;
  D.4074_10 = l.0_6 + D.4073_9;
  pos.3_11 = *D.4074_10;
  pos = pos.3_11;
  goto <bb 6>;

<bb 5>:
  pos.1_13 = prephitmp.12_35;
  pos.2_14 = (unsigned int) pos.1_13;
  D.4073_15 = pos.2_14 * 4;
  D.4078_16 = r.4_12 + D.4073_15;
  pos.5_17 = *D.4078_16;
  pos = pos.5_17;

<bb 6>:
  # prephitmp.12_31 = PHI <pos.3_11(4), pos.5_17(5)>
  pos.1_18 = prephitmp.12_31;
  if (pos.1_18 == -1)
    goto <bb 7>;
  else
    goto <bb 10>;

<bb 10>:
  goto <bb 8>;

<bb 7>:
  pos = 0;
  D.4088_36 = direction_37 ^ 1;
  direction_20 = D.4088_36 & 1;

<bb 8>:
  # direction_2 = PHI <direction_37(10), direction_20(7)>
  i_21 = i_32 + 1;
  if (i_21 != 401)
    goto <bb 11>;
  else
    goto <bb 9>;

<bb 11>:
  pretmp.11_1 = pos;
  goto <bb 3>;

<bb 9>:
  return 401;

}
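For contrast, a hand-written sketch of the shape store motion was expected
to produce (not compiler output): pos lives in a temporary inside the loop
and is stored back exactly once after it.

  int pos_lsm = 0;
  for (i = 0; i <= 400; i++)
    {
      pos_lsm = (direction == 0) ? l[pos_lsm] : r[pos_lsm];
      if (pos_lsm == -1)
        {
          pos_lsm = 0;
          direction = !direction;
        }
    }
  pos = pos_lsm;   /* the single, sunk store */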

Thanks,
-Jiangning





RE: [PATCH] Improve SCEV for array element

2012-02-13 Thread Jiangning Liu


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
 Sent: Friday, January 20, 2012 5:07 PM
 To: 'Richard Guenther'
 Cc: gcc-patches@gcc.gnu.org
 Subject: RE: [PATCH] Improve SCEV for array element
 
  It's definitely not ok at this stage but at most for next stage1.
 
 OK. I may wait until next stage1.
 
  [rest of the January mail trimmed; it is quoted in full below in this
  digest]
 

Richard,

PING... Is this patch OK after branch 4.7 is created and trunk is open
again?

Thanks,
-Jiangning

 [patch quoted in full below in this digest]

RE: [PATCH] Improve SCEV for array element

2012-01-20 Thread Jiangning Liu
 It's definitely not ok at this stage but at most for next stage1.

OK. I may wait until next stage1.

 This is a very narrow pattern-match.  It doesn't allow for &a[i].x for
 example,
 even if a[i] is a one-element structure.  I think the canonical way of
 handling
 ADDR_EXPR is to use sth like
 
 base = get_inner_reference (TREE_OPERAND (rhs1, 0), ..., offset,  ...);
 base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
 chrec1 = analyze_scalar_evolution (loop, base);
 chrec2 = analyze_scalar_evolution (loop, offset);
 chrec1 = chrec_convert (type, chrec1, at_stmt);
 chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
 res = chrec_fold_plus (type, chrec1, chrec2);
 
 where you probably need to handle scev_not_known when analyzing offset
 (which might be NULL).  You also need to add bitpos to the base address
 (in bytes, of course).  Note that the MEM_REF case would naturally
 work
 with this as well.

OK. New patch is like below, and bootstrapped on x86-32.

ChangeLog:

2012-01-20  Jiangning Liu  jiangning@arm.com

* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
array reference and component reference.


ChangeLog for testsuite:

2012-01-20  Jiangning Liu  jiangning@arm.com

* gcc.dg/tree-ssa/scev-3.c: New.
* gcc.dg/tree-ssa/scev-4.c: New.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+   int i;
+
+   for (i=k; i<1000; i+=k) {
+   a_p = &a[i];
+   *a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
new file mode 100644
index 000..6c1e530
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options -O2 -fdump-tree-optimized } */
+
+typedef struct {
+   int x;
+   int y;
+} S;
+
+int *a_p;
+S a[1000];
+
+f(int k)
+{
+   int i;
+
+   for (i=k; i<1000; i+=k) {
+   a_p = &a[i].y;
+   *a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..4e06b75
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,16 +1712,61 @@ interpret_rhs_expr (struct loop *loop, gimple
at_stmt,
   switch (code)
 {
 case ADDR_EXPR:
-  /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.
*/
-  if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
-   {
- res = chrec_dont_know;
- break;
-   }
+  if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF
+  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
+  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == COMPONENT_REF)
+{
+ enum machine_mode mode;
+ HOST_WIDE_INT bitsize, bitpos;
+ int unsignedp;
+ int volatilep = 0;
+ tree base, offset;
+ tree chrec3;
+
+ base = get_inner_reference (TREE_OPERAND (rhs1, 0),
+ bitsize, bitpos, offset,
+ mode, unsignedp, volatilep, false);
+
+ if (TREE_CODE (base) == MEM_REF)
+   {
+ rhs2 = TREE_OPERAND (base, 1);
+ rhs1 = TREE_OPERAND (base, 0);
+
+ chrec1 = analyze_scalar_evolution (loop, rhs1);
+ chrec2 = analyze_scalar_evolution (loop, rhs2);
+ chrec1 = chrec_convert (type, chrec1, at_stmt);
+ chrec2 = chrec_convert (TREE_TYPE (rhs2), chrec2, at_stmt);
+ res = chrec_fold_plus (type, chrec1, chrec2);
+   }
+ else
+   {
+ base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
+ chrec1 = analyze_scalar_evolution (loop, base);
+ chrec1 = chrec_convert (type, chrec1, at_stmt);
+ res = chrec1;
+   }
 
-  rhs2 = TREE_OPERAND (TREE_OPERAND (rhs1, 0), 1);
-  rhs1 = TREE_OPERAND (TREE_OPERAND (rhs1, 0), 0);
-  /* Fall through.  */
+ if (offset != NULL_TREE)
+   {
+ chrec2 = analyze_scalar_evolution (loop, offset);
+ chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
+ res = chrec_fold_plus (type, res, chrec2);
+   }
+
+ if (bitpos)
+   {
+ gcc_assert ((bitpos % BITS_PER_UNIT) == 0);
+
+ chrec3 = build_int_cst (integer_type_node,
+ bitpos / BITS_PER_UNIT

[PATCH] Improve SCEV for array element

2012-01-17 Thread Jiangning Liu
This code change intends to improve scev for array elements and reduce the
common subexpressions in loops that may be introduced by multiple
references to expressions like &a[i]. With this optimization the register
pressure in loops can be reduced.
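
As a hedged illustration (mine, not part of the patch): once &a[i] in the
scev-1.c loop below gets a chrec like {&a + 4*k, +, 4*k}_1, later passes
such as ivopts can rewrite the repeated address computation into a single
strength-reduced pointer, roughly:

/* Illustrative sketch only, reusing a and k from the test case below;
   the real rewrite happens on GIMPLE, not in source.  */
void f_reduced (int k)
{
  int *p;
  for (p = &a[k]; p < &a[1000]; p += k)
    *p = 100;
}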

The problem is originally from a real benchmark, and the test case only
tries to detect the GIMPLE level changes.

Bootstrapped on x86-32. OK for trunk?

ChangeLog:

2012-01-05  Jiangning Liu  jiangning@arm.com

* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
array reference.

ChangeLog for testsuite:

2012-01-05  Jiangning Liu  jiangning@arm.com

* gcc.dg/scev-1.c: New.

diff --git a/gcc/testsuite/gcc.dg/scev-1.c b/gcc/testsuite/gcc.dg/scev-1.c
new file mode 100644 index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/scev-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+   int i;
+
+   for (i=k; i<1000; i+=k) {
+   a_p = &a[i];
+   *a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c index
2077c8d..de89fc4
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,6 +1712,42 @@ interpret_rhs_expr (struct loop *loop, gimple
at_stmt,
   switch (code)
 {
 case ADDR_EXPR:
+  if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF)
+{
+ tree array_ref;
+ tree var_decl, base, offset;
+ tree low_bound;
+ tree unit_size;
+ tree index;
+
+ array_ref = TREE_OPERAND (rhs1, 0);
+ var_decl = TREE_OPERAND (array_ref, 0);
+ index = TREE_OPERAND (array_ref, 1);
+
+ low_bound = array_ref_low_bound (array_ref);
+ unit_size = array_ref_element_size (array_ref);
+
+ /* We assume all arrays have sizes that are a multiple of a byte.
+First subtract the lower bound, if any, in the type of the
+index, then convert to sizetype and multiply by the size of
+the array element.  */
+ if (! integer_zerop (low_bound))
+   index = fold_build2 (MINUS_EXPR, TREE_TYPE (index),
+index, low_bound);
+
+ offset = size_binop (MULT_EXPR,
+  fold_convert (sizetype, index),
+  unit_size);
+
+ base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), var_decl);
+ chrec1 = analyze_scalar_evolution (loop, base);
+ chrec2 = analyze_scalar_evolution (loop, offset);
+  chrec1 = chrec_convert (type, chrec1, at_stmt);
+  chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
+ res = chrec_fold_plus (type, chrec1, chrec2);
+ break;
+}
+
   /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.  */
   if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
{

Thanks,
-Jiangning

scev.patch
Description: Binary data


RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)

2012-01-11 Thread Jiangning Liu


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
 Sent: Wednesday, December 21, 2011 2:48 PM
 To: 'Richard Henderson'
 Cc: gcc-patches@gcc.gnu.org; 'Richard Guenther'
 Subject: RE: [RFC] Use REG_EXPR in back-end (introduced by optimization
 to conditional and/or in ARM back-end)
 
 
  -Original Message-
  From: Richard Henderson [mailto:r...@redhat.com]
  Sent: Tuesday, November 22, 2011 9:55 AM
  To: Jiangning Liu
  Cc: gcc-patches@gcc.gnu.org; 'Richard Guenther'
  Subject: Re: [RFC] Use REG_EXPR in back-end (introduced by
 optimization
  to conditional and/or in ARM back-end)
 
  On 11/21/2011 05:31 PM, Jiangning Liu wrote:
  My question essentially is "May I really use REG_EXPR in back-end
  code?" like the patch I gave below?
 
  I suppose.
 
  Another alternative is to use BImode for booleans.  Dunno how much of
  that you'd be able to glean from mere rtl expansion or if you'd need
  boolean decls to be expanded with BImode.
 
 Hi Richard,
 
 The output of the expand pass requires that the operands meet the
 requirement of general_operand for binary operations, i.e. all RTX
 operations on the top level must meet the hardware instruction
 requirement. Generally, for hardware not directly supporting a
 single-bit logic operation, we can't generate a BImode rtx dest.
 
 So if I simply generate BImode for NE in the expand pass, like
 (subreg:SI (ne:BI (reg:SI xxx) (const_int 0))), there would be an
 assertion failure due to failing to find an appropriate instruction
 code to operate on a single bit.
 
 This looks like a GCC design level issue. How would you suggest
 generating a
 legitimate rtx expression containing BImode?
 

Can anybody give me useful suggestions on this issue?

My question essentially is may I really use BImode to solve my original
problem?

Thanks,
-Jiangning

 Thanks,
 -Jiangning
 
 
  The latter would probably need a target hook.  I've often wondered how
  much ia64 would benefit from that too, being able to store bool
  variables directly in predicate registers.
 
  All of this almost certainly must wait until stage1 opens up again
  though...
 
 
  r~
 
 
 
 






RE: A case exposing code sink issue

2012-01-05 Thread Jiangning Liu
   In final value replacement, expression &a + D. can be figured out,
   while &a[i_xxx] failed to be CHRECed, so I'm wondering if we should
   lower &a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC
   intends to keep &a[i_xxx] until cfgexpand pass. 

Richard,

Thanks a lot for your explanation!

Any comments on this question as well? Does GCC intend to keep &a[i] until
the expand pass? Any special reason? If I change the CHREC algorithm, I see
ivopts would have done this lowering, so we wouldn't see &a[i] in the expand
pass. Would anything be hurt?

Thanks,
-Jiangning






RE: A case exposing code sink issue

2012-01-05 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Thursday, January 05, 2012 5:32 PM
 To: Jiangning Liu
 Cc: Michael Matz; gcc@gcc.gnu.org
 Subject: Re: A case exposing code sink issue
 
 On Thu, Jan 5, 2012 at 9:34 AM, Jiangning Liu jiangning@arm.com
 wrote:
   In final value replacement, expression &a + D. can be figured out,
   while &a[i_xxx] failed to be CHRECed, so I'm wondering if we should
   lower &a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC
   intends to keep &a[i_xxx] until cfgexpand pass.
 
  Richard,
 
  Thanks a lot for your explanation!
 
  Any comments on this question as well? Does GCC intend to keep &a[i]
  until the expand pass?
 
 It's kept as it suits - it's shorter and easier to analyze (we know the
 implicit
 multiplication won't overflow)
 
  Any special reason? If I change the CHREC algorithm, I see ivopts
  would have done this lowering, so we wouldn't see &a[i] in the expand
  pass. Would anything be hurt?
 
 No, I think changing the CHREC algorithm is ok (I didn't look closely
 at
 your patch).  This is stage1 material though.

Actually I didn't send it out at all. :-) And I just sent out an RFC here:
http://gcc.gnu.org/ml/gcc-patches/2012-01/msg00236.html - can you have a
look?

 
 Thanks,
 Richard.
 
  Thanks,
  -Jiangning
 
 
 
 






[RFC][patch] improve scev for array element

2012-01-05 Thread Jiangning Liu
This code change intends to improve scev for array elements and reduce the
common subexpressions in loops that may be introduced by multiple
references to the expression &a[i]. In the end the register pressure in the
loop may be reduced. The test case is simplified from a real benchmark, and
only wants to show the GIMPLE-level changes.

Any suggestions?

diff --git a/gcc/testsuite/gcc.dg/scev-1.c b/gcc/testsuite/gcc.dg/scev-1.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/scev-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+   int i;
+
+   for (i=k; i<1000; i+=k) {
+   a_p = &a[i];
+   *a_p = 100;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..de89fc4
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,6 +1712,42 @@ interpret_rhs_expr (struct loop *loop, gimple
at_stmt,
   switch (code)
 {
 case ADDR_EXPR:
+  if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF)
+{
+ tree array_ref;
+ tree var_decl, base, offset;
+ tree low_bound;
+ tree unit_size;
+ tree index;
+
+ array_ref = TREE_OPERAND (rhs1, 0);
+ var_decl = TREE_OPERAND (array_ref, 0);
+ index = TREE_OPERAND (array_ref, 1);
+
+ low_bound = array_ref_low_bound (array_ref);
+ unit_size = array_ref_element_size (array_ref);
+
+ /* We assume all arrays have sizes that are a multiple of a byte.
+First subtract the lower bound, if any, in the type of the
+index, then convert to sizetype and multiply by the size of
+the array element.  */
+ if (! integer_zerop (low_bound))
+   index = fold_build2 (MINUS_EXPR, TREE_TYPE (index),
+index, low_bound);
+
+ offset = size_binop (MULT_EXPR,
+  fold_convert (sizetype, index),
+  unit_size);
+
+ base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), var_decl);
+ chrec1 = analyze_scalar_evolution (loop, base);
+ chrec2 = analyze_scalar_evolution (loop, offset);
+  chrec1 = chrec_convert (type, chrec1, at_stmt);
+  chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
+ res = chrec_fold_plus (type, chrec1, chrec2);
+ break;
+}
+
   /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.  */
   if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
{

Another bigger case may be like the one below, where lower register
pressure could be observed.

int a[512] ;
int b[512] ;
int *ax ;
int *bx ;
int *ay ;
int *by ;
int i ;
int j ;

int f(int k)
{
for( j = 0 ; j < k ; j++)
{
for( i = j ; i < 512 ; i += k)
{
ax = &a[i+k] ;
bx = &b[i+k] ;
ay = &a[i] ;
by = &b[i] ;

*ax = 7 ;
*bx = 7 ;

a[i] = 8;
b[i] = 8;
}
}
}

Thanks,
-Jiangning






RE: A case exposing code sink issue

2011-12-29 Thread Jiangning Liu


 -Original Message-
 From: Jiangning Liu
 Sent: Wednesday, December 28, 2011 5:38 PM
 To: Jiangning Liu; 'Richard Guenther'
 Cc: Michael Matz; gcc@gcc.gnu.org
 Subject: RE: A case exposing code sink issue
 
 
 
  -Original Message-
  From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
 Of
  Jiangning Liu
  Sent: Tuesday, December 27, 2011 5:10 PM
  To: 'Richard Guenther'
  Cc: Michael Matz; gcc@gcc.gnu.org
  Subject: RE: A case exposing code sink issue
 
  
   The job to do this is final value replacement, not sinking (we do
 not
   sink non-invariant expressions - you'd have to translate them
 through
   the loop-closed SSA exit PHI node, certainly doable, patches
   welcome ;)).
  
 
  Richard,
 
  In final value replacement, expression &a + D. can be figured out,
  while &a[i_xxx] failed to be CHRECed, so I'm wondering if we should
  lower &a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC
  intends to keep &a[i_xxx] until cfgexpand pass. Or we have to directly
  modify the CHREC algorithm to get it calculated?
 
  Appreciate your kind help in advance!
 
 
 Richard,
 
 Now I have a patch working for the case of step i++, by directly
 modifying the scalar evolution algorithm. The following code would be
 generated after SCCP:
    # i_13 = PHI <i_6(7), k_2(D)(4)>
    a_p.0_4 = &a[i_13];
    MEM[(int *)&a][i_13] = 100;
    i_6 = i_13 + 1;
    if (i_6 <= 999)
      goto <bb 7>;
    else
      goto <bb 6>;
 
  <bb 6>:
    a_p_lsm.5_11 = MEM[(void *)&a + 3996B];
    a_p = a_p_lsm.5_11;
    goto <bb 3>;
 
 It looks good, but I still have a problem when the step is i+=k.
 
 For this case the value of variable i on exiting the loop isn't
 invariant, so the algorithm below in scalar evolution doesn't work on it:
 
 compute_overall_effect_of_inner_loop()
 {
   ...
 tree nb_iter = number_of_latch_executions (inner_loop);
 
 if (nb_iter == chrec_dont_know)
   return chrec_dont_know;
 else
   {
 tree res;
 
 /* evolution_fn is the evolution function in LOOP.  Get
its value in the nb_iter-th iteration.  */
 res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);
 
 if (chrec_contains_symbols_defined_in_loop (res, loop->num))
   res = instantiate_parameters (loop, res);
 
 /* Continue the computation until ending on a parent of LOOP.
 */
 return compute_overall_effect_of_inner_loop (loop, res);
   }
 }
 
 In theory, we can still have the transformation like below even if the
 step
 is i+=k,
 
    # i_13 = PHI <i_6(7), k_2(D)(4)>
    i_14 = i_13;
    a_p.0_4 = &a[i_13];
    MEM[(int *)&a][i_13] = 100;
    i_6 = i_13 + k_2(D); // i+=k
    if (i_6 <= 999)
      goto <bb 7>;
    else
      goto <bb 6>;
 
  <bb 6>:
    a_p_lsm.5_11 = &a[i_14];
    a_p = a_p_lsm.5_11;
    goto <bb 3>;
 
 But I realize this is not loop-closed SSA form at all, because i_14 is
 being used out of the loop. Where could we extend the live range of
 variable i in the GCC infrastructure and finally solve this problem?
 

It seems many people are still enjoying the upcoming 2012 New Year. :-)

Following my story, I find the following code in tree-ssa-copy.c

  /* Avoid copy propagation from an inner into an outer loop.
 Otherwise, this may move loop variant variables outside of
 their loops and prevent coalescing opportunities.  If the
 value was loop invariant, it will be hoisted by LICM and
 exposed for copy propagation.
 ???  The value will be always loop invariant.
 In loop-closed SSA form do not copy-propagate through
 PHI nodes in blocks with a loop exit edge predecessor.  */
  if (current_loops
      && TREE_CODE (arg_value) == SSA_NAME
      && (loop_depth_of_name (arg_value) > loop_depth_of_name (lhs)
	  || (loops_state_satisfies_p (LOOP_CLOSED_SSA)
	      && loop_exit_edge_p (e->src->loop_father, e))))
{
  phi_val.value = lhs;
  break;
}

Here, in http://gcc.gnu.org/ml/gcc-patches/2006-12/msg00066.html, Dan said:

The original check was not because of coalescing, but because we would
copy prop in-loop variables outside the loop, causing *more*
invariantness in nested loops.

Can anybody give me a concrete example to explain this statement?

Anyway, for my simple case, I don't see anything bad happening when
propagating &a[i] out of the loop. Also, after this propagation, a_p.0_4 =
&a[i_13]; within the loop would be dead code and removed in the passes
afterwards. In my opinion, that way the computation in the loop would be
reduced. Did I make any mistake?

  Thanks,
  -Jiangning
 
 
 
 
 






RE: A case exposing code sink issue

2011-12-28 Thread Jiangning Liu


 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 Jiangning Liu
 Sent: Tuesday, December 27, 2011 5:10 PM
 To: 'Richard Guenther'
 Cc: Michael Matz; gcc@gcc.gnu.org
 Subject: RE: A case exposing code sink issue
 
 
  The job to do this is final value replacement, not sinking (we do not
  sink non-invariant expressions - you'd have to translate them through
  the loop-closed SSA exit PHI node, certainly doable, patches
  welcome ;)).
 
 
 Richard,
 
 In final value replacement, expression &a + D. can be figured out,
 while &a[i_xxx] failed to be CHRECed, so I'm wondering if we should
 lower &a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC intends to
 keep &a[i_xxx] until cfgexpand pass. Or we have to directly modify the
 CHREC algorithm to get it calculated?
 
 Appreciate your kind help in advance!
 

Richard,

Now I have a patch working for the case of step i++, by directly modifying
the scalar evolution algorithm. The following code would be generated after
SCCP:
  # i_13 = PHI <i_6(7), k_2(D)(4)>
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + 1;
  if (i_6 <= 999)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 6>:
  a_p_lsm.5_11 = MEM[(void *)&a + 3996B];
  a_p = a_p_lsm.5_11;
  goto <bb 3>;

It looks good, but I still have a problem when the step is i+=k.

For this case the value of variable i on exiting the loop isn't invariant,
so the algorithm below in scalar evolution doesn't work on it:

compute_overall_effect_of_inner_loop()
{
  ...
  tree nb_iter = number_of_latch_executions (inner_loop);

  if (nb_iter == chrec_dont_know)
return chrec_dont_know;
  else
{
  tree res;

  /* evolution_fn is the evolution function in LOOP.  Get
 its value in the nb_iter-th iteration.  */
  res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);

  if (chrec_contains_symbols_defined_in_loop (res, loop->num))
res = instantiate_parameters (loop, res);

  /* Continue the computation until ending on a parent of LOOP.
*/
  return compute_overall_effect_of_inner_loop (loop, res);
}
}
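
Spelling out the arithmetic (my own sketch, not GCC code): for the loop
"for (i = k; i < 1000; i += k)" with k > 0, the trip count involves a
runtime division, which is exactly what the niter analyzer refuses to
model for a symbolic step:

/* Sketch: the body runs for i = k, 2k, ..., m*k with m*k <= 999.
   k > 0 is assumed.  */
static unsigned
latch_executions (unsigned k)
{
  unsigned body_execs = 999u / k;   /* floor(999/k) body executions */
  return body_execs - 1;            /* what nb_iter would have to be */
}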

In theory, we can still have the transformation like below even if the step
is i+=k,

  # i_13 = PHI <i_6(7), k_2(D)(4)>
  i_14 = i_13;
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + k_2(D); // i+=k
  if (i_6 <= 999)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 6>:
  a_p_lsm.5_11 = &a[i_14];
  a_p = a_p_lsm.5_11;
  goto <bb 3>;

But I realize this is not loop-closed SSA form at all, because i_14 is
being used out of the loop. Where could we extend the live range of variable
i in the GCC infrastructure and finally solve this problem?

 Thanks,
 -Jiangning
 
 
 






RE: A case exposing code sink issue

2011-12-27 Thread Jiangning Liu
 
 The job to do this is final value replacement, not sinking (we do not
 sink non-invariant expressions - you'd have to translate them through
 the loop-closed SSA exit PHI node, certainly doable, patches
 welcome ;)).
 

Richard,

In final value replacement, expression &a + D. can be figured out,
while &a[i_xxx] failed to be CHRECed, so I'm wondering if we should lower
&a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC intends to keep
&a[i_xxx] until cfgexpand pass. Or we have to directly modify the CHREC
algorithm to get it calculated?

Appreciate your kind help in advance!

Thanks,
-Jiangning





RE: A case exposing code sink issue

2011-12-22 Thread Jiangning Liu
 Yes, the number of iterations of the i loop simply is too difficult for
 our loop iteration calculator to comprehend:
 
    for (i=k; i<500; i+=k)
 
 iterates for roundup((500-k)/k) time.  In particular if the step is
 non-constant our nr-of-iteration calculator gives up.
 

I'm trying to give an even smaller case,

int a[512] ;
int *a_p ;

int f(int k)
{
int i ;

for(i=0; i<k; i++)
{
a_p = &a[i] ;
*a_p = 7 ;
}
}

For this case, we have a very simple loop step i++, then we would have the
GIMPLE before expand like below,

<bb 5>:
  # i_13 = PHI <i_6(5), 0(4)>
  # ivtmp.10_9 = PHI <ivtmp.10_1(5), ivtmp.10_15(4)>
  a_p_lsm.6_4 = &a[i_13];
  ivtmp.10_1 = ivtmp.10_9 + 4;
  D.4085_16 = (void *) ivtmp.10_1;
  MEM[base: D.4085_16, offset: 0B] = 7;
  i_6 = i_13 + 1;
  if (i_6 != k_3(D))
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 6>:
  # a_p_lsm.6_11 = PHI <a_p_lsm.6_4(5)>
  a_p = a_p_lsm.6_11;
  goto <bb 3>;

Why can't we still sink &a[i_13] out of the loop? For example, I expect to
generate code like below:

<bb 5>:
  # i_13 = PHI <i_6(5), 0(4)>
  # ivtmp.10_9 = PHI <ivtmp.10_1(5), ivtmp.10_15(4)>
  i_14 = i_13;
  ivtmp.10_1 = ivtmp.10_9 + 4;
  D.4085_16 = (void *) ivtmp.10_1;
  MEM[base: D.4085_16, offset: 0B] = 7;
  i_6 = i_13 + 1;
  if (i_6 != k_3(D))
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 6>:
  # a_p_lsm.6_11 = PHI <a_p_lsm.6_4(5)>
  a_p_lsm.6_4 = &a[i_14];
  a_p = a_p_lsm.6_11;
  goto <bb 3>;

This way the computation of &a[i] would be saved within the loop.

Any idea?

Thanks,
-Jiangning





RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)

2011-12-20 Thread Jiangning Liu

 -Original Message-
 From: Richard Henderson [mailto:r...@redhat.com]
 Sent: Tuesday, November 22, 2011 9:55 AM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org; 'Richard Guenther'
 Subject: Re: [RFC] Use REG_EXPR in back-end (introduced by optimization
 to conditional and/or in ARM back-end)
 
 On 11/21/2011 05:31 PM, Jiangning Liu wrote:
  My question essentially is "May I really use REG_EXPR in back-end
  code?" like the patch I gave below?
 
 I suppose.
 
 Another alternative is to use BImode for booleans.  Dunno how much of
 that you'd be able to glean from mere rtl expansion or if you'd need
 boolean decls to be expanded with BImode.

Hi Richard,

The output of the expand pass requires that the operands meet the requirement
of general_operand for binary operations, i.e. all RTX operations on the top
level must meet the hardware instruction requirement. Generally, for hardware
not directly supporting a single-bit logic operation, we can't generate a
BImode rtx dest.

So if I simply generate BImode for NE in the expand pass, like (subreg:SI
(ne:BI (reg:SI xxx) (const_int 0))), there would be an assertion failure due
to failing to find an appropriate instruction code to operate on a single bit.

This looks like a GCC design level issue. How would you suggest generating a
legitimate rtx expression containing BImode?

Thanks,
-Jiangning

 
 The latter would probably need a target hook.  I've often wondered how
 much ia64 would benefit from that too, being able to store bool variables
 directly in predicate registers.
 
 All of this almost certainly must wait until stage1 opens up again
 though...
 
 
 r~






RE: A case exposing code sink issue

2011-12-01 Thread Jiangning Liu


 -Original Message-
 From: Michael Matz [mailto:m...@suse.de]
 Sent: Monday, November 28, 2011 9:07 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: RE: A case exposing code sink issue
 
 Hi,
 
 On Mon, 28 Nov 2011, Jiangning Liu wrote:
 
One more question...
   
   Can "i = i.6_18;" be sunk out of the loop, because it doesn't have
   memory dependence with others?
  
   With current trunk the stores to i, a_p, b_p and k are sunken after
 the
   loop.  (There are no aliasing problems because the decls can't
   conflict).
  
   What isn't sunken is the calculation of the &a[D.2248_7] expression.
   First, the number of iterations of the inner loop can't be
 determined
   by
   current code (replacing i+=k with e.g. i++ could be handled for
   instance).
 
  Hi Michael,
 
  Do you know what the essential problem is in the case of loop
 iteration
  uncertainty?
 
 Yes, the number of iterations of the i loop simply is too difficult for
 our loop iteration calculator to comprehend:
 
    for (i=k; i<500; i+=k)
 
 iterates for roundup((500-k)/k) time.  In particular if the step is
 non-constant our nr-of-iteration calculator gives up.

So do you think this can be improved somewhere?

For this case, looking at the result in the middle end, a_p.2_8 =
&a[D.2248_7]; should be able to be sunk out of the loop. That way the
computation of &a[D.2248_7] would be saved in the loop, although the
consequence is that the live range of D.2248_7 is longer and it needs to
live out of the loop. But anyway the register pressure within the loop would
be decreased, and we would be less likely to have spill/fill code. This is
what I want.

I think we can simply use loop induction variable analysis to solve this
problem. Do you think so?

Thanks,
-Jiangning

 
  I thought it was still an aliasing problem.
 
 No.  All accesses are resolved to final objects (i.e. no pointers), and
 hence can be trivially disambiguated.
 
 
 Ciao,
 Michael.






RE: A case exposing code sink issue

2011-11-27 Thread Jiangning Liu


 -Original Message-
 From: Michael Matz [mailto:m...@suse.de]
 Sent: Friday, November 25, 2011 11:23 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: RE: A case exposing code sink issue
 
 Hi,
 
 On Thu, 24 Nov 2011, Jiangning Liu wrote:
 
  One more question...
 
  Can "i = i.6_18;" be sunk out of the loop, because it doesn't have
  memory dependence with others?
 
 With current trunk the stores to i, a_p, b_p and k are sunken after the
 loop.  (There are no aliasing problems because the decls can't
 conflict).
 
 What isn't sunken is the calculation of the &a[D.2248_7] expression.
 First, the number of iterations of the inner loop can't be determined
 by
 current code (replacing i+=k with e.g. i++ could be handled for
 instance).

Hi Michael,

Do you know what the essential problem is in the case of loop iteration
uncertainty? I thought it was still an aliasing problem.

Thanks,
-Jiangning

 Then this code could be handled by final value replacement, but isn't
 because interpret_rhs_expr doesn't deal with ADDR_EXPR of ARRAY_REFs.
 
 
 Ciao,
 Michael.






A case exposing code sink issue

2011-11-23 Thread Jiangning Liu
Hi,

For this small test case,

int a[512] ;
int b[512] ;
int *a_p ;
int *b_p ;
int i ;
int k ;

int f(void)
{
for( k = 1 ; k <= 9 ; k++)
{
for( i = k ; i < 512 ; i += k)
{
a_p = &a[i + (1<<k)] ;
b_p = &b[i + (1<<k)] ;
*a_p = 7 ;
*b_p = 7 ;
}
}
}

Before sink pass we have,

f ()
{
  int pretmp.11;
  int k.7;
  int i.6;
  int * b_p.3;
  int * a_p.2;
  int D.2248;
  int i.1;
  int D.2246;
  int k.0;

<bb 2>:
  k = 1;

<bb 3>:
  # k.0_9 = PHI <k.7_20(11), 1(2)>
  i = k.0_9;
  if (k.0_9 <= 511)
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 8>:
Invalid sum of incoming frequencies 900, should be 81
  goto <bb 5>;

<bb 7>:
  pretmp.11_19 = 1 << k.0_9;

<bb 4>:
  # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
  D.2246_5 = pretmp.11_19;
  D.2248_7 = i.1_34 + D.2246_5;
  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;
  MEM[(int *)&a][D.2248_7] = 7;
  MEM[(int *)&b][D.2248_7] = 7;
  i.6_18 = k.0_9 + i.1_34;
  i = i.6_18;
  if (i.6_18 <= 511)
    goto <bb 9>;
  else
    goto <bb 8>;

<bb 9>:
  goto <bb 4>;

<bb 5>:
Invalid sum of incoming frequencies 81, should be 900
  k.7_20 = k.0_9 + 1;
  k = k.7_20;
  if (k.7_20 <= 9)
    goto <bb 11>;
  else
    goto <bb 6>;

<bb 11>:
  goto <bb 3>;

<bb 6>:
  return;

}

Can the following statements be sunk out of the loop? I don't see this
optimization happen in trunk. The consequence is increased register
pressure, and a spill/fill occurs in RA.

  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;

I know the sink would happen in the sink pass if a_p and b_p were local
variables.
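
For reference, the local-variable variant I mean is sketched below (my own
code, reusing the globals above): ap and bp become plain SSA names, so the
per-iteration global stores "a_p = ...;" and "b_p = ...;" simply do not
exist in the loop, and only the memory stores remain.

int f2 (void)
{
  int *ap, *bp;

  for (k = 1; k <= 9; k++)
    for (i = k; i < 512; i += k)
      {
        ap = &a[i + (1 << k)];
        bp = &b[i + (1 << k)];
        *ap = 7;
        *bp = 7;
      }
  return 0;
}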

If this is the root cause, which optimization pass in GCC takes the role of
sinking them out of the loop? How should we get it fixed?

Thanks,
-Jiangning





RE: A case exposing code sink issue

2011-11-23 Thread Jiangning Liu


 -Original Message-
 From: Andrew Pinski [mailto:pins...@gmail.com]
 Sent: Thursday, November 24, 2011 12:15 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: Re: A case exposing code sink issue
 
 On Wed, Nov 23, 2011 at 8:05 PM, Jiangning Liu jiangning@arm.com
 wrote:
  If this is the root cause, which optimization pass in GCC take the
 role to
  sink them out of loop? How should we get it fixed?
 
 lim1 handles the case just fine for me.  lim1 is the first loop pass.
 
 After lim1 I get:
 
 <bb 4>:
   # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
   D.2934_5 = pretmp.11_33;
   D.2936_7 = i.1_34 + D.2934_5;
   a_p.2_8 = &a[D.2936_7];
   a_p_lsm.13_37 = a_p.2_8;
   b_p.3_13 = &b[D.2936_7];
   b_p_lsm.14_38 = b_p.3_13;
   MEM[(int *)&a][D.2936_7] = 7;
   MEM[(int *)&b][D.2936_7] = 7;
   i.6_18 = k.0_9 + i.1_34;
   i_lsm.12_39 = i.6_18;
   if (i.6_18 <= 511)
     goto <bb 9>;
   else
     goto <bb 8>;
 
 <bb 9>:
   goto <bb 4>;
 

&a[D.2936_7] and &b[D.2936_7] are not loop invariants, so it seems lim1
shouldn't be able to sink them, right? Do I misunderstand this optimization?

Thanks,
-Jiangning

 
 
 
 Thanks,
 Andrew Pinski






RE: A case exposing code sink issue

2011-11-23 Thread Jiangning Liu
Sorry, I realize we can't do that optimization, because a_p may have a
dependence upon other memory accesses like MEM[(int *)&a][D.2248_7].

For example, if it happened that &a[D.2248_7] equals &a_p, that optimization
would be wrong.

But can alias analysis solve the problem if we can guarantee (i+(1<<k)) is
less than the upper bound of array a's definition?

Or is there any GCC command line switch assuming no array bound overflow?
That way we can do more aggressive optimizations, right?

Thanks,
-Jiangning

 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 Jiangning Liu
 Sent: Thursday, November 24, 2011 12:05 PM
 To: gcc@gcc.gnu.org
 Subject: A case exposing code sink issue
 
 Hi,
 
 For this small test case,
 
 int a[512] ;
 int b[512] ;
 int *a_p ;
 int *b_p ;
 int i ;
 int k ;
 
 int f(void)
 {
 for( k = 1 ; k <= 9 ; k++)
 {
 for( i = k ; i < 512 ; i += k)
 {
 a_p = &a[i + (1<<k)] ;
 b_p = &b[i + (1<<k)] ;
 *a_p = 7 ;
 *b_p = 7 ;
 }
 }
 }
 
 Before sink pass we have,
 
 f ()
 {
   int pretmp.11;
   int k.7;
   int i.6;
   int * b_p.3;
   int * a_p.2;
   int D.2248;
   int i.1;
   int D.2246;
   int k.0;
 
 <bb 2>:
   k = 1;
 
 <bb 3>:
   # k.0_9 = PHI <k.7_20(11), 1(2)>
   i = k.0_9;
   if (k.0_9 <= 511)
     goto <bb 7>;
   else
     goto <bb 8>;
 
 <bb 8>:
 Invalid sum of incoming frequencies 900, should be 81
   goto <bb 5>;
 
 <bb 7>:
   pretmp.11_19 = 1 << k.0_9;
 
 <bb 4>:
   # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
   D.2246_5 = pretmp.11_19;
   D.2248_7 = i.1_34 + D.2246_5;
   a_p.2_8 = &a[D.2248_7];
   a_p = a_p.2_8;
   b_p.3_13 = &b[D.2248_7];
   b_p = b_p.3_13;
   MEM[(int *)&a][D.2248_7] = 7;
   MEM[(int *)&b][D.2248_7] = 7;
   i.6_18 = k.0_9 + i.1_34;
   i = i.6_18;
   if (i.6_18 <= 511)
     goto <bb 9>;
   else
     goto <bb 8>;
 
 <bb 9>:
   goto <bb 4>;
 
 <bb 5>:
 Invalid sum of incoming frequencies 81, should be 900
   k.7_20 = k.0_9 + 1;
   k = k.7_20;
   if (k.7_20 <= 9)
     goto <bb 11>;
   else
     goto <bb 6>;
 
 <bb 11>:
   goto <bb 3>;
 
 <bb 6>:
   return;
 
 }
 
 Can the following statements be sunk out of the loop? I don't see this
 optimization happen in trunk. The consequence is increased register
 pressure, and a spill/fill occurs in RA.
 
   a_p.2_8 = &a[D.2248_7];
   a_p = a_p.2_8;
   b_p.3_13 = &b[D.2248_7];
   b_p = b_p.3_13;
 
 I know the sink would happen in the sink pass if a_p and b_p were local
 variables.
 
 If this is the root cause, which optimization pass in GCC takes the role
 of sinking them out of the loop? How should we get it fixed?
 
 Thanks,
 -Jiangning
 
 
 






RE: A case exposing code sink issue

2011-11-23 Thread Jiangning Liu
One more question...

Can "i = i.6_18;" be sunk out of the loop, because it doesn't have memory
dependence with others?

Thanks,
-Jiangning

 -Original Message-
 From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
 Jiangning Liu
 Sent: Thursday, November 24, 2011 2:57 PM
 To: gcc@gcc.gnu.org
 Subject: RE: A case exposing code sink issue
 
 Sorry, I realize we can't do that optimization, because a_p may have a
 dependence upon other memory accesses like MEM[(int *)&a][D.2248_7].
 
 For example, if it happened that &a[D.2248_7] equals &a_p, that
 optimization would be wrong.
 
 But can alias analysis solve the problem if we can guarantee (i+(1<<k))
 is less than the upper bound of array a's definition?
 
 Or is there any GCC command line switch assuming no array bound
 overflow?
 That way we can do more aggressive optimizations, right?
 
 Thanks,
 -Jiangning
 
  -Original Message-
  From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
 Of
  Jiangning Liu
  Sent: Thursday, November 24, 2011 12:05 PM
  To: gcc@gcc.gnu.org
  Subject: A case exposing code sink issue
 
  Hi,
 
  For this small test case,
 
  int a[512] ;
  int b[512] ;
  int *a_p ;
  int *b_p ;
  int i ;
  int k ;
 
  int f(void)
  {
  for( k = 1 ; k <= 9 ; k++)
  {
  for( i = k ; i < 512 ; i += k)
  {
  a_p = &a[i + (1<<k)] ;
  b_p = &b[i + (1<<k)] ;
  *a_p = 7 ;
  *b_p = 7 ;
  }
  }
  }
 
  Before sink pass we have,
 
  f ()
  {
int pretmp.11;
int k.7;
int i.6;
int * b_p.3;
int * a_p.2;
int D.2248;
int i.1;
int D.2246;
int k.0;
 
  <bb 2>:
    k = 1;
 
  <bb 3>:
    # k.0_9 = PHI <k.7_20(11), 1(2)>
    i = k.0_9;
    if (k.0_9 <= 511)
      goto <bb 7>;
    else
      goto <bb 8>;
 
  <bb 8>:
  Invalid sum of incoming frequencies 900, should be 81
    goto <bb 5>;
 
  <bb 7>:
    pretmp.11_19 = 1 << k.0_9;
 
  <bb 4>:
    # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
    D.2246_5 = pretmp.11_19;
    D.2248_7 = i.1_34 + D.2246_5;
    a_p.2_8 = &a[D.2248_7];
    a_p = a_p.2_8;
    b_p.3_13 = &b[D.2248_7];
    b_p = b_p.3_13;
    MEM[(int *)&a][D.2248_7] = 7;
    MEM[(int *)&b][D.2248_7] = 7;
    i.6_18 = k.0_9 + i.1_34;
    i = i.6_18;
    if (i.6_18 <= 511)
      goto <bb 9>;
    else
      goto <bb 8>;
 
  <bb 9>:
    goto <bb 4>;
 
  <bb 5>:
  Invalid sum of incoming frequencies 81, should be 900
    k.7_20 = k.0_9 + 1;
    k = k.7_20;
    if (k.7_20 <= 9)
      goto <bb 11>;
    else
      goto <bb 6>;
 
  <bb 11>:
    goto <bb 3>;
 
  <bb 6>:
    return;
 
  }
 
  Can the following statements be sunk out of the loop? I don't see this
  optimization happen in trunk. The consequence is increased register
  pressure, and a spill/fill occurs in RA.
 
    a_p.2_8 = &a[D.2248_7];
    a_p = a_p.2_8;
    b_p.3_13 = &b[D.2248_7];
    b_p = b_p.3_13;
 
  I know the sink would happen in the sink pass if a_p and b_p were local
  variables.
 
  If this is the root cause, which optimization pass in GCC takes the role
  of sinking them out of the loop? How should we get it fixed?
 
  Thanks,
  -Jiangning
 
 
 
 
 
 
 






RE: [RFC] Optimization to conditional and/or in ARM back-end

2011-11-22 Thread Jiangning Liu


 -Original Message-
 From: Andrew Pinski [mailto:pins...@gmail.com]
 Sent: Tuesday, November 22, 2011 1:14 PM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org; Richard Guenther; Richard Henderson
 Subject: Re: [RFC] Optimization to conditional and/or in ARM back-end
 
 On Sun, Nov 20, 2011 at 6:17 PM, Jiangning Liu jiangning@arm.com
 wrote:
  Hi,
 
  This patch is to implement a peephole like optimization in ARM back-
 end.
 
  If we have an if condition expression like ((r3 != 0)  r1) != 0,
 
 So this is the same as:
 int f1(int r1, int r3)
 {
   if (((r3 != 0) && r1) != 0)
 return g();
   return 1;
 }
 --- CUT ---
 Can't you do this instead:
 Combine the following two instructions:
 (insn 17 15 18 2 (parallel [
 (set (reg:SI 150)
 (and:SI (ne:SI (reg:SI 0 r0 [ r1 ])
 (const_int 0 [0]))
 (reg:SI 1 r1 [ r3 ])))
 (clobber (reg:CC 24 cc))
 ]) t24.c:11 274 {*cond_arith}
  (expr_list:REG_UNUSED (reg:CC 24 cc)
 (expr_list:REG_DEAD (reg:SI 1 r1 [ r3 ])
 (expr_list:REG_DEAD (reg:SI 0 r0 [ r1 ])
 (nil)
 
 (insn 18 17 19 2 (set (reg:CC 24 cc)
 (compare:CC (reg:SI 150)
 (const_int 0 [0]))) t24.c:11 211 {*arm_cmpsi_insn}
  (expr_list:REG_DEAD (reg:SI 150)
 (nil)))
 
 And then have a pattern which expands it to (note the r3 here is the
 r3 in the function, likewise for r1):
 andi r1, r1, #1
 cmp r3, #0
 it  ne
 cmpne   r1, #0
 
 Yes it is one extra instruction but it is still better than before
 right?

Hi Andrew,

Yes, and even the code generated can be like

cmp r3, #0
itt ne
andine  r1, r1, #1
cmpne   r1, #0

But this is not what I want. For performance purposes, I want to remove that
andi instruction as well.

Also, if we don't modify r1, there might be other optimization
opportunities; for example, register pressure can be reduced in some cases,
I guess.

Thanks,
-Jiangning





RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)

2011-11-21 Thread Jiangning Liu
The original subject doesn't catch the key point, so I changed the subject
to get more people to notice.

My question essentially is "May I really use REG_EXPR in back-end code?"
like the patch I gave below?

Thanks,
-Jiangning

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
 Sent: Monday, November 21, 2011 10:18 AM
 To: gcc-patches@gcc.gnu.org
 Cc: 'Richard Guenther'; Richard Henderson
 Subject: [RFC] Optimization to conditional and/or in ARM back-end
 
 Hi,
 
 This patch is to implement a peephole like optimization in ARM back-end.
 
 If we have an if condition expression like ((r3 != 0) && r1) != 0,
 originally the binary code to be generated is like,
 
 cmp r3, #0
 ite eq
 moveq   r1, #0
 andne   r1, r1, #1
 cmp r1, #0
 
 But actually this expression could be transformed into ((r3 != 0) &&
 (r1 != 0)) != 0, if we know r1 is of bool type. This way we could
 generate new binary code like,
 
 cmp r3, #0
 it  ne
 cmpne   r1, #0
 
 The question is how to judge whether r1 is a bool variable in the
 back-end. I'm using REG_EXPR in function arm_check_logic_with_bool_reg
 to check if r1 is a bool; this function is being invoked in the pattern
 *cond_<code>.
 
 May I really use REG_EXPR this way in GCC back-end code? I posted a
 related
 topic at http://gcc.gnu.org/ml/gcc/2011-11/msg00417.html.
 
 If the answer is no, are there any suggestions about either how to judge
 this bool type or how to implement this optimization?
 
 Appreciate your comments in advance!
 
 Thanks,
 -Jiangning
 
 
 diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
 index 23a29c6..8b12d48
 --- a/gcc/config/arm/arm-protos.h
 +++ b/gcc/config/arm/arm-protos.h
 @@ -114,6 +114,7 @@ extern rtx arm_gen_load_multiple (int *, int, rtx,
 int,
 rtx, HOST_WIDE_INT *);
  extern rtx arm_gen_store_multiple (int *, int, rtx, int, rtx,
 HOST_WIDE_INT
 *);
  extern int arm_gen_movmemqi (rtx *);
  extern enum machine_mode arm_select_cc_mode (RTX_CODE, rtx, rtx);
 +extern bool arm_check_logic_with_bool_reg (RTX_CODE, rtx, rtx);
  extern enum machine_mode arm_select_dominance_cc_mode (rtx, rtx,
  HOST_WIDE_INT);
  extern rtx arm_gen_compare_reg (RTX_CODE, rtx, rtx);
 diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
 index a429c19..e96f24a
 --- a/gcc/config/arm/arm.c
 +++ b/gcc/config/arm/arm.c
 @@ -10910,6 +10910,41 @@ arm_gen_movmemqi (rtx *operands)
return 1;
  }
 
 +/* Check whether the expression rc cmp1 bool_reg can be
 transformed
 +   to use conditional comparison or not.  */
 +bool
 +arm_check_logic_with_bool_reg (RTX_CODE rc, rtx bool_reg, rtx cmp1) {
 +  rtx cmp2;
 +  HOST_WIDE_INT dom_cc;
 +
 +  if (!REG_P (bool_reg))
 +return false;
 +
 +  if (REG_EXPR (bool_reg) == NULL)
 +return false;
 +
 +  if (TREE_CODE (REG_EXPR (bool_reg)) != VAR_DECL)
 +return false;
 +
 +  if (TREE_CODE (TREE_TYPE (REG_EXPR (bool_reg))) != BOOLEAN_TYPE)
 +return false;
 +
 +  cmp2 = gen_rtx_NE (GET_MODE (bool_reg), bool_reg, const0_rtx);
 +
 +  if (rc == AND)
 +dom_cc = DOM_CC_X_AND_Y;
 +  else if (rc == IOR)
 +dom_cc = DOM_CC_X_OR_Y;
 +  else
 +gcc_unreachable ();
 +
 +  if (arm_select_dominance_cc_mode (cmp1, cmp2, dom_cc) == CCmode)
 +return false;
 +
 +  return true;
 +}
 +
 +
  /* Select a dominance comparison mode if possible for a test of the
 general
 form (OP (COND_OR (X) (Y)) (const_int 0)).  We support three forms.
 COND_OR == DOM_CC_X_AND_Y => (X && Y)
 diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
 index a78ba88..e90a78e
 --- a/gcc/config/arm/arm.md
 +++ b/gcc/config/arm/arm.md
 @@ -9172,6 +9172,64 @@
 (set_attr length 4,4,8)]
  )
 
 +; This pattern matches expression like example ((r1 != x) && r2) != 0.
 +; Here r2 is a register with data type boolean.
 +; This pattern can transform to code matching patterns cmp_and.
 +; Likewise, code matching pattern cmp_ior could be generated, if it is ||
 +; rather than && in the example.
 +(define_insn_and_split "*cond_<code>"
 +  [(set (match_operand 0 "cc_register" "")
 +	(compare
 +	 (ior_and:SI
 +	  (match_operator:SI 4 "arm_comparison_operator"
 +	   [(match_operand:SI 2 "s_register_operand" "r,r")
 +	    (match_operand:SI 3 "arm_add_operand" "rI,L")])
 +	  (match_operand:SI 1 "s_register_operand" "r,r"))
 +	 (const_int 0)))]
 +  "TARGET_32BIT
 +   && arm_check_logic_with_bool_reg (<CODE>, operands[1], operands[4])"
 +  "#"
 +  "&& 1"
 +  [(set (match_dup 7)
 +	(compare
 +	 (match_op_dup 5
 +	  [(match_op_dup 4 [(match_dup 2) (match_dup 3)])
 +	   (match_op_dup 6 [(match_dup 1) (const_int 0)])])
 +	 (const_int 0)))]
 +  
 +  {
 +HOST_WIDE_INT dom_cc;
 +
 +operands[6] = gen_rtx_NE (SImode, operands[1], const0_rtx);
 +
 +    if (<CODE> == AND)
 +  {
 +dom_cc = DOM_CC_X_AND_Y

RE: [RFC] Use which_alternative in preparation-statements of define_insn_and_split

2011-11-21 Thread Jiangning Liu


 -Original Message-
 From: Richard Henderson [mailto:r...@redhat.com]
 Sent: Tuesday, November 22, 2011 7:55 AM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [RFC] Use which_alternative in preparation-statements of
 define_insn_and_split
 
 On 11/20/2011 07:34 PM, Jiangning Liu wrote:
  Hi,
 
  I find which_alternative can't really be used in preparation-
 statements of
  define_insn_and_split, so can this be fixed like below?
 
  For example, I want to use which_alternative in the pattern below,
 
  (define_insn_and_split "*thumb2_movsicc_insn"
    [(set (match_operand:SI 0 "s_register_operand" "=r,r,r,r,r,r,r,r")
	  (if_then_else:SI
	   (match_operator 3 "arm_comparison_operator"
	    [(match_operand 4 "cc_register" "") (const_int 0)])
	   (match_operand:SI 1 "arm_not_operand" "0,0,rI,K,rI,rI,K,K")
	   (match_operand:SI 2 "arm_not_operand" "rI,K,0,0,rI,K,rI,K")))]
    "TARGET_THUMB2"
    "@
     it\\t%D3\;mov%D3\\t%0, %2
     it\\t%D3\;mvn%D3\\t%0, #%B2
     it\\t%d3\;mov%d3\\t%0, %1
     it\\t%d3\;mvn%d3\\t%0, #%B1
     ite\\t%d3\;mov%d3\\t%0, %1\;mov%D3\\t%0, %2
     ite\\t%d3\;mov%d3\\t%0, %1\;mvn%D3\\t%0, #%B2
     ite\\t%d3\;mvn%d3\\t%0, #%B1\;mov%D3\\t%0, %2
     ite\\t%d3\;mvn%d3\\t%0, #%B1\;mvn%D3\\t%0, #%B2"
    "&& reload_completed"
    [(cond_exec (match_dup 5)
		(set (match_dup 0) (match_dup 6)))]

{
   if (which_alternative >= 2 && which_alternative < 4)
{
  ...
}
   else if (which_alternative >= 4)
 
 Hmm, I guess.  It seems a bit weird.
 
 It seems like you'd be better off *not* using define_insn_and_split,
 actually, and instead using more specific tests on the separate
 define_split than you would on the define_insn.
 

Yes. That would be alternative solution I can accept. 

But I still want to know why we don't want to support this. I don't see any
GCC documentation saying this usage is not allowed.

Or do you think this change is not safe enough?

Thanks,
-Jiangning

 I don't feel strongly about it though.  I won't object if some other
 rtl maintainer wants to approve this.
 
 
 r~






RE: Is VRP is too conservative to identify boolean value 0 and 1?

2011-11-20 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Friday, September 02, 2011 5:07 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: Re: Is VRP is too conservative to identify boolean value 0 and
 1?
 
 On Fri, Sep 2, 2011 at 7:58 AM, Jiangning Liu jiangning@arm.com
 wrote:
  Hi,
 
  For the following small case,
 
  int f(int i, int j)
  {
       if (i==1 && j==2)
                 return i;
         else
                 return j;
  }
 
  with -O2 option, GCC has vrp2 dump like below,
 
  ==
 
  Value ranges after VRP:
 
  i_1: VARYING
  i_2(D): VARYING
  D.1249_3: [0, +INF]
  j_4(D): VARYING
  D.1250_5: [0, +INF]
  D.1251_6: [0, +INF]
  j_10: [2, 2]  EQUIVALENCES: { j_4(D) } (1 elements)
 
 
  Removing basic block 3
  f (int i, int j)
  {
   _Bool D.1251;
   _Bool D.1250;
   _Bool D.1249;
 
 <bb 2>:
   D.1249_3 = i_2(D) == 1;
   D.1250_5 = j_4(D) == 2;
   D.1251_6 = D.1250_5 & D.1249_3;
   if (D.1251_6 != 0)
     goto <bb 3>;
   else
     goto <bb 4>;

 <bb 3>:

 <bb 4>:
   # i_1 = PHI <1(3), j_4(D)(2)>
   return i_1;
 
  }
 
  
 
 Variable D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so
 their value ranges should be
 
  D.1249_3: [0, 1]
  D.1250_5: [0, 1]
  D.1251_6: [0, 1]
 
  So why current VRP can't find out this value range?
 
 It does - it just prints it as [0, +INF], they are bools with
 TYPE_MAX_VALUE
 == 1 after all.

Richard,

May I use REG_EXPR (the rtx of D.1249_3) in an xxx.md file to detect whether
D.1249_3 is a bool or not?

Some comments in GCC say REG_EXPR may be lost in the back-end. True?

If we do have REG_EXPR info for some cases in the back-end, is it guaranteed
to be correct? May I implement a back-end peephole optimization depending on
REG_EXPR?

Thanks,
-Jiangning

 
 Richard.
 
 
  I'm asking this question because the optimizations in back-end need
 this
  info to do advanced optimization.
 
  Thanks,
  -Jiangning
 
 
 






[RFC] Optimization to conditional and/or in ARM back-end

2011-11-20 Thread Jiangning Liu
Hi,

This patch is to implement a peephole like optimization in ARM back-end.

If we have an if condition expression like ((r3 != 0) && r1) != 0,
originally the binary code to be generated is like,

cmp r3, #0
ite eq
moveq   r1, #0
andne   r1, r1, #1
cmp r1, #0

But actually this expression could be transformed into ((r3 != 0) && (r1 !=
0)) != 0, if we know r1 is of bool type. This way we could generate new
binary code like,

cmp r3, #0
it  ne
cmpne   r1, #0
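
For reference, a source-level shape that yields this pattern could look like
the following (my own hypothetical example, not from the original mail);
here c is _Bool, so the pseudo holding it carries a REG_EXPR of BOOLEAN_TYPE
and is known to be 0 or 1:

/* Hypothetical example; g and h are names of my own choosing.  */
extern void g (void);

void h (int r3, _Bool c)
{
  if (r3 != 0 && c)   /* ((r3 != 0) && c) != 0 */
    g ();
}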

The question is how to judge whether r1 is a bool variable in the back-end.
I'm using REG_EXPR in function arm_check_logic_with_bool_reg to check if r1
is a bool; this function is being invoked in the pattern *cond_<code>.

May I really use REG_EXPR this way in GCC back-end code? I posted a related
topic at http://gcc.gnu.org/ml/gcc/2011-11/msg00417.html.

If the answer is no, are there any suggestions about either how to judge
this bool type or how to implement this optimization?

Appreciate your comments in advance!

Thanks,
-Jiangning


diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..8b12d48
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -114,6 +114,7 @@ extern rtx arm_gen_load_multiple (int *, int, rtx, int,
rtx, HOST_WIDE_INT *);
 extern rtx arm_gen_store_multiple (int *, int, rtx, int, rtx, HOST_WIDE_INT
*);
 extern int arm_gen_movmemqi (rtx *);
 extern enum machine_mode arm_select_cc_mode (RTX_CODE, rtx, rtx);
+extern bool arm_check_logic_with_bool_reg (RTX_CODE, rtx, rtx);
 extern enum machine_mode arm_select_dominance_cc_mode (rtx, rtx,
   HOST_WIDE_INT);
 extern rtx arm_gen_compare_reg (RTX_CODE, rtx, rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a429c19..e96f24a
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10910,6 +10910,41 @@ arm_gen_movmemqi (rtx *operands)
   return 1;
 }
 
+/* Check whether the expression rc cmp1 bool_reg can be transformed
+   to use conditional comparison or not.  */
+bool
+arm_check_logic_with_bool_reg (RTX_CODE rc, rtx bool_reg, rtx cmp1) {
+  rtx cmp2;
+  HOST_WIDE_INT dom_cc;
+
+  if (!REG_P (bool_reg))
+return false;
+
+  if (REG_EXPR (bool_reg) == NULL)
+return false;
+
+  if (TREE_CODE (REG_EXPR (bool_reg)) != VAR_DECL)
+return false;
+
+  if (TREE_CODE (TREE_TYPE (REG_EXPR (bool_reg))) != BOOLEAN_TYPE)
+return false;
+
+  cmp2 = gen_rtx_NE (GET_MODE (bool_reg), bool_reg, const0_rtx);
+
+  if (rc == AND)
+dom_cc = DOM_CC_X_AND_Y;
+  else if (rc == IOR)
+dom_cc = DOM_CC_X_OR_Y;
+  else
+gcc_unreachable ();
+
+  if (arm_select_dominance_cc_mode (cmp1, cmp2, dom_cc) == CCmode)
+return false;
+
+  return true;
+}
+
+
 /* Select a dominance comparison mode if possible for a test of the general
form (OP (COND_OR (X) (Y)) (const_int 0)).  We support three forms.
COND_OR == DOM_CC_X_AND_Y => (X && Y)
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a78ba88..e90a78e
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -9172,6 +9172,64 @@
(set_attr length 4,4,8)]
 )
 
+; This pattern matches expression like example ((r1 != x) && r2) != 0.
+; Here r2 is a register with data type boolean.
+; This pattern can transform to code matching patterns cmp_and.
+; Likewise, code matching pattern cmp_ior could be generated, if it is ||
+; rather than && in the example.
+(define_insn_and_split "*cond_<code>"
+  [(set (match_operand 0 "cc_register" "")
+	(compare
+	 (ior_and:SI
+	  (match_operator:SI 4 "arm_comparison_operator"
+	   [(match_operand:SI 2 "s_register_operand" "r,r")
+	    (match_operand:SI 3 "arm_add_operand" "rI,L")])
+	  (match_operand:SI 1 "s_register_operand" "r,r"))
+	 (const_int 0)))]
+  "TARGET_32BIT
+   && arm_check_logic_with_bool_reg (<CODE>, operands[1], operands[4])"
+  "#"
+  "&& 1"
+  [(set (match_dup 7)
+	(compare
+	 (match_op_dup 5
+	  [(match_op_dup 4 [(match_dup 2) (match_dup 3)])
+	   (match_op_dup 6 [(match_dup 1) (const_int 0)])])
+	 (const_int 0)))]
+  
+  {
+HOST_WIDE_INT dom_cc;
+
+operands[6] = gen_rtx_NE (SImode, operands[1], const0_rtx);
+
+    if (<CODE> == AND)
+      {
+        dom_cc = DOM_CC_X_AND_Y;
+        operands[5] = gen_rtx_fmt_ee (<CODE>, SImode,
+                                      operands[4], operands[6]);
+      }
+    else if (<CODE> == IOR)
+      {
+        dom_cc = DOM_CC_X_OR_Y;
+        operands[5] = gen_rtx_fmt_ee (<CODE>, SImode,
+                                      operands[4], operands[6]);
+      }
+else
+  gcc_unreachable ();
+
+operands[7]
+  = gen_rtx_REG (arm_select_dominance_cc_mode (operands[4],
+   operands[6],
+  dom_cc),
+CC_REGNUM);
+  }
+  [(set_attr "conds" "clob")
+   (set (attr "length")
+   (if_then_else 

[RFC] Use which_alternative in preparation-statements of define_insn_and_split

2011-11-20 Thread Jiangning Liu
Hi,

I find which_alternative can't really be used in preparation-statements of
define_insn_and_split, so can this be fixed like below?

For example, I want to use which_alternative in the pattern below,

(define_insn_and_split "*thumb2_movsicc_insn"
  [(set (match_operand:SI 0 "s_register_operand" "=r,r,r,r,r,r,r,r")
	(if_then_else:SI
	 (match_operator 3 "arm_comparison_operator"
	  [(match_operand 4 "cc_register" "") (const_int 0)])
	 (match_operand:SI 1 "arm_not_operand" "0,0,rI,K,rI,rI,K,K")
	 (match_operand:SI 2 "arm_not_operand" "rI,K,0,0,rI,K,rI,K")))]
  "TARGET_THUMB2"
  "@
   it\\t%D3\;mov%D3\\t%0, %2
   it\\t%D3\;mvn%D3\\t%0, #%B2
   it\\t%d3\;mov%d3\\t%0, %1
   it\\t%d3\;mvn%d3\\t%0, #%B1
   ite\\t%d3\;mov%d3\\t%0, %1\;mov%D3\\t%0, %2
   ite\\t%d3\;mov%d3\\t%0, %1\;mvn%D3\\t%0, #%B2
   ite\\t%d3\;mvn%d3\\t%0, #%B1\;mov%D3\\t%0, %2
   ite\\t%d3\;mvn%d3\\t%0, #%B1\;mvn%D3\\t%0, #%B2"
  "&& reload_completed"
  [(cond_exec (match_dup 5)
	      (set (match_dup 0) (match_dup 6)))]
  
  {
    if (which_alternative >= 2 && which_alternative < 4)
  {
...
  }
    else if (which_alternative >= 4)
  {
...
  }
  }
  [(set_attr "length" "6,6,6,6,10,10,10,10")
   (set_attr "conds" "use")]
)

Thanks,
-Jiangning

Patch:

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index c94e743..df6a3df
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -3502,6 +3502,10 @@ try_split (rtx pat, rtx trial, int last)
 split_branch_probability = INTVAL (XEXP (note, 0));
   probability = split_branch_probability;

+  extract_insn (trial);
+  if (!constrain_operands (reload_completed))
+    return trial;
+
   seq = split_insns (pat, trial);

   split_branch_probability = -1;





MAINTAINERS: add myself

2011-11-17 Thread Jiangning Liu
Just committed the following:

   * MAINTAINERS (Write After Approval): Add myself.

Index: MAINTAINERS
===
--- MAINTAINERS (revision 181466)
+++ MAINTAINERS (working copy)
@@ -419,6 +419,7 @@
 Marc Lehmann   p...@goof.com
 James Lemkejwle...@juniper.net
 Kriang Lerdsuwanakij
lerds...@users.sourceforge.net
+Jiangning Liu  jiangning@arm.com
 Sa Liu sa...@de.ibm.com
 Ralph Loader   r...@ihug.co.nz
 Gabor Loki l...@inf.u-szeged.hu





A question about redudant load elimination

2011-11-14 Thread Jiangning Liu
Hi,

For this test case,

int x;
extern void f(void);

void g(int *a)
{
a[x] = 1;
if (x == 100)
f();
a[x] = 2;
}

For trunk, the x86 assembly code is like below,

	movl	x, %eax
	movl	16(%esp), %ebx
	movl	$1, (%ebx,%eax,4)
	movl	x, %eax   // Is this a redundant one?
	cmpl	$100, %eax
	je	.L4
	movl	$2, (%ebx,%eax,4)
	addl	$8, %esp
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	popl	%ebx
	.cfi_restore 3
	.cfi_def_cfa_offset 4
	ret
	.p2align 4,,7
	.p2align 3
.L4:
	.cfi_restore_state
	call	f
	movl	x, %eax
	movl	$2, (%ebx,%eax,4)
	addl	$8, %esp
	.cfi_def_cfa_offset 8
	popl	%ebx
	.cfi_restore 3
	.cfi_def_cfa_offset 4
	ret

Is the 2nd "movl x, %eax" a redundant one for the single-thread programming
model? If yes, can it be optimized away?
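
For what it's worth, here is my guess at why the reload is conservative:
the store a[x] = 1 may change x itself when the caller passes x's own
address. The caller below is my own example, reusing g and x from the test
case above:

void caller (void)
{
  x = 0;
  g (&x);   /* inside g, "a[x] = 1" writes x, so x must be re-read */
}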

Thanks,
-Jiangning





[PATCH, ARM] Fix stack red zone bug (PR38644)

2011-11-01 Thread Jiangning Liu
Hi,

This patch is to fix PR38644 in ARM back-end. OK for trunk?

For every detail, please refer to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644.

ChangeLog:

2011-11-2  Jiangning Liu  jiangning@arm.com

PR rtl-optimization/38644
* config/arm/arm.c (thumb1_expand_epilogue): Add memory barrier
for epilogue having stack adjustment.

ChangeLog of testsuite:

2011-11-2  Jiangning Liu  jiangning@arm.com

PR rtl-optimization/38644
* gcc.target/arm/stack-red-zone.c: New.

Thanks,
-Jiangning

Patch:

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1ada6f..1f6fc26
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -22215,6 +22215,8 @@ thumb1_expand_epilogue (void)
   gcc_assert (amount = 0);
   if (amount)
 {
+  emit_insn (gen_blockage ());
+
   if (amount  512)
emit_insn (gen_addsi3 (stack_pointer_rtx, stack_pointer_rtx,
   GEN_INT (amount)));
diff --git a/gcc/testsuite/gcc.target/arm/stack-red-zone.c
b/gcc/testsuite/gcc.target/arm/stack-red-zone.c
new file mode 100644
index 000..b9f0f99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/stack-red-zone.c
@@ -0,0 +1,12 @@
+/* No stack red zone.  PR38644.  */
+/* { dg-options "-mthumb -O2" } */
+/* { dg-final { scan-assembler "ldrb\[^\n\]*\\n\[\t \]*add\[\t \]*sp" } } */
+
+extern int doStreamReadBlock (int *, char *, int size, int);
+
+char readStream (int *s)
+{
+   char c = 0;
+   doStreamReadBlock (s, &c, 1, *s);
+   return c;
+}







RE: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

2011-10-31 Thread Jiangning Liu


 -Original Message-
 From: Kai Tietz [mailto:ktiet...@googlemail.com]
 Sent: Thursday, October 27, 2011 5:36 PM
 To: Jiangning Liu
 Cc: Michael Matz; Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org;
 Richard Henderson
 Subject: Re: [patch tree-optimization]: Improve handling of
 conditional-branches on targets with high branch costs
 
 2011/10/27 Jiangning Liu jiangning@arm.com:
 
 
  -Original Message-
  From: Michael Matz [mailto:m...@suse.de]
  Sent: Wednesday, October 26, 2011 11:47 PM
  To: Kai Tietz
  Cc: Jiangning Liu; Richard Guenther; Kai Tietz; gcc-
 patc...@gcc.gnu.org;
  Richard Henderson
  Subject: Re: [patch tree-optimization]: Improve handling of
  conditional-branches on targets with high branch costs
 
  Hi,
 
  On Wed, 26 Oct 2011, Kai Tietz wrote:
 
   So you would mean that memory dereferencing shouldn't be
 considered
  as
   side-effect at all?
 
  No.  I haven't said this at all.  Of course it's a side-effect, but
  we're
  allowed to remove existing ones (under some circumstances).  We're
 not
  allowed to introduce new ones, which means that this ...
 
   So we would happily cause by code 'if (i && *i != 0)' a crash, as
   memory-dereference has for you no side-effect?
 
  ... is not allowed.  But in the original example the memread was on the
  left side, hence occurred always, therefore we can move it to the right
  side, even though it might occur less often.
 
   In your special case it might be valid that, if first (and C-fold-const
   doesn't know if the side-effect condition is really the first, as
 it
   might be a sub-sequence of a condition) condition might trap or
 not,
  to
   combine it.  But branching has to cover the general cases.  If you
  find
   a way to determine that left-hand operand in fold_const's
 branching
  code
   is really the left-most condition in chain, then we can add such a
   special case, but I don't see here an easy way to determine it.
 
  Hmm?  I don't see why it's necessary to check if it's the left-most
  condition in a chain.  If the left hand of '&&' is a memread it can
  always be moved to the right side (or the operator transformed into '&'
  which can have the same effect), of course only if the original rhs is
  free of side effects, but then independent of whether the '&&' was part
  of a larger expression.  The memread will possibly be done fewer times
  than originally, but as said, that's okay.
 
 
  Agree. The point is that for the small case I gave, the RHS doesn't have
  side effects at all, so the optimization of changing it to AND doesn't
  violate the C specification. We need to recover something for this case,
  although it did improve a lot for some particular benchmarks.
 
  Thanks,
  -Jiangning
 
 
  Ciao,
  Michael.
 
 Hmm, so we can allow merging to AND, if the left-hand-side might trap
 but has no-side-effects and rhs has neither trapping nor side-effects.
 As for the case that left-hand side has side-effects but right-hand
 not, we aren't allowed to do this AND/OR merge.  For example 'if ((f =
 foo ()) != 0 && f < 24)' we aren't allowed to make this
 transformation.
 
 This shouldn't be that hard.  We need to provide to simple_operand_p_2
 an additional argument for checking trapping or not.
 

Would it be OK if I file a tracker in bugzilla against this?

 Regards,
 Kai






RE: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

2011-10-27 Thread Jiangning Liu


 -Original Message-
 From: Michael Matz [mailto:m...@suse.de]
 Sent: Wednesday, October 26, 2011 11:47 PM
 To: Kai Tietz
 Cc: Jiangning Liu; Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org;
 Richard Henderson
 Subject: Re: [patch tree-optimization]: Improve handling of
 conditional-branches on targets with high branch costs
 
 Hi,
 
 On Wed, 26 Oct 2011, Kai Tietz wrote:
 
  So you would mean that memory dereferencing shouldn't be considered
 as
  side-effect at all?
 
 No.  I haven't said this at all.  Of course it's a side-effect, but
 we're
 allowed to remove existing ones (under some circumstances).  We're not
 allowed to introduce new ones, which means that this ...
 
  So we would happily cause a crash by code 'if (i && *i != 0)', as
  memory-dereference has for you no side-effect?
 
 ... is not allowed.  But in the original example the memread was on the
  left side, hence occurred always, therefore we can move it to the right
 side, even though it might occur less often.
 
   In your special case it might be valid that, if first (and C-fold-
 const
  doesn't know if the side-effect condition is really the first, as it
  might be a sub-sequence of a condition) condition might trap or not,
 to
  combine it.  But branching has to cover the general cases.  If you
 find
  a way to determine that left-hand operand in fold_const's branching
 code
  is really the left-most condition in chain, then we can add such a
  special case, but I don't see here an easy way to determine it.
 
 Hmm?  I don't see why it's necessary to check if it's the left-most
  condition in a chain.  If the left hand of '&&' is a memread it can
  always
  be moved to the right side (or the operator transformed into '&' which
  can
  have the same effect), of course only if the original rhs is free of
  side
  effects, but then independent if the && was part of a larger expression.
 The memread will possibly be done fewer times than originally, but as
 said, that's okay.
 

Agreed. The point is that for the small case I gave, the RHS has no side
effects at all, so changing the expression to AND doesn't violate the C
specification. We need to recover something for this case, although the
patch did improve some particular benchmarks a lot.

Thanks,
-Jiangning

 
 Ciao,
 Michael.






RE: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

2011-10-26 Thread Jiangning Liu


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Michael Matz
 Sent: Tuesday, October 11, 2011 10:45 PM
 To: Kai Tietz
 Cc: Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org; Richard
 Henderson
 Subject: Re: [patch tree-optimization]: Improve handling of
 conditional-branches on targets with high branch costs
 
 Hi,
 
 On Tue, 11 Oct 2011, Kai Tietz wrote:
 
   Better make it a separate function the first tests your new
   conditions, and then calls simple_operand_p.
 
  Well, either I make it a new function and call it instead of
  simple_operand_p,
 
 That's what I meant, yes.
 
   @@ -5149,13 +5176,6 @@ fold_truthop (location_t loc, enum tree_
build2 (BIT_IOR_EXPR, TREE_TYPE (ll_arg),
ll_arg, rl_arg),
build_int_cst (TREE_TYPE (ll_arg), 0));
   -
   -  if (LOGICAL_OP_NON_SHORT_CIRCUIT)
   - {
   -   if (code != orig_code || lhs != orig_lhs || rhs !=
 orig_rhs)
   - return build2_loc (loc, code, truth_type, lhs, rhs);
   -   return NULL_TREE;
   - }
  
   Why do you remove this hunk?  Shouldn't you instead move the hunk
 you
   added to fold_truth_andor() here.  I realize this needs some TLC to
   fold_truth_andor_1, because right now it early-outs for non-
 comparisons,
   but it seems the better place.  I.e. somehow move the below code
 into the
   above branch, with the associated diddling on fold_truth_andor_1
 that it
   gets called.
 
  This hunk is removed, as it is vain to do here.
 
 There is a fallthrough now, that wasn't there before.  I don't know if
 it's harmless, I just wanted to mention it.
 

Yes, this part introduced different behavior for this small case,

int f(char *i, int j)
{
if (*i && j != 2)
return *i;
else
return j;
}

Before the fix, we have

  D.4710 = *i;
  D.4711 = D.4710 != 0;
  D.4712 = j != 2;
  D.4713 = D.4711 & D.4712;
  if (D.4713 != 0) goto D.4714; else goto D.4715;
  D.4714:
  D.4710 = *i;
  D.4716 = (int) D.4710;
  return D.4716;
  D.4715:
  D.4716 = j;
  return D.4716;

After the fix, we have

  D.4711 = *i;
  if (D.4711 != 0) goto D.4712; else goto D.4710;
  D.4712:
  if (j != 2) goto D.4713; else goto D.4710;
  D.4713:
  D.4711 = *i;
  D.4714 = (int) D.4711;
  return D.4714;
  D.4710:
  D.4714 = j;
  return D.4714;
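
In C terms, the difference between the two dumps is roughly the following
(an illustrative paraphrase, not compiler output):

/* Before the fix: both tests evaluated unconditionally, single branch.  */
int f_before (char *i, int j)
{
  int t1 = *i != 0;
  int t2 = j != 2;
  if (t1 & t2)
    return *i;
  return j;
}

/* After the fix: ordinary short-circuit control flow; 'j != 2' is only
   evaluated when *i is nonzero.  */
int f_after (char *i, int j)
{
  if (*i != 0 && j != 2)
    return *i;
  return j;
}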

Does this meet the original expectation? 

I find the code below in function fold_truth_andor makes difference,

  /* Transform (A AND-IF B) into (A AND B), or (A OR-IF B)   into (A OR B).
 For sequence point consistancy, we need to check for trapping,
 and side-effects.  */
  else if (code == icode && simple_operand_p_2 (arg0)
	   && simple_operand_p_2 (arg1))
 return fold_build2_loc (loc, ncode, type, arg0, arg1);

For *i != 0, simple_operand_p (*i) returns false. Originally this was not
checked by the code, and I don't see that the patch was meant to cover it. Can
this be explained?

I'm not arguing that this patch did anything worse; I only want to understand
the rationale behind it and confirm that the patch doesn't hurt anything else.
On the contrary, I measured that this change accidentally improved some
benchmarks significantly, for other reasons.

Thanks,
-Jiangning

  Btw richi asked for it, and I agree that new TRUTH-AND/OR packing is
  better done at a single place in fold_truth_andor only.
 
 As fold_truthop is called twice by fold_truth_andor, the latter might
 indeed be the better place.
 
 
 Ciao,
 Michael.





RE: [PATCH] Fix stack red zone bug (PR38644)

2011-10-09 Thread Jiangning Liu


 -Original Message-
 From: Richard Henderson [mailto:r...@redhat.com]
 Sent: Saturday, October 01, 2011 3:05 AM
 To: Jiangning Liu
 Cc: 'Jakub Jelinek'; 'Richard Guenther'; Andrew Pinski; gcc-
 patc...@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On 09/29/2011 06:13 PM, Jiangning Liu wrote:
 
 
  -Original Message-
  From: Jakub Jelinek [mailto:ja...@redhat.com]
  Sent: Thursday, September 29, 2011 6:14 PM
  To: Jiangning Liu
  Cc: 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
  On Thu, Sep 29, 2011 at 06:08:50PM +0800, Jiangning Liu wrote:
  As far as I know different back-ends are implementing different
  prologue/epilogue in GCC. If one day this part can be refined and
  abstracted
  as well, I would say solving this stack-red-zone problem in shared
  prologue/epilogue code would be a perfect solution, and barrier can
  be
  inserted there.
 
  I'm not saying you are wrong on keeping scheduler using a pure
  barrier
  interface. From engineering point of view, I only feel my proposal
 is
  so far
  so good, because this patch at least solve the problem for all
  targets in a
  quite simple way. Maybe it can be improved in future based on this.
 
  But you don't want to listen about any other alternative, other
  backends are
  happy with being able to put the best kind of barrier at the best
 spot
  in the epilogue and don't need a generic solution which won't
 model
  very
  well the target diversity anyway.
 
  Jakub,
 
  Appreciate for your attention on this issue,
 
  1) Can you clarify who are the others back-ends? Does it cover most
 of the
  back-ends being supported by GCC right now?
 
 Your red-stack barrier issue is *exactly* the same as the frame pointer
 barrier issue, which affects many backends.
 
 That is, if the frame pointer is initialized before the local stack
 frame
 is allocated, then one has to add a barrier such that memory references
 based on the frame pointer are not scheduled before the local stack
 frame
 allocation.
 
 One example of this is in the i386 port, where the prologue looks like
 
    push    %ebp
   mov %esp, %ebp
   sub $frame, %esp
 
 The rtl we emit for that subtraction looks like
 
  (define_insn "pro_epilogue_adjust_stack_<mode>_add"
    [(set (match_operand:P 0 "register_operand" "=r,r")
	  (plus:P (match_operand:P 1 "register_operand" "0,r")
		  (match_operand:P 2 "nonmemory_operand" "ri,li")))
     (clobber (reg:CC FLAGS_REG))
     (clobber (mem:BLK (scratch)))]
 
 Note the final clobber, which is a memory scheduling barrier.
 
 Other targets use similar tricks.  For instance arm stack_tie.
 
 Honestly, I've found nothing convincing throughout this thread that
 suggests to me that this problem should be handled generically.
 

Richard H.,

Thanks for your explanation by giving an example in x86. 

The key point is that, if possible, fixing this in the middle end benefits
all ports directly and avoids a bug-fixing burden in the back ends, rather
than fixing the problem port by port.

What is actually being debated here is whether the memory barrier is modeled
properly throughout GCC rather than in a single component, because my current
understanding is that the scheduler is not the only component using memory
barriers.

Thanks,
-Jiangning

 
 r~






RE: [PATCH] Fix stack red zone bug (PR38644)

2011-10-09 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Friday, September 30, 2011 8:57 PM
 To: Jiangning Liu; Jakub Jelinek; Richard Guenther; Andrew Pinski; gcc-
 patc...@gcc.gnu.org; richard.sandif...@linaro.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Fri, Sep 30, 2011 at 2:46 PM, Richard Sandiford
 richard.sandif...@linaro.org wrote:
  Jiangning Liu jiangning@arm.com writes:
  You seem to feel strongly about this because it's a wrong-code bug
 that
  is very easy to introduce and often very hard to detect.  And I
   definitely
   sympathise with that.  If we were going to do it in a target-
  independent
  way, though, I think it would be better to scan patterns like
 epilogue
  and
  automatically introduce barriers before assignments to
  stack_pointer_rtx
  (subject to the kind of hook in your patch).  But I still don't
 think
  that's better than leaving the onus on the backend.  The backend is
   still responsible for much more complicated things like determining
  the correct deallocation and register-restore sequence, and for
  determining the correct CFI sequence.
 
 
  I think middle-end in GCC is actually shared code rather than the
 part
  exactly in the middle. A pass working on RTL can be a middle end
 just
  because the code can be shared for all targets, and some passes can
 even
  work for both GIMPLE and RTL.
 
  Actually some optimizations need to work through shared part
 (middle-end)
  plus target specific part (back-end). You are thinking the
 interface
  between this shared part and target specific part should be
 using
  barrier as a properly model. To some extension I agree with this.
 However,
  it doesn't mean the fix should be in back-end rather than middle end,
  because obviously this problem is a common ABI issue for all targets.
 If we
  can abstract this issue to be a shared part, why shouldn't we do it
 in
  middle end to reduce the onus of back-end? Back-end should handle
 the target
  specific things rather than only the complicated things.
 
  And for avoidance of doubt, the automatic barrier insertion that I
  described would be one way of doing it in target-independent code.
  But...
 
  If a complicated problem can be implemented in a shared code
 manner, we
  still want to put it into middle end rather than back-end. I believe
 those
  optimizations based on SSA form are complicated enough, but they are
 all in
  middle end. This is the logic I'm seeing in GCC.
 
  The situation here is different.  The target-independent rtl code is
  being given a blob of instructions that the backend has generated for
  the epilogue.  There's no fine-tuning beyond that.  E.g. we don't
 have
  separate patterns for restore registers, deallocate stack,
 return:
  we just have one monolithic epilogue pattern.  The target-
 independent
  code has very little control.
 
  In contrast, after the tree optimisers have handed off the initial IL,
  the tree optimisers are more or less in full control.  There are very
  few cases where we generate further trees outside the middle-
 end.  The only
  case I know off-hand is the innards of va_start and va_arg, which can
 be
  generated by the backend.
 
  So let's suppose we had a similar situation there, where we wanted
  va_arg do something special in a certain situation.  If we had the
  same three choices of:
 
   1. use an on-the-side hook to represent the special something
   2. scan the code generated by the backend and automatically
      inject the special something at an appropriate place
   3. require each backend to do it properly from the start
 
  (OK, slightly prejudiced wording :-)) I think we'd still choose 3.
 
  For this particular issue, I don't think that hook interface I'm
  proposing is more complicated than the barrier. Instead, it is
 easier
  for back-end implementer to be aware of the potential issue before
  really solving stack red zone problem, because it is very clearly
  listed in target hook list.
 
  The point for model it in the IL supporters like myself is that we
  have both many backends and many rtl passes.  Putting it in a hook
 keeps
  things simple for the backends, but it means that every rtl pass must
 be
  aware of this on-the-side dependency.  Perhaps sched2 really is the
 only
  pass that needs to look at the hook at present.  But perhaps not.
  E.g. dbr_schedule (not a problem on ARM, I realise) also reorders
  instructions, so maybe it would need to be audited to see whether any
  calls to this hook are needed.  And perhaps we'd add more rtl passes
  later.
 
  The point behind using a barrier is that the rtl passes do not then
 need
  to treat the stack-deallocation dependency as a special case.  They
 can
  just use the normal analysis and get it right.
 
  In other words, we're both arguing for safety here.
 
 Indeed.  It's certainly not only scheduling that can move instructions,
 but RTL PRE, combine, ifcvt all can

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-10-09 Thread Jiangning Liu


 -Original Message-
 From: Richard Sandiford richard.sandif...@linaro.org
 Date: Fri, Sep 30, 2011 at 8:46 PM
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 To: Jiangning Liu jiangning@arm.com
 Cc: Jakub Jelinek ja...@redhat.com, Richard Guenther
 richard.guent...@gmail.com, Andrew Pinski pins...@gmail.com,
 gcc-patches@gcc.gnu.org
 
 
 Jiangning Liu jiangning@arm.com writes:
  You seem to feel strongly about this because it's a wrong-code bug
 that
  is very easy to introduce and often very hard to detect.  And I
   definitely
   sympathise with that.  If we were going to do it in a target-
  independent
  way, though, I think it would be better to scan patterns like
 epilogue
  and
  automatically introduce barriers before assignments to
  stack_pointer_rtx
  (subject to the kind of hook in your patch).  But I still don't
 think
  that's better than leaving the onus on the backend.  The backend is
   still responsible for much more complicated things like determining
  the correct deallocation and register-restore sequence, and for
  determining the correct CFI sequence.
 
 
  I think middle-end in GCC is actually shared code rather than the
 part
  exactly in the middle. A pass working on RTL can be a middle end just
  because the code can be shared for all targets, and some passes can
 even
  work for both GIMPLE and RTL.
 
  Actually some optimizations need to work through shared part
 (middle-end)
  plus target specific part (back-end). You are thinking the
 interface
  between this shared part and target specific part should be using
   the barrier as a proper model. To some extent I agree with this.
 However,
  it doesn't mean the fix should be in back-end rather than middle end,
  because obviously this problem is a common ABI issue for all targets.
 If we
  can abstract this issue to be a shared part, why shouldn't we do it
 in
  middle end to reduce the onus of back-end? Back-end should handle the
 target
  specific things rather than only the complicated things.
 
 And for avoidance of doubt, the automatic barrier insertion that I
 described would be one way of doing it in target-independent code.
 But...
 
  If a complicated problem can be implemented in a shared code manner,
 we
  still want to put it into middle end rather than back-end. I believe
 those
  optimizations based on SSA form are complicated enough, but they are
 all in
  middle end. This is the logic I'm seeing in GCC.
 
 The situation here is different.  The target-independent rtl code is
 being given a blob of instructions that the backend has generated for
 the epilogue.  There's no fine-tuning beyond that.  E.g. we don't have
 separate patterns for restore registers, deallocate stack, return:
 we just have one monolithic epilogue pattern.  The target-independent
 code has very little control.
 
 In contrast, after the tree optimisers have handed off the initial IL,
 the tree optimisers are more or less in full control.  There are very
 few cases where we generate further trees outside the middle-end.  The
 only
 case I know off-hand is the innards of va_start and va_arg, which can
 be
 generated by the backend.
 
 So let's suppose we had a similar situation there, where we wanted
 va_arg do something special in a certain situation.  If we had the
 same three choices of:
 
  1. use an on-the-side hook to represent the special something
  2. scan the code generated by the backend and automatically
     inject the special something at an appropriate place
  3. require each backend to do it properly from the start
 
 (OK, slightly prejudiced wording :-)) I think we'd still choose 3.
 

Richard S.,

Although I implemented va_arg for a commercial compiler a long time ago, I
have forgotten all the details. :-) I'm not sure that va_arg is a good
example to compare with this stack red zone case.

  For this particular issue, I don't think that hook interface I'm
  proposing is more complicated than the barrier. Instead, it is easier
  for back-end implementer to be aware of the potential issue before
  really solving stack red zone problem, because it is very clearly
  listed in target hook list.
 
 The point for model it in the IL supporters like myself is that we
 have both many backends and many rtl passes.  Putting it in a hook
 keeps
 things simple for the backends, but it means that every rtl pass must
 be
 aware of this on-the-side dependency.  Perhaps sched2 really is the
 only
 pass that needs to look at the hook at present.  But perhaps not.
 E.g. dbr_schedule (not a problem on ARM, I realise) also reorders
 instructions, so maybe it would need to be audited to see whether any
 calls to this hook are needed.  And perhaps we'd add more rtl passes
 later.

Let me rephrase your justification with my own words.

===

We can't compare adding a new pass with adding a new port, because they are
totally different things. But it implies that with my proposal the burden may
still be added

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-30 Thread Jiangning Liu


 -Original Message-
 From: Richard Sandiford [mailto:rdsandif...@googlemail.com]
 Sent: Friday, September 30, 2011 4:15 PM
 To: Jiangning Liu
 Cc: 'Jakub Jelinek'; 'Richard Guenther'; Andrew Pinski; gcc-
 patc...@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 Jiangning Liu jiangning@arm.com writes:
  -Original Message-
  From: Jakub Jelinek [mailto:ja...@redhat.com]
  Sent: Thursday, September 29, 2011 6:14 PM
  To: Jiangning Liu
  Cc: 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
  On Thu, Sep 29, 2011 at 06:08:50PM +0800, Jiangning Liu wrote:
   As far as I know different back-ends are implementing different
   prologue/epilogue in GCC. If one day this part can be refined and
  abstracted
   as well, I would say solving this stack-red-zone problem in shared
   prologue/epilogue code would be a perfect solution, and barrier
 can
  be
   inserted there.
  
   I'm not saying you are wrong on keeping scheduler using a pure
  barrier
   interface. From engineering point of view, I only feel my proposal
 is
  so far
   so good, because this patch at least solve the problem for all
  targets in a
   quite simple way. Maybe it can be improved in future based on this.
 
  But you don't want to listen about any other alternative, other
  backends are
  happy with being able to put the best kind of barrier at the best
 spot
  in the epilogue and don't need a generic solution which won't
 model
  very
  well the target diversity anyway.
 
  Jakub,
 
  Appreciate for your attention on this issue,
 
  1) Can you clarify who are the others back-ends? Does it cover most
 of the
  back-ends being supported by GCC right now?
 
 Not answering for Jakub of course, but as a maintainer of a backend, I
 know
 MIPS doesn't have the required barrier at the moment.  But that's a bug.
 
 Like others in this thread, I'm strongly of the opinion that this
 should
 be modelled directly in the IL.  And it's already supposed to be
 modelled
 in the IL.  Target-independent code emits the required barriers in
 cases
 where it rather than the backend patterns are responsible.  E.g.:
 
 emit_clobber (gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode)));
 emit_clobber (gen_rtx_MEM (BLKmode, hard_frame_pointer_rtx));
 
 emit_move_insn (hard_frame_pointer_rtx, fp);
 emit_stack_restore (SAVE_NONLOCAL, stack);
 
 from expand_builtin_longjmp() and:
 
   if (sa != 0)
 {
   sa = validize_mem (sa);
   /* These clobbers prevent the scheduler from moving
references to variable arrays below the code
that deletes (pops) the arrays.  */
   emit_clobber (gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode)));
   emit_clobber (gen_rtx_MEM (BLKmode, stack_pointer_rtx));
 }
 
 from emit_stack_restore().  Backends that fail to follow suit are IMO
 just buggy.
 
 FWIW, I intend to fix MIPS this weekend.
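
A fix of that shape typically goes in the backend's epilogue expander. As a
hedged sketch only -- generic placeholder names, not the actual MIPS change:

/* Emit a scheduling barrier before deallocating the local frame so that
   no frame access can be moved below the stack adjustment.  */
static void
example_expand_epilogue (HOST_WIDE_INT frame_size)
{
  /* ... restore call-saved registers from the frame ... */

  /* Memory scheduling barrier: a clobber of all of memory.  */
  emit_clobber (gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode)));

  /* Now it is safe to deallocate the frame.  */
  emit_insn (gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
                            GEN_INT (frame_size)));
}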

Richard S.,

I appreciate your attention to this issue and your investigation of the MIPS
target.

I am really glad to know you found a potential bug in MIPS through this
discussion. To some extent this proves my earlier hypothesis.

 
 You seem to feel strongly about this because it's a wrong-code bug that
 is very easy to introduce and often very hard to detect.  And I
  definitely
  sympathise with that.  If we were going to do it in a target-
 independent
 way, though, I think it would be better to scan patterns like epilogue
 and
 automatically introduce barriers before assignments to
 stack_pointer_rtx
 (subject to the kind of hook in your patch).  But I still don't think
 that's better than leaving the onus on the backend.  The backend is
  still responsible for much more complicated things like determining
 the correct deallocation and register-restore sequence, and for
 determining the correct CFI sequence.
 

I think the middle end in GCC is actually shared code rather than the part
exactly in the middle. A pass working on RTL can be middle end just because
the code is shared for all targets, and some passes can even work on both
GIMPLE and RTL.

Actually some optimizations need to work through the shared part (middle end)
plus the target-specific part (back end). You are thinking the interface
between this shared part and the target-specific part should use the barrier
as the proper model. To some extent I agree with this. However, it doesn't
mean the fix should be in the back end rather than the middle end, because
obviously this problem is a common ABI issue for all targets. If we can
abstract this issue into the shared part, why shouldn't we do it in the
middle end to reduce the onus on the back end? The back end should handle the
target-specific things rather than only the complicated things.

If a complicated problem can be solved in shared code, we still want to put
it into the middle end rather than the back end. I believe the optimizations
based on SSA form are complicated enough

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-29 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Thursday, September 29, 2011 5:03 PM
 To: Jiangning Liu
 Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Thu, Sep 29, 2011 at 5:13 AM, Jiangning Liu jiangning@arm.com
 wrote:
 
 
  -Original Message-
  From: Richard Guenther [mailto:richard.guent...@gmail.com]
  Sent: Wednesday, September 28, 2011 5:56 PM
  To: Jiangning Liu
  Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
  On Wed, Sep 28, 2011 at 11:40 AM, Jiangning Liu
 jiangning@arm.com
  wrote:
  
  
   -Original Message-
   From: Richard Guenther [mailto:richard.guent...@gmail.com]
   Sent: Wednesday, September 28, 2011 5:20 PM
   To: Jiangning Liu
   Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
   Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
  
   On Wed, Sep 28, 2011 at 11:10 AM, Jiangning Liu
  jiangning@arm.com
   wrote:
   
   
-Original Message-
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Wednesday, September 28, 2011 4:39 PM
To: Jiangning Liu
Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
   
On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu
   jiangning@arm.com
wrote:


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, September 27, 2011 3:41 PM
 To: Jiangning Liu
 Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

 On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu
jiangning@arm.com
 wrote:
  Think of it this way.  What the IR says is there is no
  barrier
 between
  those moves.  You either have an implicit barrier (which
 is
   what
you
  are proposing) or you have it explicitly.  I think we
 all
   rather
 have
  more things explicit rather than implicit in the
 IR.  And
  that
has
  been the overall feeling for a few years now.
 
 
  Sorry, I'm afraid I can't agree with you. Instead, I
 think
   using
 barrier to describe this kind of dependence is a kind of
  implicit
 method. Please note that this is not an usual data
 dependence
   issue.
 The stack pointer change doesn't have any dependence with
  memory
access
 at all.

 It is similar to atomic instructions that require being an
 optimization/memory barrier.  Sure it is not a usual data
   dependence
 (otherwise we would handle
 it already), so the targets have to explicitly express the
dependence
 somehow, for which we only have barriers right now.


 Richard,

 Thanks for your explanation. It's explicit to back-end,
 while
  it's
implicit
 to scheduler in middle end, because barrier can decide
  dependence
   in
 scheduler but barrier can be generated from several
 different
scenarios.
 It's unsafe and prone to introduce bug if any one of the
  scenarios
requiring
 generating barriers is being missed in back-end.

 Between middle-end and back-end, we should have interfaces
 that
  is
easy to
 be implemented by back-end. After all, middle-end itself
 can't
consist of a
 compiler, and vice versa. Back-end needs middle-end's help
 to
  make
sure
 back-end is easy to be implemented and reduce the
 possibility
  of
introducing
 bugs.

 Without an explicit hook as I'm proposing, back-end
  implementers
   have
to
 clearly know all scenarios of generating barrier very
 clearly,
because there
 isn't any code tips and comments in middle end telling back-
 end
   the
list of
 all scenarios on generating barriers.

 Yes, barrier is a perfect interface for scheduler in theory.
  But
   from
 engineering point of view, I think it's better to explicitly
   define
an
 interface to describe stack red zone and inform back-end, or
  vice
versa. Not
 like computer, people is easy to make mistake if you don't
 tell
   them.
On
 this bug, the fact is over the years different back-ends
 made
   similar
bugs.

 GCC is really a perfect platform on building new ports, and
 I
  saw
   a
lot of
 new back-ends. The current middle end is unsafe, if port
  doesn't
support
 stack red zone and back-ends doesn't generate barrier for it.
  Why
can't we
 explicitly clarify this in compiler code between middle end
 and
   back
end?
 What if any other back-end (new or old) NOT supporting stack
  red
   zone
 exposing the similar bug again?
   
There are gazillion things you have to explicitly get right in
  your
backends,
so I don't see why exposing proper scheduling barriers should
 be
special

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-29 Thread Jiangning Liu


 -Original Message-
 From: Jakub Jelinek [mailto:ja...@redhat.com]
 Sent: Thursday, September 29, 2011 6:14 PM
 To: Jiangning Liu
 Cc: 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Thu, Sep 29, 2011 at 06:08:50PM +0800, Jiangning Liu wrote:
  As far as I know different back-ends are implementing different
  prologue/epilogue in GCC. If one day this part can be refined and
 abstracted
  as well, I would say solving this stack-red-zone problem in shared
  prologue/epilogue code would be a perfect solution, and barrier can
 be
  inserted there.
 
  I'm not saying you are wrong on keeping scheduler using a pure
 barrier
  interface. From engineering point of view, I only feel my proposal is
 so far
  so good, because this patch at least solve the problem for all
 targets in a
  quite simple way. Maybe it can be improved in future based on this.
 
 But you don't want to listen about any other alternative, other
 backends are
 happy with being able to put the best kind of barrier at the best spot
 in the epilogue and don't need a generic solution which won't model
 very
 well the target diversity anyway.

Jakub,

Appreciate for your attention on this issue,

1) Can you clarify who the other back ends are? Do they cover most of the
back ends supported by GCC right now?
2) You need to give data to prove the other back ends are happy with the
current solution. The fact is that over the years a bunch of bugs have been
filed against this problem. Why? Are you implying the other back ends are
happy with bugs or potential bugs? That seems weird to me. Also, this is not
an issue of whether a back end is able to implement the barrier. If you are
really asking about ability, I would say every back end is able, but being
able doesn't mean it is correct, and it doesn't mean it is the most
reasonable approach.

Compared with the one I am proposing, I don't see that the current solution
has significant advantages beyond the proper modeling for the scheduler you
are arguing for. The solution I'm proposing is a safe one that makes bugs
easy to avoid. If the GCC infrastructure can't even guarantee safe
compilation, why should we insist on proper modeling for the scheduler alone?

Hopefully you can consider again about this.

-Jiangning

 
   Jakub






RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-28 Thread Jiangning Liu
  -static inline bool
  +extern bool
 
 static bool
 
   ix86_using_red_zone (void)
   {
     return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
  @@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
   #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
   #endif
 

...

 
  +/* Return true if the ABI allows red zone access.  */
  +extern bool
 
 static bool
 
  +rs6000_stack_using_red_zone (void)
  +{
  +   return (DEFAULT_ABI != ABI_V4);
  +}
  +
 

Uros,

Thanks for your review. Accepted, and the new patch is below. No change to
the ChangeLog.

Thanks,
-Jiangning

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ff8c49f..e200974 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2631,7 +2631,7 @@ static const char *const
cpu_names[TARGET_CPU_DEFAULT_max] =
 

 /* Return true if a red-zone is in use.  */
 
-static inline bool
+static bool
 ix86_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif
 
+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE ix86_using_red_zone
+
 #undef TARGET_FUNCTION_VALUE
 #define TARGET_FUNCTION_VALUE ix86_function_value
 
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 1ab57e5..1e64d14 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1537,6 +1537,9 @@ static const struct attribute_spec
rs6000_attribute_table[] =
 #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail
 #endif
 
+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone
+
 /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors
The PowerPC architecture requires only weak consistency among
processors--that is, memory accesses between processors need not be
@@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int
using_mfcr_multiple)
}
 }
 
+/* Return true if the ABI allows red zone access.  */
+static bool
+rs6000_stack_using_red_zone (void)
+{
+   return (DEFAULT_ABI != ABI_V4);
+}
+
 /* Return true if OFFSET from stack pointer can be clobbered by signals.
V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes
below stack pointer not cloberred by signals.  */
@@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int
using_mfcr_multiple)
 static inline bool
 offset_below_red_zone_p (HOST_WIDE_INT offset)
 {
-  return offset < (DEFAULT_ABI == ABI_V4
+  return offset < (!TARGET_STACK_USING_RED_ZONE
   ? 0
   : TARGET_32BIT ? -220 : -288);
 }
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..c848806 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11332,6 +11332,22 @@ to the stack.  Therefore, this hook should return
true in general, but
 false for naked functions.  The default implementation always returns true.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void)
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an
access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default is to return false which is safe and appropriate for most
targets.
+@end deftypefn
+
 @deftypevr {Target Hook} {unsigned HOST_WIDE_INT} TARGET_CONST_ANCHOR
 On some architectures it can take multiple instructions to synthesize
 a constant.  If there is another constant already in a register that
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..0fa9e10 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -11223,6 +11223,22 @@ to the stack.  Therefore, this hook should return
true in general, but
 false for naked functions.  The default implementation always returns true.
 @end deftypefn
 
+@hook TARGET_STACK_USING_RED_ZONE
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an
access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default is to return false which is safe and appropriate for most
targets.
+@end deftypefn
+
 @hook TARGET_CONST_ANCHOR
 On some architectures it can take multiple 
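
The archived message is cut off above; for completeness, the target.def hunk
named in the ChangeLog below would presumably look something like this (a
hypothetical reconstruction, not the actual patch text):

DEFHOOK
(stack_using_red_zone,
 "This hook returns true if the target has a red zone (an area beyond\n\
the current extent of the stack that cannot be modified by asynchronous\n\
events on the processor).",
 bool, (void),
 hook_bool_void_false)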

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-28 Thread Jiangning Liu
Just realized the ChangeLog needs to be changed as well.

ChangeLog:

* config/i386/i386.c (ix86_using_red_zone): Remove inline.
(TARGET_STACK_USING_RED_ZONE): New.
* config/rs6000/rs6000.c (rs6000_stack_using_red_zone): New.
(TARGET_STACK_USING_RED_ZONE): New.
(offset_below_red_zone_p): Change to use new hook 
TARGET_STACK_USING_RED_ZONE.
* doc/tm.texi (TARGET_STACK_USING_RED_ZONE): New.
* doc/tm.texi.in (TARGET_STACK_USING_RED_ZONE): New.
* sched-deps.c (sched_analyze_1): If the stack pointer is being 
modified and the stack red zone is not supported by the port, flush out 
all memory references as they may become invalid if moved across 
the stack adjustment.
* target.def (stack_using_red_zone): New.
* testsuite/gcc.target/arm/stack-red-zone.c: New.
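
The sched-deps.c hunk itself is not reproduced in this excerpt. Based on the
ChangeLog entry above, the change is presumably along these lines (a
hypothetical sketch, not the actual patch):

/* In sched_analyze_1: when the stack pointer is written and the target
   has no red zone, act as a scheduling barrier for queued memory
   references instead of allowing them to be reordered past the
   adjustment.  */
if (REG_P (dest)
    && REGNO (dest) == STACK_POINTER_REGNUM
    && !targetm.stack_using_red_zone ())
  flush_pending_lists (deps, insn, false, true);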

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
 Sent: Wednesday, September 28, 2011 2:24 PM
 To: 'Uros Bizjak'
 Cc: gcc-patches@gcc.gnu.org; j...@suse.cz; geo...@geoffk.org;
 dje@gmail.com; r...@redhat.com; Richard Earnshaw; Matthew Gretton-
 Dann
 Subject: RE: [PATCH] Fix stack red zone bug (PR38644)
 
   -static inline bool
   +extern bool
 
  static bool
 
    ix86_using_red_zone (void)
    {
      return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
   @@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
    #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
    #endif
  
 
 ...
 
  
   +/* Return true if the ABI allows red zone access.  */
   +extern bool
 
  static bool
 
   +rs6000_stack_using_red_zone (void)
   +{
   +   return (DEFAULT_ABI != ABI_V4);
   +}
   +
 
 
 Uros,
 
  Thanks for your review. Accepted, and the new patch is below. No change to
  the ChangeLog.
 
 Thanks,
 -Jiangning
 
 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
 index ff8c49f..e200974 100644
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -2631,7 +2631,7 @@ static const char *const
 cpu_names[TARGET_CPU_DEFAULT_max] =
 
 
  /* Return true if a red-zone is in use.  */
 
 -static inline bool
 +static bool
  ix86_using_red_zone (void)
  {
    return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
 @@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
  #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
  #endif
 
 +#undef TARGET_STACK_USING_RED_ZONE
 +#define TARGET_STACK_USING_RED_ZONE ix86_using_red_zone
 +
  #undef TARGET_FUNCTION_VALUE
  #define TARGET_FUNCTION_VALUE ix86_function_value
 
 diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
 index 1ab57e5..1e64d14 100644
 --- a/gcc/config/rs6000/rs6000.c
 +++ b/gcc/config/rs6000/rs6000.c
 @@ -1537,6 +1537,9 @@ static const struct attribute_spec
 rs6000_attribute_table[] =
  #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail
  #endif
 
 +#undef TARGET_STACK_USING_RED_ZONE
 +#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone
 +
  /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors
 The PowerPC architecture requires only weak consistency among
 processors--that is, memory accesses between processors need not be
 @@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int
 using_mfcr_multiple)
   }
  }
 
 +/* Return true if the ABI allows red zone access.  */
 +static bool
 +rs6000_stack_using_red_zone (void)
 +{
 +   return (DEFAULT_ABI != ABI_V4);
 +}
 +
  /* Return true if OFFSET from stack pointer can be clobbered by
 signals.
 V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes
 below stack pointer not cloberred by signals.  */
 @@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int
 using_mfcr_multiple)
  static inline bool
  offset_below_red_zone_p (HOST_WIDE_INT offset)
  {
  -  return offset < (DEFAULT_ABI == ABI_V4
  +  return offset < (!TARGET_STACK_USING_RED_ZONE
  ? 0
  : TARGET_32BIT ? -220 : -288);
  }
 diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
 index 335c1d1..c848806 100644
 --- a/gcc/doc/tm.texi
 +++ b/gcc/doc/tm.texi
 @@ -11332,6 +11332,22 @@ to the stack.  Therefore, this hook should
 return
 true in general, but
  false for naked functions.  The default implementation always returns
 true.
  @end deftypefn
 
 +@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void)
 +This hook returns true if the target has a red zone (an area beyond
 the
 +current extent of the stack that cannot be modified by asynchronous
 events
 +on the processor).
 +
 +If this hook returns false then the compiler mid-end will not move an
 access
 +to memory in the stack frame past a stack adjustment insn.
 +
 +If this hook returns true then the compiler mid-end will assume that
 it is
 +safe to move an access to memory in the stack frame past a stack
 adjustment
 +insn.  The target back-end must emit scheduling barrier insns when
 this is
 +unsafe.
 +
 +The default

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-28 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Wednesday, September 28, 2011 4:39 PM
 To: Jiangning Liu
 Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu jiangning@arm.com
 wrote:
 
 
  -Original Message-
  From: Richard Guenther [mailto:richard.guent...@gmail.com]
  Sent: Tuesday, September 27, 2011 3:41 PM
  To: Jiangning Liu
  Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
  On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu
 jiangning@arm.com
  wrote:
   Think of it this way.  What the IR says is there is no barrier
  between
   those moves.  You either have an implicit barrier (which is what
 you
   are proposing) or you have it explicitly.  I think we all rather
  have
   more things explicit rather than implicit in the IR.  And that
 has
   been the overall feeling for a few years now.
  
  
   Sorry, I'm afraid I can't agree with you. Instead, I think using
  barrier to describe this kind of dependence is a kind of implicit
  method. Please note that this is not an usual data dependence issue.
  The stack pointer change doesn't have any dependence with memory
 access
  at all.
 
  It is similar to atomic instructions that require being an
  optimization/memory barrier.  Sure it is not a usual data dependence
  (otherwise we would handle
  it already), so the targets have to explicitly express the
 dependence
  somehow, for which we only have barriers right now.
 
 
  Richard,
 
  Thanks for your explanation. It's explicit to back-end, while it's
 implicit
  to scheduler in middle end, because barrier can decide dependence in
  scheduler but barrier can be generated from several different
 scenarios.
  It's unsafe and prone to introduce bug if any one of the scenarios
 requiring
  generating barriers is being missed in back-end.
 
  Between middle-end and back-end, we should have interfaces that is
 easy to
  be implemented by back-end. After all, middle-end itself can't
 consist of a
  compiler, and vice versa. Back-end needs middle-end's help to make
 sure
  back-end is easy to be implemented and reduce the possibility of
 introducing
  bugs.
 
  Without an explicit hook as I'm proposing, back-end implementers have
 to
  clearly know all scenarios of generating barrier very clearly,
 because there
  isn't any code tips and comments in middle end telling back-end the
 list of
  all scenarios on generating barriers.
 
  Yes, barrier is a perfect interface for scheduler in theory. But from
  engineering point of view, I think it's better to explicitly define
 an
  interface to describe stack red zone and inform back-end, or vice
 versa. Not
  like computer, people is easy to make mistake if you don't tell them.
 On
  this bug, the fact is over the years different back-ends made similar
 bugs.
 
  GCC is really a perfect platform on building new ports, and I saw a
 lot of
  new back-ends. The current middle end is unsafe, if port doesn't
 support
  stack red zone and back-ends doesn't generate barrier for it. Why
 can't we
  explicitly clarify this in compiler code between middle end and back
 end?
  What if any other back-end (new or old) NOT supporting stack red zone
  exposing the similar bug again?
 
 There are gazillion things you have to explicitly get right in your
 backends,
 so I don't see why exposing proper scheduling barriers should be
 special,
 and there, why red-zones should be special (as opposed to other
 occasions
 where you need to emit barriers from the backend for the scheduler).
 

Richard,

This is because,

1) The current scheduler is unsafe if the back end doesn't generate a barrier
for a port that doesn't support a stack red zone at all.
2) Implementing the barrier in the back end is a burden on every new back-end
implementation for ports not supporting a stack red zone at all.
3) There are many other ports not reporting bugs around this. I doubt they
are bug-free.
4) There are over 300 target hooks defined in target.def. I don't think
adding this interface hurts GCC.

BTW, really appreciate your close attention to this specific issue.

Thanks,
-Jiangning

 Richard.
 
  Thanks,
  -Jiangning
 
  Richard.
 
    No matter what the solution itself is, the problem is quite a
  common one on ISA level, so we should solve it in middle-end,
 because
  middle-end is shared for all ports.
  
   My proposal avoids problems in future. Any new ports and new back-
 end
  implementations needn't explicitly define this hook in a very safe
 way.
  But if one port wants to unsafely introduce red zone, we can
  explicitly define this hook in back-end. This way, we would avoid
 the
  bug in the earliest time. Do you really want to hide this problem in
  back-end silently? And wait others to report bug and then waste time
 to
  get it fixed again?
  
   The facts I see is over the years

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-28 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Wednesday, September 28, 2011 5:20 PM
 To: Jiangning Liu
 Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Wed, Sep 28, 2011 at 11:10 AM, Jiangning Liu jiangning@arm.com
 wrote:
 
 
  -Original Message-
  From: Richard Guenther [mailto:richard.guent...@gmail.com]
  Sent: Wednesday, September 28, 2011 4:39 PM
  To: Jiangning Liu
  Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
  On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu
 jiangning@arm.com
  wrote:
  
  
   -Original Message-
   From: Richard Guenther [mailto:richard.guent...@gmail.com]
   Sent: Tuesday, September 27, 2011 3:41 PM
   To: Jiangning Liu
   Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
   Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
  
   On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu
  jiangning@arm.com
   wrote:
Think of it this way.  What the IR says is there is no barrier
   between
those moves.  You either have an implicit barrier (which is
 what
  you
are proposing) or you have it explicitly.  I think we all
 rather
   have
more things explicit rather than implicit in the IR.  And that
  has
been the overall feeling for a few years now.
   
   
Sorry, I'm afraid I can't agree with you. Instead, I think
 using
   barrier to describe this kind of dependence is a kind of implicit
   method. Please note that this is not an usual data dependence
 issue.
   The stack pointer change doesn't have any dependence with memory
  access
   at all.
  
   It is similar to atomic instructions that require being an
   optimization/memory barrier.  Sure it is not a usual data
 dependence
   (otherwise we would handle
   it already), so the targets have to explicitly express the
  dependence
   somehow, for which we only have barriers right now.
  
  
   Richard,
  
   Thanks for your explanation. It's explicit to back-end, while it's
  implicit
   to scheduler in middle end, because barrier can decide dependence
 in
   scheduler but barrier can be generated from several different
  scenarios.
   It's unsafe and prone to introduce bug if any one of the scenarios
  requiring
   generating barriers is being missed in back-end.
  
   Between middle-end and back-end, we should have interfaces that is
  easy to
   be implemented by back-end. After all, middle-end itself can't
  consist of a
   compiler, and vice versa. Back-end needs middle-end's help to make
  sure
   back-end is easy to be implemented and reduce the possibility of
  introducing
   bugs.
  
   Without an explicit hook as I'm proposing, back-end implementers
 have
  to
   clearly know all scenarios of generating barrier very clearly,
  because there
   isn't any code tips and comments in middle end telling back-end
 the
  list of
   all scenarios on generating barriers.
  
   Yes, barrier is a perfect interface for scheduler in theory. But
 from
   engineering point of view, I think it's better to explicitly
 define
  an
   interface to describe stack red zone and inform back-end, or vice
  versa. Not
   like computer, people is easy to make mistake if you don't tell
 them.
  On
   this bug, the fact is over the years different back-ends made
 similar
  bugs.
  
   GCC is really a perfect platform on building new ports, and I saw
 a
  lot of
   new back-ends. The current middle end is unsafe, if port doesn't
  support
   stack red zone and back-ends doesn't generate barrier for it. Why
  can't we
   explicitly clarify this in compiler code between middle end and
 back
  end?
   What if any other back-end (new or old) NOT supporting stack red
 zone
   exposing the similar bug again?
 
  There are gazillion things you have to explicitly get right in your
  backends,
  so I don't see why exposing proper scheduling barriers should be
  special,
  and there, why red-zones should be special (as opposed to other
  occasions
  where you need to emit barriers from the backend for the scheduler).
 
 
  Richard,
 
  This is because,
 
  1) Current scheduler is unsafe if back-end doesn't generate barrier
 for a
  port which doesn't support stack red zone at all.
  2) Implementing barrier in back-end is a burden to new back-end
  implementation for ports not supporting stack red zone at all.
  3) There are many other ports not reporting bugs around this. I doubt
 there
  isn't bug for them.
  4) There are over 300 TARGET HOOKS being defined in target.def. I
 don't
  think adding this interface is hurting GCC.
 
 I don't argue that your solution is not acceptable, just your reasoning
 is bogus IMHO. 
 1) is a target bug, 

Why should back-ends handle a thing that doesn't exist at all for them? I
don't see clear logic here. 

 2) huh, I doubt that this is the
 biggest issue
 one faces when implementing a new target, 

I never say

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-28 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Wednesday, September 28, 2011 5:56 PM
 To: Jiangning Liu
 Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Wed, Sep 28, 2011 at 11:40 AM, Jiangning Liu jiangning@arm.com
 wrote:
 
 
  -Original Message-
  From: Richard Guenther [mailto:richard.guent...@gmail.com]
  Sent: Wednesday, September 28, 2011 5:20 PM
  To: Jiangning Liu
  Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
  On Wed, Sep 28, 2011 at 11:10 AM, Jiangning Liu
 jiangning@arm.com
  wrote:
  
  
   -Original Message-
   From: Richard Guenther [mailto:richard.guent...@gmail.com]
   Sent: Wednesday, September 28, 2011 4:39 PM
   To: Jiangning Liu
   Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
   Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
  
   On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu
  jiangning@arm.com
   wrote:
   
   
-Original Message-
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, September 27, 2011 3:41 PM
To: Jiangning Liu
Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
   
On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu
   jiangning@arm.com
wrote:
 Think of it this way.  What the IR says is there is no
 barrier
between
 those moves.  You either have an implicit barrier (which is
  what
   you
 are proposing) or you have it explicitly.  I think we all
  rather
have
 more things explicit rather than implicit in the IR.  And
 that
   has
 been the overall feeling for a few years now.


 Sorry, I'm afraid I can't agree with you. Instead, I think
  using
barrier to describe this kind of dependence is a kind of
 implicit
method. Please note that this is not an usual data dependence
  issue.
The stack pointer change doesn't have any dependence with
 memory
   access
at all.
   
It is similar to atomic instructions that require being an
optimization/memory barrier.  Sure it is not a usual data
  dependence
(otherwise we would handle
it already), so the targets have to explicitly express the
   dependence
somehow, for which we only have barriers right now.
   
   
Richard,
   
Thanks for your explanation. It's explicit to back-end, while
 it's
   implicit
to scheduler in middle end, because barrier can decide
 dependence
  in
scheduler but barrier can be generated from several different
   scenarios.
It's unsafe and prone to introduce bug if any one of the
 scenarios
   requiring
generating barriers is being missed in back-end.
   
Between middle-end and back-end, we should have interfaces that
 is
   easy to
be implemented by back-end. After all, middle-end itself can't
   consist of a
compiler, and vice versa. Back-end needs middle-end's help to
 make
   sure
back-end is easy to be implemented and reduce the possibility
 of
   introducing
bugs.
   
Without an explicit hook as I'm proposing, back-end
 implementers
  have
   to
clearly know all scenarios of generating barrier very clearly,
   because there
isn't any code tips and comments in middle end telling back-end
  the
   list of
all scenarios on generating barriers.
   
Yes, barrier is a perfect interface for scheduler in theory.
 But
  from
engineering point of view, I think it's better to explicitly
  define
   an
interface to describe stack red zone and inform back-end, or
 vice
   versa. Not
like computer, people is easy to make mistake if you don't tell
  them.
   On
this bug, the fact is over the years different back-ends made
  similar
   bugs.
   
GCC is really a perfect platform on building new ports, and I
 saw
  a
   lot of
new back-ends. The current middle end is unsafe, if port
 doesn't
   support
stack red zone and back-ends doesn't generate barrier for it.
 Why
   can't we
explicitly clarify this in compiler code between middle end and
  back
   end?
What if any other back-end (new or old) NOT supporting stack
 red
  zone
exposing the similar bug again?
  
   There are gazillion things you have to explicitly get right in
 your
   backends,
   so I don't see why exposing proper scheduling barriers should be
   special,
   and there, why red-zones should be special (as opposed to other
   occasions
   where you need to emit barriers from the backend for the
 scheduler).
  
  
   Richard,
  
   This is because,
  
   1) Current scheduler is unsafe if back-end doesn't generate
 barrier
  for a
   port which doesn't support stack red zone at all.
   2) Implementing barrier in back-end is a burden to new back-end
   implementation for ports not supporting stack red zone at all.
   3) There are many other ports not reporting bugs around

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-27 Thread Jiangning Liu


 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, September 27, 2011 3:41 PM
 To: Jiangning Liu
 Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu jiangning@arm.com
 wrote:
  Think of it this way.  What the IR says is there is no barrier
 between
  those moves.  You either have an implicit barrier (which is what you
  are proposing) or you have it explicitly.  I think we all rather
 have
  more things explicit rather than implicit in the IR.  And that has
  been the overall feeling for a few years now.
 
 
  Sorry, I'm afraid I can't agree with you. Instead, I think using
 barrier to describe this kind of dependence is a kind of implicit
 method. Please note that this is not an usual data dependence issue.
 The stack pointer change doesn't have any dependence with memory access
 at all.
 
 It is similar to atomic instructions that require being an
 optimization/memory barrier.  Sure it is not a usual data dependence
 (otherwise we would handle
 it already), so the targets have to explicitly express the dependence
 somehow, for which we only have barriers right now.
 

Richard,

Thanks for your explanation. It's explicit to the back-end, while it's
implicit to the scheduler in the middle end, because a barrier can decide
dependence in the scheduler, but barriers can be generated from several
different scenarios. It's unsafe and prone to introduce bugs if any one of
the scenarios requiring generating barriers is missed in a back-end.

Between middle-end and back-end, we should have interfaces that are easy to
implement in the back-end. After all, the middle-end by itself can't
constitute a compiler, and vice versa. Back-end needs middle-end's help to
make sure the back-end is easy to implement and to reduce the possibility of
introducing bugs.

Without an explicit hook as I'm proposing, back-end implementers have to
know all scenarios of generating barriers very clearly, because there are no
code tips or comments in the middle end telling the back-end the list of all
scenarios on generating barriers.

Yes, barrier is a perfect interface for the scheduler in theory. But from an
engineering point of view, I think it's better to explicitly define an
interface to describe the stack red zone and inform the back-end, or vice
versa. Unlike computers, people easily make mistakes if you don't tell them.
On this bug, the fact is that over the years different back-ends made
similar bugs.

GCC is really a perfect platform for building new ports, and I saw a lot of
new back-ends. The current middle end is unsafe if a port doesn't support
stack red zone and its back-end doesn't generate barriers for it. Why can't
we explicitly clarify this in compiler code between the middle end and back
end? What if any other back-end (new or old) NOT supporting stack red zone
exposes a similar bug again?
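
To make the failure mode concrete, here is a minimal sketch (illustrative
only, not taken from the PR):

/* A leaf function whose frame is live right up to the epilogue.  On a
   target without a red zone, the scheduler must not hoist the final
   reload past the stack adjustment that releases the frame, e.g.
   (ARM, sketched):

       ldr   r0, [sp, #4]   @ reload the result from the frame
       add   sp, sp, #8     @ release the frame
       bx    lr

   If the first two instructions are swapped, the load becomes
   ldr r0, [sp, #-4] and reads below sp; a signal delivered between
   the two instructions can clobber that slot first.  */
int
leaf (int i)
{
  volatile int buf[2] = { i, i + 1 };
  return buf[0] + buf[1];
}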

Thanks,
-Jiangning

 Richard.
 
  No matter what solution itself is, the problem itself is a quite a
 common one on ISA level, so we should solve it in middle-end, because
 middle-end is shared for all ports.
 
  My proposal avoids problems in future. Any new ports and new back-end
 implementations needn't explicitly define this hook in a very safe way.
 But if one port wants to unsafely introduce red zone, we can
 explicitly define this hook in back-end. This way, we would avoid the
 bug in the earliest time. Do you really want to hide this problem in
 back-end silently? And wait others to report bug and then waste time to
 get it fixed again?
 
  The facts I see is over the years, for different ports reported the
 similar issue around this, and somebody tried to fix them. Actually,
 this kind of problem should be fixed in design level. If the first
 people solve this bug can give solution in middle end, we would not
 need to waste time on filing those bugs in bug zilla and fixing them
 around this problem.
 
  Thanks,
  -Jiangning
 
 
 
 
 
 
 
 






RE: A question about detecting array bounds for case Warray-bounds-3.c

2011-09-26 Thread Jiangning Liu
PING...

 -Original Message-
 From: Jiangning Liu [mailto:jiangning@arm.com]
 Sent: Thursday, September 22, 2011 10:19 AM
 To: gcc@gcc.gnu.org
 Cc: 'ja...@gcc.gnu.org'; 'muel...@gcc.gnu.org'; 'rgue...@gcc.gnu.org';
 Matthew Gretton-Dann
 Subject: A question about detecting array bounds for case Warray-
 bounds-3.c
 
 Hi,
 
  For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, obviously it is an
  invalid C program, because the last iterations of all the loops access
  the arrays beyond the max size of the corresponding array
  declarations. The condition checking the upper bound should be <
  rather than <=.
 
 Right now, GCC compiler doesn't report any warning messages for this
 case, should it be a bug in both test case and compiler?
 
 But looking at http://gcc.gnu.org/PR31227 , it seems this test case is
 designed to be like this on purpose. Anybody can explain about this?
 
 The case is like below,
 
 /* { dg-do compile } */
 /* { dg-options -O2 -Warray-bounds } */
 /* based on PR 31227 */
 
 struct S
 {
   const char *abday[7];
   const char *day[7];
   const char *abmon[12];
   const char *mon[12];
   const char *am_pm[2];
 };
 
 ...
 
    for (cnt = 0; cnt <= 7; ++cnt)
  {
    iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: "");
    iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
  }
  
    for (; cnt <= 14; ++cnt)
  {
    iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: "");
    iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
  }
  
    for (; cnt <= 26; ++cnt)
  {
    iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: "");
    iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
  }
  
    for (; cnt <= 38; ++cnt)
  {
    iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: "");
    iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
  }
  
    for (; cnt <= 40; ++cnt)
  {
    iov[2 + cnt].iov_base = (void *) (time->am_pm[cnt - 38] ?: "");
    iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
  }
 
 Thanks,
 -Jiangning





RE: A case that PRE optimization hurts performance

2011-09-26 Thread Jiangning Liu
  * Without PRE,
 
  Path1:
         movl    $2, %eax
         cmpl    $1, %eax
         je      .L3
 
  Path2:
         movb    $3, %al
         cmpl    $1, %eax
         je      .L3
 
  Path3:
         cmpl    $1, %eax
         jne     .L14
 
  * With PRE,
 
  Path1:
         movl    $1, %ebx
         movl    $2, %eax
         testb   %bl, %bl
         je      .L3
 
  Path2:
         movl    $1, %ebx
         movb    $3, %al
         testb   %bl, %bl
         je      .L3
 
  Path3:
         cmpl    $1, %eax
         setne   %bl
         testb   %bl, %bl
         jne     .L14
 
  Do you have any more thoughts?
 
 It seems to me that with PRE all the testb %bl, %bl
 should be evaluated at compile-time considering the
 preceding movl $1, %ebx.  Am I missing something?
 

Yes. Can this be done by PRE or any other optimizations in middle end?

Thanks,
-Jiangning

 Richard.
 






RE: A case that PRE optimization hurts performance

2011-09-26 Thread Jiangning Liu


 -Original Message-
 From: Jeff Law [mailto:l...@redhat.com]
 Sent: Tuesday, September 27, 2011 12:43 AM
 To: Richard Guenther
 Cc: Jiangning Liu; gcc@gcc.gnu.org
 Subject: Re: A case that PRE optimization hurts performance
 
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 09/26/11 05:00, Richard Guenther wrote:
  On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu
  jiangning@arm.com wrote:
  * Without PRE,
 
  Path1: movl $2, %eax; cmpl $1, %eax; je .L3
 
  Path2: movb $3, %al; cmpl $1, %eax; je .L3
 
  Path3: cmpl $1, %eax; jne .L14
 
  * With PRE,
 
  Path1: movl $1, %ebx; movl $2, %eax; testb %bl, %bl; je .L3
 
  Path2: movl $1, %ebx; movb $3, %al; testb %bl, %bl; je .L3
 
  Path3: cmpl $1, %eax; setne %bl; testb %bl, %bl; jne .L14
 
  Do you have any more thoughts?
 
   It seems to me that with PRE all the testb %bl, %bl should be
   evaluated at compile-time considering the preceding movl $1,
   %ebx.  Am I missing something?
 
 
  Yes. Can this be done by PRE or any other optimizations in middle
  end?
 
  Hm, the paths as you quote them are obfuscated by missed
  jump-threading. On the tree level we have
 
  # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)>
  # prephitmp.6_1 = PHI <1(5), 1(4), 1(6), prephitmp.6_3(7)>
  <L10>: t_14 = t_24 + 1;
  D.2729_6 = MEM[base: t_14, offset: 0B];
  D.2732_7 = D.2729_6 != 0;
  D.2734_9 = prephitmp.6_1 & D.2732_7;
  if (D.2734_9 != 0)

  where we could thread the cases with prephitmp.6_1 == 1,
  ultimately removing the & and forwarding the D.2729_6 != 0 test.
  Which would of course cause some code duplication.
 
  Jeff, you recently looked at tree jump-threading, can you see if
  we can improve things on this particular testcase?
 There's nothing threading can do here because it doesn't know anything
 about the value MEM[t14].
 

Jeff, 

Could you please explain more about this? What information does jump
threading need to know about MEM[t14]? Do you mean it's hard to duplicate
that basic block for some reason?

Thanks,
-Jiangning

 
 Jeff
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.11 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
 
 iQEcBAEBAgAGBQJOgKuLAAoJEBRtltQi2kC75aIH/iikuOQXrMrQJFbQw0COXznB
 OGq8iXdGwTJGH13vxdItTE0upJp7RgUVLzuhdqj1elTLHv/ujYygMsQRNGKcc8tb
 GMLECmWDhZqQTFXcTJCgJNZiv7MH1PNELXSdIkkSnxY+pwyn9AX5D3+HcTSjGU6B
 51AdUNVph/VSaVboAgcrFpu9S0pX9HVTqFy4JI83Lh613zDVSmPo14DDy7vjBvE9
 2Srlvlw0srYup97bGmRqN8wT4ZLLlyYSB2rjEFc6jmgXVncxiteQYIUZpy0lcC0M
 q3j80aXjZ57/iWyAbqDr1jI5tbVKDBkRa9LL1jvn9534adiG4GrnSMPhoog0ibA=
 =azr5
 -END PGP SIGNATURE-






RE: [PATCH, testsuite] Add loop unrolling command line options for some test cases

2011-09-26 Thread Jiangning Liu
 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Saturday, September 24, 2011 6:12 PM
 To: Jiangning Liu
 Cc: Mike Stump; gcc-patches@gcc.gnu.org; r...@cebitec.uni-bielefeld.de
 Subject: Re: [PATCH, testsuite] Add loop unrolling command line options
 for some test cases
 
 On Thu, Sep 22, 2011 at 6:22 AM, Jiangning Liu jiangning@arm.com
 wrote:
  Hi Mike,
 
  OK. I will wait 24 more hours. If no objections by then, I will get
 it
  checked into trunk.
 
 I don't think you need -funroll-loops though.
 

The intention of those test cases isn't hurt if -funroll-loops is added,
right?

  Thanks,
  -Jiangning
 
  -Original Message-
  From: Mike Stump [mailto:mikest...@comcast.net]
  Sent: Thursday, September 22, 2011 3:10 AM
  To: Jiangning Liu
  Cc: gcc-patches@gcc.gnu.org; r...@cebitec.uni-bielefeld.de
  Subject: Re: [PATCH, testsuite] Add loop unrolling command line
 options
  for some test cases
 
  On Sep 21, 2011, at 1:22 AM, Jiangning Liu wrote:
   The fix is to explicitly turn on loop unroll and set max-unroll-
 times
  to 8,
   which is larger than the unrolling times being detected in the
 cases.
 
  Sounds reasonable to me.  Ok, though, do watch for any comments by
  people that know more than I.
 
 
 
 
 






[PATCH] Fix stack red zone bug (PR38644)

2011-09-26 Thread Jiangning Liu
This patch fixes PR38644, a 3-year-old bug. 

From the discussions in the mailing list and Bugzilla, I think a middle-end
fix is the common view. Although there are still some gaps on how to fix it
in the middle end, I think this patch at least moves the problem from the
back-end to the middle-end, which makes the GCC infrastructure stronger than
before. Otherwise, every back-end needs to find and fix this problem by
itself.

Since this bug was pinged many times by customers, I think at least we can
move on with this patch. If necessary, we can then follow up to build an
even better solution in the middle end later on.

The patch is tested on x86 and ARM.

ChangeLog:

* config/i386/i386.c (ix86_stack_using_red_zone): Change inline
to be extern.
(TARGET_STACK_USING_RED_ZONE): New.
* config/rs6000/rs6000.c (rs6000_stack_using_red_zone): New.
(TARGET_STACK_USING_RED_ZONE): New.
(offset_below_red_zone_p): Change to use new hook 
TARGET_STACK_USING_RED_ZONE.
* doc/tm.texi (TARGET_STACK_USING_RED_ZONE): New.
* doc/tm.texi.in (TARGET_STACK_USING_RED_ZONE): New.
* sched-deps.c (sched_analyze_1): If the stack pointer is being 
modified and stack red zone is not supported for ports, flush out 
all memory references as they may become invalid if moved across 
the stack adjustment.
* target.def (stack_using_red_zone): New.
* testsuite/gcc.target/arm/stack-red-zone.c: New.

Thanks,
-Jiangning
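
The core of the sched-deps.c change has the following shape (an
abbreviated sketch of the hunk in the full diff below; the hook name
comes from this patch):

/* In sched_analyze_1: when the destination of a SET is the stack
   pointer and the target has no red zone, treat the stack adjustment
   as a barrier for pending memory references.  */
if (dest == stack_pointer_rtx
    && !targetm.stack_using_red_zone ())
  flush_pending_lists (deps, insn, false, true);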

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ff8c49f..ce486da 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2631,8 +2631,8 @@ static const char *const
cpu_names[TARGET_CPU_DEFAULT_max] =
 

 /* Return true if a red-zone is in use.  */
 
-static inline bool
-ix86_using_red_zone (void)
+extern bool
+ix86_stack_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
 }
@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif
 
+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE ix86_stack_using_red_zone
+
 #undef TARGET_FUNCTION_VALUE
 #define TARGET_FUNCTION_VALUE ix86_function_value
 
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 1ab57e5..411cb09 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1537,6 +1537,9 @@ static const struct attribute_spec
rs6000_attribute_table[] =
 #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail
 #endif
 
+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone
+
 /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors
The PowerPC architecture requires only weak consistency among
processors--that is, memory accesses between processors need not be
@@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int
using_mfcr_multiple)
}
 }
 
+/* Return true if the ABI allows red zone access.  */
+extern bool
+rs6000_stack_using_red_zone (void)
+{
+   return (DEFAULT_ABI != ABI_V4);
+}
+
 /* Return true if OFFSET from stack pointer can be clobbered by signals.
V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes
below stack pointer not cloberred by signals.  */
@@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int
using_mfcr_multiple)
 static inline bool
 offset_below_red_zone_p (HOST_WIDE_INT offset)
 {
-  return offset < (DEFAULT_ABI == ABI_V4
+  return offset < (!TARGET_STACK_USING_RED_ZONE
   ? 0
   : TARGET_32BIT ? -220 : -288);
 }
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..c848806 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11332,6 +11332,22 @@ to the stack.  Therefore, this hook should return
true in general, but
 false for naked functions.  The default implementation always returns true.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void)
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an
access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default is to return false which is safe and appropriate for most
targets.
+@end deftypefn
+
 @deftypevr {Target Hook} {unsigned HOST_WIDE_INT} TARGET_CONST_ANCHOR
 On some architectures it can take multiple instructions to synthesize
 a constant.  If there is another constant already in a register that
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..0fa9e10 100644
--- a/gcc/doc/tm.texi.in
+++ 

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-26 Thread Jiangning Liu


 -Original Message-
 From: Andrew Pinski [mailto:pins...@gmail.com]
 Sent: Tuesday, September 27, 2011 5:31 AM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
 
 On Mon, Sep 26, 2011 at 3:26 AM, Jiangning Liu jiangning@arm.com
 wrote:
  This patch is fix PR38644, a 3-year-old bug.
 
  From the discussions in mail list and bugzilla, I think the middle
 end fix
  is a common view. Although there are stills some gaps on how to fix
 it in
  middle end, I think this patch at least moves the problem from back-
 end to
  middle-end, which makes GCC infrastructure stronger than before.
 Otherwise,
  any back-end needs to find and fix this problem by itself.
 
 I don't see why you think that is the common view that the fix should
 be in the middle-end.  I and a few others think the back-end should be
 where the barrier emitted from.  The middle-end should not have stack
 accesses as being special in anyway when it comes down to scheduling.
 What happens when a new scheduler comes around?  Then this has to be
 fixed again in that new scheduler rather than having the barrier doing
 the correct thing for all schedulers.
 

Hi Andrew,

Thanks for your kindly response!

Sorry, for this bug I didn't see your valuable comments previously in either 
Bugzilla or the [RFC] I sent out a long time ago on the gcc mailing list. What 
I saw was a bunch of people agreeing this problem should be fixed in the middle end.

For example, 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644#c48
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644#c49

My comments,
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644#c35

I'd like to repeat my points here.

As I mentioned, it shouldn't be the back-end's responsibility to find this 
problem, i.e. instructions moving across a stack pointer change. ISAs can be 
split into two classes: one class doesn't support a stack red zone, and the other 
class does. If we agree this is a general issue, I think it should be solved in 
the middle end rather than in the back-end.

Yes, a back-end can provide a barrier to explicitly represent the dependence, but I 
think this is a kind of workaround, because this way we are hoping the back-end 
detects this kind of dependence, while the dependence is common to every 
back-end which doesn't support a stack red zone. If we carefully analyze 
the implementations in the x86 and powerpc back-ends supporting stack red zone, we 
may find they are doing almost the same things.

In particular, I think retargetability is the most valuable treasure of 
GCC. Thinking of implementing a new port, do we want to write new code to 
find this problem, no matter whether the new port supports a stack red zone or 
not? If the answer is YES, it means we need to write code similar to what the 
x86-64 and powerpc back-ends are currently doing. Obviously, that is 
non-trivial work. This way I don't think it's good for GCC's retargetability. 
Why don't we abstract the common thing in the middle-end with a very clean 
interface?

Finally, it's OK with me if you say the scheduler can only treat dependence as a 
clean interface. But this point doesn't imply the stack red zone issue must be 
handled in each back-end separately. If we want to keep the scheduler clean 
enough, a simpler solution is to add a pass in the middle-end to generate barriers 
before the scheduler, and have this pass handle the general stack-red-zone issue 
using the interface I'm defining in the patch, but obviously that is a kind of 
over-design so far. 

Thanks,
-Jiangning

 Thanks,
 Andrew Pinski






RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-26 Thread Jiangning Liu
Fix a typo and CC x86/rs6000/arm ports maintainers.

ChangeLog:

* config/i386/i386.c (ix86_stack_using_red_zone): Change inline
to be extern.
(TARGET_STACK_USING_RED_ZONE): New.
* config/rs6000/rs6000.c (rs6000_stack_using_red_zone): New.
(TARGET_STACK_USING_RED_ZONE): New.
(offset_below_red_zone_p): Change to use new hook 
TARGET_STACK_USING_RED_ZONE.
* doc/tm.texi (TARGET_STACK_USING_RED_ZONE): New.
* doc/tm.texi.in (TARGET_STACK_USING_RED_ZONE): New.
* sched-deps.c (sched_analyze_1): If the stack pointer is being 
modified and stack red zone is not supported for ports, flush out 
all memory references as they may become invalid if moved across 
the stack adjustment.
* target.def (stack_using_red_zone): New.
* testsuite/gcc.target/arm/stack-red-zone.c: New.

Thanks,
-Jiangning

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ff8c49f..1c16391 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2631,7 +2631,7 @@ static const char *const
cpu_names[TARGET_CPU_DEFAULT_max] =
 

 /* Return true if a red-zone is in use.  */
 
-static inline bool
+extern bool
 ix86_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif
 
+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE ix86_using_red_zone
+
 #undef TARGET_FUNCTION_VALUE
 #define TARGET_FUNCTION_VALUE ix86_function_value
 
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 1ab57e5..411cb09 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1537,6 +1537,9 @@ static const struct attribute_spec
rs6000_attribute_table[] =
 #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail
 #endif
 
+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone
+
 /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors
The PowerPC architecture requires only weak consistency among
processors--that is, memory accesses between processors need not be
@@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int
using_mfcr_multiple)
}
 }
 
+/* Return true if the ABI allows red zone access.  */
+extern bool
+rs6000_stack_using_red_zone (void)
+{
+   return (DEFAULT_ABI != ABI_V4);
+}
+
 /* Return true if OFFSET from stack pointer can be clobbered by signals.
V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes
below stack pointer not cloberred by signals.  */
@@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int
using_mfcr_multiple)
 static inline bool
 offset_below_red_zone_p (HOST_WIDE_INT offset)
 {
-  return offset < (DEFAULT_ABI == ABI_V4
+  return offset < (!TARGET_STACK_USING_RED_ZONE
   ? 0
   : TARGET_32BIT ? -220 : -288);
 }
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..c848806 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11332,6 +11332,22 @@ to the stack.  Therefore, this hook should return
true in general, but
 false for naked functions.  The default implementation always returns true.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void)
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an
access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default is to return false which is safe and appropriate for most
targets.
+@end deftypefn
+
 @deftypevr {Target Hook} {unsigned HOST_WIDE_INT} TARGET_CONST_ANCHOR
 On some architectures it can take multiple instructions to synthesize
 a constant.  If there is another constant already in a register that
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..0fa9e10 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -11223,6 +11223,22 @@ to the stack.  Therefore, this hook should return
true in general, but
 false for naked functions.  The default implementation always returns true.
 @end deftypefn
 
+@hook TARGET_STACK_USING_RED_ZONE
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an
access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it 

RE: [PATCH] Fix stack red zone bug (PR38644)

2011-09-26 Thread Jiangning Liu
 Think of it this way.  What the IR says is there is no barrier between
 those moves.  You either have an implicit barrier (which is what you
 are proposing) or you have it explicitly.  I think we all rather have
 more things explicit rather than implicit in the IR.  And that has
 been the overall feeling for a few years now.
 

Sorry, I'm afraid I can't agree with you. Instead, I think using barrier to 
describe this kind of dependence is a kind of implicit method. Please note that 
this is not an usual data dependence issue. The stack pointer change doesn't 
have any dependence with memory access at all.

No matter what the solution itself is, the problem itself is quite a common one 
at the ISA level, so we should solve it in the middle-end, because the middle-end 
is shared across all ports. 

My proposal avoids problems in the future. Any new ports and new back-end 
implementations needn't explicitly define this hook, in a very safe way. But if 
one port wants to unsafely introduce a red zone, we can explicitly define this 
hook in the back-end. This way, we would avoid the bug at the earliest time. Do you 
really want to hide this problem in the back-end silently, and wait for others to 
report bugs and then waste time getting them fixed again?

The fact I see is that over the years different ports reported similar 
issues around this, and somebody tried to fix them. Actually, this kind of 
problem should be fixed at the design level. If the first people to solve this bug had 
given a solution in the middle end, we would not need to waste time filing those 
bugs in Bugzilla and fixing them around this problem.

Thanks,
-Jiangning









A question about detecting array bounds for case Warray-bounds-3.c

2011-09-21 Thread Jiangning Liu
Hi,

For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, obviously it is an invalid
C program, because the last iterations of all the loops access the arrays
beyond the max size of the corresponding array declarations. The condition
checking the upper bound should be < rather than <=. 

Right now, GCC compiler doesn't report any warning messages for this case,
should it be a bug in both test case and compiler?

But looking at http://gcc.gnu.org/PR31227 , it seems this test case is
designed to be like this on purpose. Anybody can explain about this?

The case is like below,

/* { dg-do compile } */
/* { dg-options -O2 -Warray-bounds } */
/* based on PR 31227 */

struct S
{
  const char *abday[7];
  const char *day[7];
  const char *abmon[12];
  const char *mon[12];
  const char *am_pm[2];
};

...

  for (cnt = 0; cnt <= 7; ++cnt)
    {
      iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: "");
      iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
    }

  for (; cnt <= 14; ++cnt)
    {
      iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: "");
      iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
    }

  for (; cnt <= 26; ++cnt)
    {
      iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: "");
      iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
    }

  for (; cnt <= 38; ++cnt)
    {
      iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: "");
      iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
    }

  for (; cnt <= 40; ++cnt)
    {
      iov[2 + cnt].iov_base = (void *) (time->am_pm[cnt - 38] ?: "");
      iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
    }

Thanks,
-Jiangning
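
For reference, a self-contained reduction of the overrun pattern the case
relies on (illustrative only, not part of the testsuite):

#include <stdio.h>

int main (void)
{
  const char *abday[7] =
    { "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" };
  /* The bound mirrors the test case: <= 7 lets cnt reach 7 on the
     last iteration, one past the end; < 7 would be correct.  */
  for (int cnt = 0; cnt <= 7; ++cnt)
    printf ("%s\n", abday[cnt]);   /* cnt == 7 is out of bounds */
  return 0;
}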





[PATCH, testsuite] Add loop unrolling command line options for some test cases

2011-09-21 Thread Jiangning Liu
Hi,

The following test cases check the predictive commoning optimization by
detecting loop unrolling times. By default, max-unroll-times is 8. If
max-unroll-times is specified to be less than the expected unrolling times,
those cases would fail. 

The fix is to explicitly turn on loop unrolling and set max-unroll-times to 8,
which is larger than the unrolling times being detected in the cases.
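
For context, predictive commoning reuses values computed in earlier
iterations, and the reuse distance determines how many times the loop body
must be unrolled; a minimal loop of the kind these tests exercise
(illustrative, not taken from the testsuite) is:

/* a[i - 1] written in one iteration is read in the next; keeping it
   in a register requires unrolling the loop a fixed number of times,
   which is what the dump-file checks count.  */
void
f (int *a, int n)
{
  for (int i = 1; i < n; i++)
    a[i] += a[i - 1];
}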

ChangeLog:

2011-09-14  Jiangning Liu  jiangning@arm.com

* gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c: Explicitly turn on 
loop unroll and set max unroll times to 8.
* gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c: Likewise.
* gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c: Likewise.
* gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c: Likewise.
* gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c: Likewise.

Thanks,
-Jiangning

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c
index 16bd5c9..f1e52e5 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-do run } */
-/* { dg-options -O2 -fpredictive-commoning -fdump-tree-pcom-details } */
+/* { dg-options -O2 -funroll-loops --param max-unroll-times=8
-fpredictive-commoning -fdump-tree-pcom-details } */
 
 void abort (void);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
index 7275f28..27e53ee 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-do run } */
-/* { dg-options -O2 -fpredictive-commoning -fdump-tree-pcom-details } */
+/* { dg-options -O2 -funroll-loops --param max-unroll-times=8
-fpredictive-commoning -fdump-tree-pcom-details } */
 
 void abort (void);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c
index d500234..5dfe384 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options -O2 -fpredictive-commoning -fdump-tree-pcom-details } */
+/* { dg-options -O2 -funroll-loops --param max-unroll-times=8
-fpredictive-commoning -fdump-tree-pcom-details } */
 
 int a[1000], b[1000];
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c
index 6f06b7f..c29a46a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-do run } */
-/* { dg-options -O2 -fpredictive-commoning -fdump-tree-pcom-details } */
+/* { dg-options -O2 -funroll-loops --param max-unroll-times=8
-fpredictive-commoning -fdump-tree-pcom-details } */
 
 /* Test for predictive commoning of expressions, without reassociation.  */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c
index 134fc37..29444ab 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-do run } */
-/* { dg-options -O2 -fpredictive-commoning -fdump-tree-pcom-details } */
+/* { dg-options -O2 -funroll-loops --param max-unroll-times=8
-fpredictive-commoning -fdump-tree-pcom-details } */
 
 /* Test for predictive commoning of expressions, with reassociation.  */






RE: [PATCH, testsuite] Add loop unrolling command line options for some test cases

2011-09-21 Thread Jiangning Liu
Hi Mike,

OK. I will wait 24 more hours. If no objections by then, I will get it
checked into trunk.

Thanks,
-Jiangning

 -Original Message-
 From: Mike Stump [mailto:mikest...@comcast.net]
 Sent: Thursday, September 22, 2011 3:10 AM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org; r...@cebitec.uni-bielefeld.de
 Subject: Re: [PATCH, testsuite] Add loop unrolling command line options
 for some test cases
 
 On Sep 21, 2011, at 1:22 AM, Jiangning Liu wrote:
  The fix is to explicitly turn on loop unroll and set max-unroll-times
 to 8,
  which is larger than the unrolling times being detected in the cases.
 
 Sounds reasonable to me.  Ok, though, do watch for any comments by
 people that know more than I.






RE: A case that PRE optimization hurts performance

2011-09-15 Thread Jiangning Liu
Hi Richard,

I slightly changed the case to be like below,

int f(char *t) {
    int s = 0;

    while (*t && s != 1) {
        switch (s) {
        case 0:   /* path 1 */
            s = 2;
            break;
        case 2:   /* path 2 */
            s = 3; /* changed */
            break;
        default:  /* path 3 */
            if (*t == '-')
                s = 2;
            break;
        }
        t++;
    }

    return s;
}

-O2 is still worse than -O2 -fno-tree-pre. 

-O2 -fno-tree-pre result is 

f:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movzbl  (%edx), %ecx
        jmp     .L14
        .p2align 4,,7
        .p2align 3
.L5:
        movl    $2, %eax
.L7:
        addl    $1, %edx
        cmpl    $1, %eax
        movzbl  (%edx), %ecx
        je      .L3
.L14:
        testb   %cl, %cl
        je      .L3
        testl   %eax, %eax
        je      .L5
        cmpl    $2, %eax
        .p2align 4,,5
        je      .L17
        cmpb    $45, %cl
        .p2align 4,,5
        je      .L5
        addl    $1, %edx
        cmpl    $1, %eax
        movzbl  (%edx), %ecx
        jne     .L14
        .p2align 4,,7
        .p2align 3
.L3:
        popl    %ebp
        .p2align 4,,2
        ret
        .p2align 4,,7
        .p2align 3
.L17:
        movb    $3, %al
        .p2align 4,,3
        jmp     .L7

While -O2 result is 

f:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        pushl   %ebx
        movzbl  (%edx), %ecx
        jmp     .L14
        .p2align 4,,7
        .p2align 3
.L5:
        movl    $1, %ebx
        movl    $2, %eax
.L7:
        addl    $1, %edx
        testb   %bl, %bl
        movzbl  (%edx), %ecx
        je      .L3
.L14:
        testb   %cl, %cl
        je      .L3
        testl   %eax, %eax
        je      .L5
        cmpl    $2, %eax
        .p2align 4,,5
        je      .L16
        cmpb    $45, %cl
        .p2align 4,,5
        je      .L5
        cmpl    $1, %eax
        setne   %bl
        addl    $1, %edx
        testb   %bl, %bl
        movzbl  (%edx), %ecx
        jne     .L14
        .p2align 4,,7
        .p2align 3
.L3:
        popl    %ebx
        popl    %ebp
        ret
        .p2align 4,,7
        .p2align 3
.L16:
        movl    $1, %ebx
        movb    $3, %al
        jmp     .L7

You may notice that register %ebx is introduced, and some more instructions
around %ebx are generated as well, i.e.

setne   %bl
testb   %bl, %bl

I agree with you that in theory PRE does the right thing to minimize the
computation cost at the GIMPLE level. However, the problem is that the cost
of converting a comparison result to a bool value is not considered, so PRE
actually makes the binary code worse. For this case, as I summarized below,
completing the same functionality "With PRE" is worse than "Without PRE" on
all three paths,

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3

Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3

Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3

Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3

Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

Thanks,
-Jiangning

 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Tuesday, August 02, 2011 5:23 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: Re: A case that PRE optimization hurts performance
 
 On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu jiangning@arm.com
 wrote:
  Hi,
 
  For the following simple test case, PRE optimization hoists
 computation
  (s!=1) into the default branch of the switch statement, and finally
 causes
  very poor code generation. This problem occurs in both X86 and ARM,
 and I
  believe it is also a problem for other targets.
 
  int f(char *t) {
     int s=0;
 
     while (*t && s != 1) {
         switch (s) {
         case 0:
             s = 2;
             break;
         case 2:
             s = 1;
             break;
         default:
             if (*t == '-')
                 s = 1;
             break;
         }
         t++;
     }
 
     return s;
  }
 
  Taking X86 as an example, with option -O2 you may find 52
 instructions
  generated like below,
 
   f:
    0:   55                      push   %ebp
    1:   31 c0                   xor    %eax,%eax
    3:   89 e5                   mov    %esp,%ebp
    5:   57                      push   %edi
    6:   56                      push   %esi
    7:   53                      push   %ebx
    8:   8b 55 08                mov    0x8(%ebp),%edx
    b:   0f b6 0a                movzbl (%edx),%ecx
    e:   84 c9

[PATCH, testsuite, ARM] Change to expected failure for g++.dg/abi/local1.C on ARM target

2011-09-13 Thread Jiangning Liu
Here comes a patch to change the case g++.dg/abi/local1.C to be an expected
failure for the ARM target.

The C++ ABI for the ARM architecture says,



This runs contrary to §2.9.1 of [GC++ABI] which states:

It is intended that two type_info pointers point to equivalent type
descriptions if and only if the pointers are equal. An implementation must
satisfy this constraint, e.g. by using symbol preemption, COMDAT sections,
or other mechanisms.

Fortunately, we can ignore this requirement without violating the C++
standard provided that:

* type_info::operator== and type_info::operator!= compare the strings
returned by type_info::name(), not just the pointers to the RTTI objects and
their names.

* No reliance is placed on the address returned by type_info::name(). (That
is, t1.name() != t2.name() does not imply that t1 != t2).
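
A sketch of the name-based comparison those two bullets describe
(hypothetical helper, not the actual libstdc++ implementation):

#include <string.h>

/* Equality must be decided by the mangled name strings, not by the
   addresses of the RTTI objects or of the name strings themselves.  */
static int
type_info_equal (const char *name1, const char *name2)
{
  return strcmp (name1, name2) == 0;
}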



The patch is,

diff --git a/gcc/testsuite/g++.dg/abi/local1.C
b/gcc/testsuite/g++.dg/abi/local1.C
index 518193c..7f08a8f 100644
--- a/gcc/testsuite/g++.dg/abi/local1.C
+++ b/gcc/testsuite/g++.dg/abi/local1.C
@@ -1,4 +1,4 @@
-// { dg-do run }
+// { dg-do run { xfail { arm*-*-eabi* } } }
 // { dg-additional-sources local1-a.cc }
 
 #include typeinfo

ChangeLog:

2011-09-14  Jiangning Liu  jiangning@arm.com

* g++.dg/abi/local1.C: Change to XFAIL for ARM EABI target.

Thanks,
-Jiangning





Is VRP is too conservative to identify boolean value 0 and 1?

2011-09-02 Thread Jiangning Liu
Hi,

For the following small case,

int f(int i, int j)
{
    if (i == 1 && j == 2)
        return i;
    else
        return j;
}

with -O2 option, GCC has vrp2 dump like below,

==

Value ranges after VRP:

i_1: VARYING
i_2(D): VARYING
D.1249_3: [0, +INF]
j_4(D): VARYING
D.1250_5: [0, +INF]
D.1251_6: [0, +INF]
j_10: [2, 2]  EQUIVALENCES: { j_4(D) } (1 elements)


Removing basic block 3
f (int i, int j)
{
  _Bool D.1251;
  _Bool D.1250;
  _Bool D.1249;

bb 2:
  D.1249_3 = i_2(D) == 1;
  D.1250_5 = j_4(D) == 2;
  D.1251_6 = D.1250_5 & D.1249_3;
  if (D.1251_6 != 0)
goto bb 3;
  else
goto bb 4;

bb 3:

bb 4:
  # i_1 = PHI <1(3), j_4(D)(2)>
  return i_1;

}



Variables D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so
their value ranges should be

D.1249_3: [0, 1]
D.1250_5: [0, 1]
D.1251_6: [0, 1]

So why can't the current VRP find out this value range?

I'm asking this question because optimizations in the back-end need this
info to do advanced optimization.

Thanks,
-Jiangning




RE: Is VRP is too conservative to identify boolean value 0 and 1?

2011-09-02 Thread Jiangning Liu
Andrew,

I realize I don't need a back-end solution for my case at all; in the middle 
end I can directly use the _Bool type info! I appreciate your reply!

Thanks,
-Jiangning
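
In other words, a middle-end pass can rely on the type alone; a sketch in
GCC-internal style (assuming the usual tree accessors):

/* If an SSA name has _Bool type, its only possible values are 0 and
   1, so no VRP query is needed.  */
static bool
value_is_zero_or_one (tree name)
{
  return TREE_CODE (TREE_TYPE (name)) == BOOLEAN_TYPE;
}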


 -Original Message-
 From: Andrew Pinski [mailto:pins...@gmail.com]
 Sent: Friday, September 02, 2011 2:27 PM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org
 Subject: Re: Is VRP is too conservative to identify boolean value 0 and
 1?
 
 On Thu, Sep 1, 2011 at 10:58 PM, Jiangning Liu jiangning@arm.com
 wrote:
  D.1249_3: [0, 1]
  D.1250_5: [0, 1]
  D.1251_6: [0, 1]
 
 Those are equivalent to [0, MAX] as _Bool only has two different
 values, 0 and 1 (MAX).  Can you explain more about the optimization
 which you are working on that needs the ranges as (int)[0,1] rather
 than (_Bool)[0,MAX] ?
 
 Thanks,
 Andrew Pinski





[PATCH, testsuite, ARM] change XFAIL to pass for ARM on a case testing tree-ssa-dom

2011-08-26 Thread Jiangning Liu
Test case gcc.dg/tree-ssa/20040204-1.c can pass for -O1 after Richard
Guenther rguent...@suse.de fixed something in tree-ssa-dom. The
link_error should be optimized away for ARM targets as well.

The patch is:

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c
b/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c
index 45e44a1..470b585 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c
@@ -33,5 +33,5 @@ void test55 (int x, int y)
    that the && should be emitted (based on BRANCH_COST).  Fix this
    by teaching dom to look through && and register all components
as true.  */
-/* { dg-final { scan-tree-dump-times link_error 0 optimized { xfail { !
alpha*-*-* powerpc*-*-* cris-*-* crisv32-*-* hppa*-*-* i?86-*-* mmix-*-*
mips*-*-* m68k*-*-* moxie-*-* sparc*-*-* spu-*-* x86_64-*-* } } } } */
+/* { dg-final { scan-tree-dump-times link_error 0 optimized { xfail { !
alpha*-*-* arm*-*-* powerpc*-*-* cris-*-* crisv32-*-* hppa*-*-* i?86-*-*
mmix-*-* mips*-*-* m68k*-*-* moxie-*-* sparc*-*-* spu-*-* x86_64-*-* } } }
} */
 /* { dg-final { cleanup-tree-dump optimized } } */

gcc/testsuite/ChangeLog:

2011-08-26  Jiangning Liu  jiangning@arm.com

   PR tree-optimization/46021
   * gcc.dg/tree-ssa/20040204-1.c: Don't XFAIL on arm*-*-*.

Thanks,
-Jiangning





RE: [PATCH, ARM] Generate conditional compares in Thumb2 state

2011-08-17 Thread Jiangning Liu
Attached is the new patch file. Review please!

ChangeLog:

2011-08-18  Jiangning Liu  jiangning@arm.com

* config/arm/arm.md (*ior_scc_scc): Enable for Thumb2 as well.
(*ior_scc_scc_cmp): Likewise
(*and_scc_scc): Likewise.
(*and_scc_scc_cmp): Likewise.
(*and_scc_scc_nodom): Likewise.
(*cmp_ite0, *cmp_ite1, *cmp_and, *cmp_ior): Handle Thumb2.

Testsuite Changelog would be:

2011-08-18  Jiangning Liu  jiangning@arm.com

* gcc.target/arm/thumb2-cond-cmp-1.c: New. Make sure conditional 
compare can be generated.
* gcc.target/arm/thumb2-cond-cmp-2.c: Likewise.
* gcc.target/arm/thumb2-cond-cmp-3.c: Likewise.
* gcc.target/arm/thumb2-cond-cmp-4.c: Likewise.

Regression test against cortex-M0/M3/M4 profile with -mthumb option
doesn't show any new failures.

Thanks,
-Jiangning

fix_cond_cmp_5.patch
Description: Binary data


RE: [PATCH, ARM] Generate conditional compares in Thumb2 state

2011-08-11 Thread Jiangning Liu
Ramana,

I see only the following three combinations as meaningful, 

* cmp l, lPy // keep cmp, and length is 2, on thumb
* cmp r, rI  // keep cmp, and length is 4
* cmp r, L   // convert to cmn, and length is 4

According to the ARM ARM, for negative immediates all encodings for cmp/cmn
are 4 bytes long, i.e.

* CMP: encodings T2 and A1
* CMN: encodings T1 and A1

so we needn't make a difference to cover Pw and Pv.
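
(To see why "cmp r, L" can always become cmn: cmp r, #-n sets flags from
r - (-n), and cmn r, #n sets them from r + n, which is the same
computation.  A trivial host-side check of the arithmetic identity,
illustrative only:)

#include <assert.h>

int main (void)
{
  for (int r = -1000; r <= 1000; ++r)
    for (int n = 1; n <= 255; ++n)
      assert (r - (-n) == r + n);
  return 0;
}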

 Length is 8 bytes if you have Pw and L

For these cases, the length should be 10 for Thumb2 instead. 

Finally, if we want to describe all possibilities in constraints, we would
have the following 9 combinations,

* cmp1 has operands

(l,  l,  l,  r, r, r, r, r, r)
(lPy,lPy,lPy,rI,rI,rI,L, L, L)

* cmp2 has operands

(l,  r, r, l,  r, r, l,  r, r)
(lPy,rI,L, lPy,rI,L, lPy,rI,L)

So the lengths would be like below (I don't know how to write this in the
attribute section yet):

if (TARGET_THUMB2) {
  (set_attr "length" "6,8,8,8,10,10,8,10,10")
} else {
  (set_attr "length" "4,6,6,6,8,8,6,8,8")
}
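
One way to cross-check those numbers: each compare is 2 bytes for the
low-register (l/lPy) alternatives and 4 bytes otherwise, and in Thumb2 the
conditional compare needs a 2-byte it before it.  A quick host-side sketch
(illustrative only):

#include <stdio.h>

int main (void)
{
  /* encoding size of cmp1 and cmp2 for the 9 combinations above */
  const int cmp1[9] = { 2, 2, 2, 4, 4, 4, 4, 4, 4 };
  const int cmp2[9] = { 2, 4, 4, 2, 4, 4, 2, 4, 4 };
  for (int i = 0; i < 9; i++)
    printf ("%d%c", cmp1[i] + 2 + cmp2[i], i == 8 ? '\n' : ',');
  /* prints 6,8,8,8,10,10,8,10,10 -- matching the Thumb2 list above */
  return 0;
}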

Does it make sense?

Thanks,
-Jiangning

 -Original Message-
 From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org]
 Sent: Wednesday, August 10, 2011 6:40 PM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH, ARM] Generate conditional compares in Thumb2 state
 
 On 10 August 2011 09:20, Jiangning Liu jiangning@arm.com wrote:
  PING...
 
 
  BTW, in patch fix_cond_cmp_2.patch, the file mode of thumb2.md is
 carelessly
  changed, so please check attached new patch file fix_cond_cmp_3.patch.
 
 
 Please do not top post.
 
 I've been away for the whole of last week and then been away from my
 desk
 for the first 2 days this week and am still catching up with email .
 
 I'm missing a Changelog entry for this . Do I assume it is the same ?
 
 I've just noticed that the length attribute isn't correct for Thumb2
 state. When I last
 looked at this I must have missed the case where the cmp and the cmn
 are both 32 bit instructions.
 
 The length can vary between 6 and 10 bytes for the Thumb2 variant from
 my reading of the ARM-ARM.
 
  i.e. cmp reg, 8bitconstant
       it <cond>
       cmn reg, 8bitconstant
 
 Length = 6 bytes
 
 or even with
cmp reg, reg
it cond
    cmn<cond> reg, reg
 
  All registers betwen r0 and r7 .
 
 Length is 6 bytes if you have this for both l constraints.
 Length is 6 bytes - you should catch this with Pw and Pv respectively .
 Length is 8 bytes if you have Pw and L
 Length is 8 bytes if you have I and Pv
 Length is 10 bytes if you have I and L .
 
 
 Showing you an example with l and Py at this point of time and leaving
 the rest as homework. Yes it will duplicate a few cases but it helps
 getting the lengths absolutely right.
 
  (define_insn "*cmp_ite0"
    [(set (match_operand 6 "dominant_cc_register" "")
          (compare
           (if_then_else:SI
            (match_operator 4 "arm_comparison_operator"
             [(match_operand:SI 0 "s_register_operand" "l,r,r,r,r")
              (match_operand:SI 1 "arm_add_operand" "lPy,rI,L,rI,L")])
            (match_operator:SI 5 "arm_comparison_operator"
             [(match_operand:SI 2 "s_register_operand" "l,r,r,r,r")
              (match_operand:SI 3 "arm_add_operand" "lPy,rI,rI,L,L")])
            (const_int 0))
           (const_int 0)))]
    "TARGET_32BIT"
    "*
    {
      static const char * const cmp1[5][2] =
      {
        {\"cmp\\t%2, %3\",
         \"cmp\\t%0, %1\"},
        {\"cmp\\t%2, %3\",
         \"cmp\\t%0, %1\"},
        {\"cmp\\t%2, %3\",
         \"cmn\\t%0, #%n1\"},
        {\"cmn\\t%2, #%n3\",
         \"cmp\\t%0, %1\"},
        {\"cmn\\t%2, #%n3\",
         \"cmn\\t%0, #%n1\"}
      };
      static const char * const cmp2[5][2] =
      {
        {\"cmp%d5\\t%0, %1\",
         \"cmp%d4\\t%2, %3\"},
        {\"cmp%d5\\t%0, %1\",
         \"cmp%d4\\t%2, %3\"},
        {\"cmn%d5\\t%0, #%n1\",
         \"cmp%d4\\t%2, %3\"},
        {\"cmp%d5\\t%0, %1\",
         \"cmn%d4\\t%2, #%n3\"},
        {\"cmn%d5\\t%0, #%n1\",
         \"cmn%d4\\t%2, #%n3\"}
      };
      static const char * const ite[2] =
      {
        \"it\\t%d5\",
        \"it\\t%d4\"
      };
      int swap =
        comparison_dominates_p (GET_CODE (operands[5]),
                                GET_CODE (operands[4]));

      output_asm_insn (cmp1[which_alternative][swap], operands);
      if (TARGET_THUMB2) {
        output_asm_insn (ite[swap], operands);
      }
      output_asm_insn (cmp2[which_alternative][swap], operands);
      return \"\";
    }"
    [(set_attr "conds" "set")
     (set_attr "arch" "t2,any,any,any,any")
     (set_attr "length" "6,8,8,8,8")]
  )
 
 
 As for the extra problem exposed by this specific case, may we treat
 it as a
 separate fix to decouple it with this one, and I can give follow up
 later
 on? I think it is a general problem not only for the particular
 pattern
 it/op/it/op. But I'm not sure how far we can go to optimize this kind
 of
 problems introduced by IT block.
 
 
 Currently the way in which the Thumb2 backend generates
 conditional instructions and combines them further with other
 IT blocks is by running a state machine at the very end before assembly

RE: [PATCH, ARM] Generate conditional compares in Thumb2 state

2011-08-10 Thread Jiangning Liu
PING...

BTW, in patch fix_cond_cmp_2.patch, the file mode of thumb2.md is carelessly
changed, so please check attached new patch file fix_cond_cmp_3.patch.

Thanks,
-Jiangning

 -Original Message-
 From: Jiangning Liu [mailto:jiangning@arm.com]
 Sent: Monday, August 08, 2011 2:01 PM
 To: 'Ramana Radhakrishnan'
 Cc: gcc-patches@gcc.gnu.org
 Subject: RE: [PATCH, ARM] Generate conditional compares in Thumb2 state
 
 In attached new patch, arm_arch_thumb2 is changed to TARGET_THUMB2. I
 tried that with my patch command line option -mcpu=armv7-a9 doesn't
 generate IT instruction any longer, unless option -mthumb is being
 added.
 
 All of my tests assume command line option -mthumb, while cortex-M0,
 cortex-M3 cortex-M4 are covered by options -mcpu=cortex-m0, -
 march=armv7-m, and -march=armv7e-m respectively.
 
 As for the extra problem exposed by this specific case, may we treat it
 as a separate fix to decouple it with this one, and I can give follow
 up later on? I think it is a general problem not only for the
 particular pattern it/op/it/op. But I'm not sure how far we can go to
 optimize this kind of problems introduced by IT block. For this
 specific case, I see if conversion already generates conditional move
 before combination pass. So basically the peephole rules may probably
 work for most of the general scenarios. My initial thought is go over
 the rules introducing IT block and try to figure out all of the
 combination that two of this kinds of rules can be in sequential order.
 Maybe we only need to handle the most common case like this one. Since
 I do see a bunch of rules have something to do with problem, I'd like
 to look into all of them to give a most reasonable solution in a
 separate fix.
 
 Does it make sense?
 
 Thanks,
 -Jiangning
 
  -Original Message-
  From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org]
  Sent: Friday, August 05, 2011 9:20 AM
  To: Jiangning Liu
  Cc: gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH, ARM] Generate conditional compares in Thumb2
  state
 
  On 3 August 2011 08:48, Jiangning Liu jiangning@arm.com wrote:
   This patch is to generate more conditional compare instructions in
  Thumb2
   state. Given an example like below,
  
   int f(int i, int j)
   {
    if ( (i == '+') || (j == '-') ) {
      return i;
    } else {
      return j;
    }
   }
  
   Without the patch, compiler generates the following codes,
  
          sub     r2, r0, #43
          rsbs    r3, r2, #0
          adc     r3, r3, r2
          cmp     r1, #45
          it      eq
          orreq   r3, r3, #1
          cmp     r3, #0
          it      eq
          moveq   r0, r1
          bx      lr
  
   With the patch, compiler can generate conditional jump like below,
  
          cmp     r0, #43
          it      ne
          cmpne   r1, #45
          it      ne
          movne   r0, r1
          bx      lr
 
 
  Nice improvement but there could be a single it block to handle both
  and thus you could make this even better with
 
  cmp r0, #43
  itt ne
  cmpne r1 ,#45
  movne r0, r1
 
  The way to do this would be to try and split this post-reload
  unfortunately into the cmp instruction and the conditional compare
  with the appropriate instruction length - Then the backend has a
  chance of merging some of this into a single instruction.
  Unfortunately that won't be very straight-forward but that's a
  direction we probably ought to proceed with in this case.
 
  In a number of places:
 
   +   if (arm_arch_thumb2)
 
  Ah instead of this please use if (TARGET_THUMB2) - arm_arch_thumb2 is
  true based on the architecture levels and not necessarily if the user
  wants to generate Thumb code. I don't want an unnecessary IT
  instruction being emitted in the ASM block in ARM state for v7-a and
  above.
 
   Tested against arm-none-eabi target and no regression found.
 
  Presumably for ARM and Thumb2 state ?
 
 
  cheers
  Ramana


fix_cond_cmp_3.patch
Description: Binary data


RE: [PATCH, ARM] Generate conditional compares in Thumb2 state

2011-08-08 Thread Jiangning Liu
In the attached new patch, arm_arch_thumb2 is changed to TARGET_THUMB2. I
tried that; with my patch, the command line option -mcpu=armv7-a9 doesn't
generate the IT instruction any longer, unless the option -mthumb is added.

All of my tests assume the command line option -mthumb, while cortex-M0,
cortex-M3 and cortex-M4 are covered by the options -mcpu=cortex-m0,
-march=armv7-m, and -march=armv7e-m respectively.

As for the extra problem exposed by this specific case, may we treat it as a
separate fix to decouple it from this one, and I can follow up later on? I
think it is a general problem, not only for the particular pattern
it/op/it/op. But I'm not sure how far we can go to optimize this kind of
problem introduced by IT blocks. For this specific case, I see that if
conversion already generates a conditional move before the combination pass,
so basically the peephole rules may probably work for most of the general
scenarios. My initial thought is to go over the rules introducing IT blocks
and try to figure out all of the combinations in which two of these kinds of
rules can appear in sequential order. Maybe we only need to handle the most
common case like this one. Since I do see a bunch of rules that have
something to do with this problem, I'd like to look into all of them to give
the most reasonable solution in a separate fix. 

Does it make sense?

Thanks,
-Jiangning

 -Original Message-
 From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org]
 Sent: Friday, August 05, 2011 9:20 AM
 To: Jiangning Liu
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH, ARM] Generate conditional compares in Thumb2 state
 
 On 3 August 2011 08:48, Jiangning Liu jiangning@arm.com wrote:
  This patch is to generate more conditional compare instructions in
 Thumb2
  state. Given an example like below,
 
  int f(int i, int j)
  {
   if ( (i == '+') || (j == '-') ) {
     return i;
   } else {
     return j;
   }
  }
 
  Without the patch, compiler generates the following codes,
 
         sub     r2, r0, #43
         rsbs    r3, r2, #0
         adc     r3, r3, r2
         cmp     r1, #45
         it      eq
         orreq   r3, r3, #1
         cmp     r3, #0
         it      eq
         moveq   r0, r1
         bx      lr
 
  With the patch, compiler can generate conditional jump like below,
 
         cmp     r0, #43
         it      ne
         cmpne   r1, #45
         it      ne
         movne   r0, r1
         bx      lr
 
 
 Nice improvement but there could be a single it block to handle both
 and thus you
 could make this even better with
 
 cmp r0, #43
 itt ne
 cmpne r1 ,#45
 movne r0, r1
 
 The way to do this would be to try and split this post-reload
 unfortunately into the cmp instruction and the conditional compare
 with the appropriate instruction length - Then the backend has a
 chance of merging some of this into a single instruction.
 Unfortunately that won't be very straight-forward but that's a
 direction we probably ought to proceed with in this case.
 
 In a number of places:
 
  +   if (arm_arch_thumb2)
 
 Ah instead of this please use if (TARGET_THUMB2) - arm_arch_thumb2 is
 true based on the architecture levels and not necessarily if the user
 wants to generate Thumb code. I don't want an unnecessary IT
 instruction being emitted in the ASM block in ARM state for v7-a and
 above.
 
  Tested against arm-none-eabi target and no regression found.
 
 Presumably for ARM and Thumb2 state ?
 
 
 cheers
 Ramana


fix_cond_cmp_2.patch
Description: Binary data


[PATCH, ARM] Generate conditional compares in Thumb2 state

2011-08-03 Thread Jiangning Liu
This patch is to generate more conditional compare instructions in Thumb2
state. Given an example like below,

int f(int i, int j)
{
  if ( (i == '+') || (j == '-') ) {
    return i;
  } else {
    return j;
  }
}

Without the patch, the compiler generates the following code,

        sub     r2, r0, #43
        rsbs    r3, r2, #0
        adc     r3, r3, r2
        cmp     r1, #45
        it      eq
        orreq   r3, r3, #1
        cmp     r3, #0
        it      eq
        moveq   r0, r1
        bx      lr

With the patch, the compiler can generate conditional compares like below,

        cmp     r0, #43
        it      ne
        cmpne   r1, #45
        it      ne
        movne   r0, r1
        bx      lr

The patch is essentially to insert an *it* instruction for the following
rules in arm.md,

* cmp_ite0
* cmp_ite1
* cmp_and
* cmp_ior

Tested against arm-none-eabi target and no regression found.

Source code Changelog would be:

2011-07-29  Jiangning Liu  jiangning@arm.com

* config/arm/arm.md (*ior_scc_scc): Enable for Thumb2 as well.
(*ior_scc_scc_cmp): Likewise
(*and_scc_scc): Likewise.
(*and_scc_scc_cmp): Likewise.
(*and_scc_scc_nodom): Likewise.
(*cmp_ite0, *cmp_ite1, *cmp_and, *cmp_ior): Handle Thumb2.

Testsuite Changelog would be:

2011-07-29  Jiangning Liu  jiangning@arm.com

* gcc.target/arm/thumb2-cond-cmp-1.c: New. Make sure conditional 
compare can be generated.
* gcc.target/arm/thumb2-cond-cmp-2.c: Likewise.
* gcc.target/arm/thumb2-cond-cmp-3.c: Likewise.
* gcc.target/arm/thumb2-cond-cmp-4.c: Likewise.

Thanks,
-Jiangning

fix_cond_cmp.patch
Description: Binary data


RE: [RFC] Add middle end hook for stack red zone size

2011-08-01 Thread Jiangning Liu
The answer is ARM can. However, if you look into the bugs PR30282, PR38644,
and PR44199, you may find that in history there were several different cases
in different ports reporting similar failures, covering x86, PowerPC and
ARM. You are right, they were all fixed in back-ends in the past, but we
should fix the bug in a general way to make the GCC infrastructure stronger,
rather than fixing the problem target-by-target and case-by-case! If you
further look into the back-end fixes in the x86 and PowerPC back-ends, you
may find they look quite similar. 

Thanks,
-Jiangning

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
 On Behalf Of Jakub Jelinek
 Sent: Monday, August 01, 2011 5:12 PM
 To: Jiangning Liu
 Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org;
 vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana
 Radhakrishnan; 'Ramana Radhakrishnan'
 Subject: Re: [RFC] Add middle end hook for stack red zone size
 
 On Mon, Aug 01, 2011 at 11:44:04AM +0800, Jiangning Liu wrote:
  It's quite necessary to solve the general problem in middle-end rather
than in
 back-end.
 
 That's what we disagree on.  All back-ends but ARM are able to handle it
 right, why can't ARM too?  The ABI rules for stack handling in the
epilogues
 are simply too diverse and complex to be handled easily in the scheduler.
 
   Jakub






A case that PRE optimization hurts performance

2011-08-01 Thread Jiangning Liu
Hi,

For the following simple test case, PRE optimization hoists computation
(s!=1) into the default branch of the switch statement, and finally causes
very poor code generation. This problem occurs in both X86 and ARM, and I
believe it is also a problem for other targets. 

int f(char *t) {
    int s = 0;

    while (*t && s != 1) {
        switch (s) {
        case 0:
            s = 2;
            break;
        case 2:
            s = 1;
            break;
        default:
            if (*t == '-')
                s = 1;
            break;
        }
        t++;
    }

    return s;
}

Taking X86 as an example, with option -O2 you may find 52 instructions
generated like below,

 <f>:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   57                      push   %edi
   6:   56                      push   %esi
   7:   53                      push   %ebx
   8:   8b 55 08                mov    0x8(%ebp),%edx
   b:   0f b6 0a                movzbl (%edx),%ecx
   e:   84 c9                   test   %cl,%cl
  10:   74 50                   je     62 <f+0x62>
  12:   83 c2 01                add    $0x1,%edx
  15:   85 c0                   test   %eax,%eax
  17:   75 23                   jne    3c <f+0x3c>
  19:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
  20:   0f b6 0a                movzbl (%edx),%ecx
  23:   84 c9                   test   %cl,%cl
  25:   0f 95 c0                setne  %al
  28:   89 c7                   mov    %eax,%edi
  2a:   b8 02 00 00 00          mov    $0x2,%eax
  2f:   89 fb                   mov    %edi,%ebx
  31:   83 c2 01                add    $0x1,%edx
  34:   84 db                   test   %bl,%bl
  36:   74 2a                   je     62 <f+0x62>
  38:   85 c0                   test   %eax,%eax
  3a:   74 e4                   je     20 <f+0x20>
  3c:   83 f8 02                cmp    $0x2,%eax
  3f:   74 1f                   je     60 <f+0x60>
  41:   80 f9 2d                cmp    $0x2d,%cl
  44:   74 22                   je     68 <f+0x68>
  46:   0f b6 0a                movzbl (%edx),%ecx
  49:   83 f8 01                cmp    $0x1,%eax
  4c:   0f 95 c3                setne  %bl
  4f:   89 df                   mov    %ebx,%edi
  51:   84 c9                   test   %cl,%cl
  53:   0f 95 c3                setne  %bl
  56:   89 de                   mov    %ebx,%esi
  58:   21 f7                   and    %esi,%edi
  5a:   eb d3                   jmp    2f <f+0x2f>
  5c:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
  60:   b0 01                   mov    $0x1,%al
  62:   5b                      pop    %ebx
  63:   5e                      pop    %esi
  64:   5f                      pop    %edi
  65:   5d                      pop    %ebp
  66:   c3                      ret
  67:   90                      nop
  68:   b8 01 00 00 00          mov    $0x1,%eax
  6d:   5b                      pop    %ebx
  6e:   5e                      pop    %esi
  6f:   5f                      pop    %edi
  70:   5d                      pop    %ebp
  71:   c3                      ret

But with the command line options -O2 -fno-tree-pre, only 12 instructions
are generated, and the code is very clean, like below,

<f>:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   8b 55 08                mov    0x8(%ebp),%edx
   8:   80 3a 00                cmpb   $0x0,(%edx)
   b:   74 0e                   je     1b <f+0x1b>
   d:   80 7a 01 00             cmpb   $0x0,0x1(%edx)
  11:   b0 02                   mov    $0x2,%al
  13:   ba 01 00 00 00          mov    $0x1,%edx
  18:   0f 45 c2                cmovne %edx,%eax
  1b:   5d                      pop    %ebp
  1c:   c3                      ret

Do you have any idea about this?

Thanks,
-Jiangning





RE: [RFC] Add middle end hook for stack red zone size

2011-08-01 Thread Jiangning Liu
Hi Jakub,

I appreciate your valuable comments!

I think the SPARC V9 ABI doesn't have a red zone defined, right? So
stack_red_zone_size would be defined as zero by default, and the scheduler
would block moving memory accesses across a stack adjustment no matter what
the offset is. I don't see any risk here. Also, my patch uses the *abs* of
the offset to avoid the opposite-stack-direction issue you mentioned.

People like you insist on ABI diversity, and I actually agree with you on
this. But part of the ABI definition is general across all targets: memory
accesses beyond the stack red zone must be avoided. That is the general part
of the ABI that the compiler should guarantee, and for this general part the
middle end should take responsibility.

Thanks,
-Jiangning

 -Original Message-
 From: Jakub Jelinek [mailto:ja...@redhat.com]
 Sent: Monday, August 01, 2011 6:31 PM
 To: Jiangning Liu
 Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org;
 vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana
 Radhakrishnan; 'Ramana Radhakrishnan'
 Subject: Re: [RFC] Add middle end hook for stack red zone size
 
 On Mon, Aug 01, 2011 at 06:14:27PM +0800, Jiangning Liu wrote:
  ARM. You are right, they were all fixed in back-ends in the past, but we
  should fix the bug in a general way to make GCC infrastructure stronger,
  rather than fixing the problem target-by-target and case-by-case! If you
  further look into the back-end fixes in x86 and PowerPC, you may find they
  look quite similar in back-ends.
 
 
 Red zone is only one difficulty, your patch is e.g. completely ignoring
 existence of biased stack pointers (e.g. SPARC -m64 has them).
 Some targets have stack growing in opposite direction, etc.
 We have really a huge amount of very diverse ABIs and making the
 middle-end grok what is an invalid stack access is difficult.
 
   Jakub






RE: [RFC] Add middle end hook for stack red zone size

2011-07-31 Thread Jiangning Liu
Joern,

Thanks for your valuable feedback! This is only an RFC, and I will send out
a formal patch along with a ChangeLog later on.

Basically, my patch only adds new dependences in the scheduler, and it only
blocks some instruction movements, so there is NO RISK to compiler
correctness. For whatever stack pointer changes you described in the
different scenarios, the current code base should already work. My patch
intends neither to replace the old dependences, nor to maximize the
scheduler's capability given the existence of a red zone in the stack. It is
only to block moving a memory access over a stack pointer adjustment if the
distance is beyond the red zone size, which is an OS requirement due to the
existence of interrupts.

Stack adjustment in the epilogue is a very general pattern in stack frame
handling. It's quite necessary to solve the general problem in the middle
end rather than in the back end. Also, the old patch you attached solves a
data dependence between two memory accesses, but the stack pointer doesn't
really have a data dependence on a memory access that doesn't use the stack
pointer, so they are different stories. An alternative solution that avoids
adding a blunt scheduling barrier is to insert an independent pass before
the scheduler that creates RTL barriers using the same interface
stack_red_zone_size, but that would really be over-design if we added a new
pass only for this *small* functionality.

In my patch, the *abs* of the offset is used, so you are right that false
positives are possible and the result may be too conservative, but there
will be no false negatives, because my code only adds new dependences.
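
As a toy model of that rule (illustrative C only, not the patch itself),
the check amounts to the following, where taking the absolute value of the
offset is exactly what makes it conservative for both stack directions:

#include <stdbool.h>
#include <stdlib.h>

/* Toy model of the new scheduler dependence (illustrative only): an
   interrupt may clobber memory beyond the stack pointer plus the red zone
   at any time, so a memory access must not move across a stack pointer
   adjustment unless its sp-relative offset stays inside the red zone.  */
static bool may_move_across_sp_adjust (long access_offset,
                                       long red_zone_size)
{
  /* Using the absolute offset can only forbid a move (a false positive),
     never wrongly allow one (a false negative).  */
  return labs (access_offset) <= red_zone_size;
}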

Since compilation is done per function, it would be OK if the red zone size
varies due to different ABIs. Could you please tell me exactly on which
system supported by GCC the red zone size can be different for incoming and
outgoing, and how the scheduler guarantees correctness in the current code
base? Anyway, I don't think my patch will break the original solution.

Thanks,
-Jiangning

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
 On Behalf Of Joern Rennecke
 Sent: Tuesday, July 26, 2011 10:33 AM
 To: Jiangning Liu
 Cc: gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; vmaka...@redhat.com;
 dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana
 Radhakrishnan'
 Subject: RE: [RFC] Add middle end hook for stack red zone size
 
 Quoting Jiangning Liu jiangning@arm.com:
 
  Hi,
 
  One month ago, I sent out this RFC to *gcc-patches* mail list, but I
   didn't receive any response yet. So I'm forwarding this mail to
  *gcc* mail list. Can anybody here really give feedback to me?
 
 Well, I couldn't approve any patch, but I can point out some issues with
 your patch.
 
 First, it's missing a ChangeLog, and you don't state how you have tested it.
 And regarding the code in sched_analyze_1, I think you'll get false
 positives with alloca, and false negatives when registers are involved to
 compute offsets or to restore the stack pointer from.
 
 FWIW, I think generally blunt scheduling barriers should be avoided, and 
 instead
 the dependencies made visible to the scheduler.
 E.g., I've been working with another architecture with a redzone, where at
 -fno-omit-frame-pointer, the prologue can put pretend_args into the
 redzone, then after stack adjustment and frame allocation, these arguments
 are accessed via the frame pointer.
 
 With the attached patchlet, alias analysis works for this situation too,
 so no blunt scheduling block is required.
 
 Likewise, with stack adjustments, they should not affect scheduling in
 general, but be considered to clobber the part of the frame that is being
 exposed to interrupt writes either before or after the adjustment.
 At the moment, each port that wants to have such selective scheduling
 blockages has to define a stack_adjust pattern with a memory clobber in a
 parallel, with a memref that shows the high water mark of possible
 interrupt stack writes.
 Prima facie it would seem convenient if you only had to tell the
 middle-end about the redzone size, and it could figure out the implicit
 clobbers when the stack is changed.  However, when a big stack adjustment
 is being made, or the stack is realigned, or restored from the frame
 pointer / another register where it was saved due to realignment, the
 adjustment is not so obvious.  I'm not sure if you can actually create a
 robust interface that's simpler to use than putting the right memory
 clobber in the stack adjust pattern.  Note also that the redzone size can
 vary from function to function depending on ABI-altering attributes, in
 particular for interrupt functions, which can also have different incoming
 and outgoing redzone sizes.  Plus, you can have an NMI / reset handler
 which can use the stack like an ordinary address register.





RE: [RFC] Add middle end hook for stack red zone size

2011-07-25 Thread Jiangning Liu
Hi,

One month ago, I sent this RFC to the *gcc-patches* mailing list, but I
haven't received any response yet. So I'm forwarding this mail to the *gcc*
mailing list. Can anybody here give me feedback?

I appreciate your help in advance!

-Jiangning

-Original Message-
From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org] 
Sent: Tuesday, July 19, 2011 6:18 PM
To: Jiangning Liu
Cc: gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard 
Henderson; Ramana Radhakrishnan
Subject: Re: [RFC] Add middle end hook for stack red zone size

2011/7/19 Jiangning Liu jiangning@arm.com:

 I see a lot of feedback on other posts, but mine still has ZERO responses
 after 3 weeks, so I'm wondering if I made any mistake in my process? Who
 can help me?

It would be worth CC'ing the other relevant target maintainers as well
to get some feedback since the patch touches ARM, x86 and Powerpc.
I've added the maintainers for i386 and PPC to the CC list using the
email addresses from the MAINTAINERS file.

Thanks,
Ramana


 Thanks,
 -Jiangning

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
 On Behalf Of Jiangning Liu
 Sent: Tuesday, July 05, 2011 8:32 AM
 To: gcc-patc...@gcc.gnu.org; rgue...@gcc.gnu.org
 Subject: RE: [RFC] Add middle end hook for stack red zone size

 PING...

 I just merged with the latest code base and generated new patch as attached.

 Thanks,
 -Jiangning

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
 Sent: June 28, 2011 4:38 PM
 To: gcc-patc...@gcc.gnu.org
 Subject: [RFC] Add middle end hook for stack red zone size

 This patch is to fix PR38644, which is a bug with long history about
 stack red zone access, and PR30282 is correlated.

 Originally red zone concept is not exposed to middle-end, and back-end
 uses special logic to add extra memory barrier RTL and help the
 correct dependence in middle-end. This way different back-ends must
 handle red zone problem by themselves. For example, X86 target
 introduced function
 ix86_using_red_zone() to judge red zone access, while POWER introduced
 offset_below_red_zone_p() to judge it. Note that they have different
 semantics, but the logic in caller sites of back-end uses them to
 decide whether adding memory barrier RTL or not. If back-end
 incorrectly handles this, bug would be introduced.

 Therefore, the correct method should be middle-end handles red zone
 related things to avoid the burden in different back-ends. To be
 specific for PR38644, this middle-end problem causes incorrect
 behavior for ARM target.
 This patch exposes red zone concept to middle-end by introducing a
 middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE defined in
 target.def, and by default its value is 0. Back-end may redefine this
 function to provide concrete red zone size according to specific ABI
 requirements.

 In middle end, scheduling dependence is modified by using this hook
 plus checking stack frame pointer adjustment instruction to decide
 whether memory references need to be all flushed out or not. In
 theory, if TARGET_STACK_RED_ZONE_SIZE is defined correctly, back-end
 would not be required to specially handle this scheduling dependence
 issue by introducing extra memory barrier RTL.

 In back-end, the following changes are made to define the hook,
 1) For X86, TARGET_STACK_RED_ZONE_SIZE is redefined to be
 ix86_stack_red_zone_size() in i386.c, which is a newly introduced
 function.
 2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to be
 rs6000_stack_red_zone_size() in rs6000.c, which is also a newly
 defined function.
 3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to be
 default_stack_red_zone_size in targhooks.c, and this function returns
 0, which means ARM eabi and others don't support red zone access at all.

 In summary, the relationship between ABI and red zone access is like
 below,

 ------------------------------------------------------------------
 |     ARCH      |  ARM  |      X86      |     POWER     | others |
 |---------------|-------|-------|-------|-------|-------|--------|
 |      ABI      | EABI  | MS_64 | other |  AIX  |  V4   |        |
 |---------------|-------|-------|-------|-------|-------|--------|
 |   RED ZONE    |  No   |  YES  |  No   |  YES  |  No   |   No   |
 |---------------|-------|-------|-------|-------|-------|--------|
 | RED ZONE SIZE |   0   |  128  |   0   |220/288|   0   |   0    |
 ------------------------------------------------------------------

 Thanks,
 -Jiangning










RE: [RFC] Add middle end hook for stack red zone size

2011-07-04 Thread Jiangning Liu
PING...

I just merged with the latest code base and generated a new patch, as attached.

Thanks,
-Jiangning

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
 Sent: June 28, 2011 4:38 PM
 To: gcc-patches@gcc.gnu.org
 Subject: [RFC] Add middle end hook for stack red zone size
 
 This patch is to fix PR38644, which is a bug with long history about
 stack red zone access, and PR30282 is correlated.
 
 Originally red zone concept is not exposed to middle-end, and back-end
 uses special logic to add extra memory barrier RTL and help the correct
 dependence in middle-end. This way different back-ends must handle red
 zone problem by themselves. For example, X86 target introduced function
 ix86_using_red_zone() to judge red zone access, while POWER introduced
 offset_below_red_zone_p() to judge it. Note that they have different
 semantics, but the logic in caller sites of back-end uses them to
 decide whether adding memory barrier RTL or not. If back-end
 incorrectly handles this, bug would be introduced.
 
 Therefore, the correct method should be middle-end handles red zone
 related things to avoid the burden in different back-ends. To be
 specific for PR38644, this middle-end problem causes incorrect behavior
 for ARM target.
 This patch exposes red zone concept to middle-end by introducing a
 middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE defined in
 target.def, and by default its value is 0. Back-end may redefine this
 function to provide concrete red zone size according to specific ABI
 requirements.
 
 In middle end, scheduling dependence is modified by using this hook
 plus checking stack frame pointer adjustment instruction to decide
 whether memory references need to be all flushed out or not. In theory,
 if TARGET_STACK_RED_ZONE_SIZE is defined correctly, back-end would not
 be required to specially handle this scheduling dependence issue by
 introducing extra memory barrier RTL.
 
 In back-end, the following changes are made to define the hook,
 1) For X86, TARGET_STACK_RED_ZONE_SIZE is redefined to be
 ix86_stack_red_zone_size() in i386.c, which is a newly introduced
 function.
 2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to be
 rs6000_stack_red_zone_size() in rs6000.c, which is also a newly defined
 function.
 3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to be
 default_stack_red_zone_size in targhooks.c, and this function returns 0,
 which means ARM eabi and others don't support red zone access at all.
 
 In summary, the relationship between ABI and red zone access is like
 below,
 
 ------------------------------------------------------------------
 |     ARCH      |  ARM  |      X86      |     POWER     | others |
 |---------------|-------|-------|-------|-------|-------|--------|
 |      ABI      | EABI  | MS_64 | other |  AIX  |  V4   |        |
 |---------------|-------|-------|-------|-------|-------|--------|
 |   RED ZONE    |  No   |  YES  |  No   |  YES  |  No   |   No   |
 |---------------|-------|-------|-------|-------|-------|--------|
 | RED ZONE SIZE |   0   |  128  |   0   |220/288|   0   |   0    |
 ------------------------------------------------------------------
 
 Thanks,
 -Jiangning

stack-red-zone-patch-38644-4.patch
Description: Binary data


[RFC] Add middle end hook for stack red zone size

2011-06-28 Thread Jiangning Liu
This patch is to fix PR38644, a long-standing bug concerning stack red zone
access; PR30282 is related.

Originally, the red zone concept was not exposed to the middle end, and each
back end used special logic to add extra memory barrier RTL to enforce the
correct dependences in the middle end. This way, different back ends must
handle the red zone problem by themselves. For example, the X86 target
introduced the function ix86_using_red_zone() to judge red zone access,
while POWER introduced offset_below_red_zone_p() to judge it. Note that they
have different semantics, but the logic at the caller sites in the back end
uses them to decide whether to add memory barrier RTL or not. If a back end
handles this incorrectly, a bug is introduced.

Therefore, the correct method is for the middle end to handle red zone
related things, avoiding the burden on the different back ends. To be
specific, for PR38644 this middle-end problem causes incorrect behavior for
the ARM target. This patch exposes the red zone concept to the middle end by
introducing a middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE, defined
in target.def, whose value is 0 by default. A back end may redefine this
function to provide the concrete red zone size according to its specific
ABI requirements.

In the middle end, scheduling dependence is modified by using this hook,
plus checking for stack frame pointer adjustment instructions, to decide
whether all pending memory references need to be flushed out. In theory, if
TARGET_STACK_RED_ZONE_SIZE is defined correctly, a back end would no longer
be required to specially handle this scheduling dependence issue by
introducing extra memory barrier RTL.

In the back end, the following changes are made to define the hook,
1) For X86, TARGET_STACK_RED_ZONE_SIZE is redefined to be
ix86_stack_red_zone_size() in i386.c, which is a newly introduced function.
2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to be
rs6000_stack_red_zone_size() in rs6000.c, which is also a newly defined
function.
3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to be
default_stack_red_zone_size in targhooks.c, and this function returns 0,
which means ARM EABI and the others don't support red zone access at all.
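
To make the shape of the hook concrete, here is a minimal sketch. The names
default_stack_red_zone_size and ix86_stack_red_zone_size are the ones
described above; the exact x86 condition shown is my assumption based on
the table below, and the real definitions are in the attached patch.

/* targhooks.c: default definition -- no red zone, so the scheduler must
   not move memory accesses across any stack pointer adjustment.  */
int
default_stack_red_zone_size (void)
{
  return 0;
}

/* i386.c: sketch of the x86 definition, following the table below --
   128 bytes for the 64-bit SysV ABI, 0 for MS_64 and for 32-bit.  */
static int
ix86_stack_red_zone_size (void)
{
  if (TARGET_64BIT && ix86_cfun_abi () != MS_ABI)
    return 128;
  return 0;
}

#undef TARGET_STACK_RED_ZONE_SIZE
#define TARGET_STACK_RED_ZONE_SIZE ix86_stack_red_zone_size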

In summary, the relationship between ABI and red zone access is as below,

------------------------------------------------------------------
|     ARCH      |  ARM  |      X86      |     POWER     | others |
|---------------|-------|-------|-------|-------|-------|--------|
|      ABI      | EABI  | MS_64 | other |  AIX  |  V4   |        |
|---------------|-------|-------|-------|-------|-------|--------|
|   RED ZONE    |  No   |  YES  |  No   |  YES  |  No   |   No   |
|---------------|-------|-------|-------|-------|-------|--------|
| RED ZONE SIZE |   0   |  128  |   0   |220/288|   0   |   0    |
------------------------------------------------------------------

Thanks,
-Jiangning

stack-red-zone-patch-38644-3.patch
Description: Binary data