An issue on loop optimization/vectorization

2018-07-11 Thread jiangning liu
For the case below, the code generated by “gcc -O3” is very ugly.



char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)

{

char *d = g_d, *s1 = g_s1, *s2 = g_s2;



for( int y = 0; y < 128; y++ )

{

for( int x = 0; x < 16; x++ )

d[x] = s1[x] + s2[x];

d += 16;

}

}



If we change “for( int x = 0; x < 16; x++ )” to be like “for( int x = 0; x
< 32; x++ )”, very beautiful vectorization code would be generated,



test_loop:

.LFB0:

.cfi_startproc

adrpx2, g_s1

adrpx3, g_s2

add x2, x2, :lo12:g_s1

add x3, x3, :lo12:g_s2

adrpx0, g_d

adrpx1, g_d+2048

add x0, x0, :lo12:g_d

add x1, x1, :lo12:g_d+2048

ldp q1, q2, [x2]

ldp q3, q0, [x3]

add v1.16b, v1.16b, v3.16b

add v0.16b, v0.16b, v2.16b

.p2align 3,,7

.L2:

str q1, [x0]

str q0, [x0, 16]!

cmp x0, x1

bne .L2

ret


The code generated for " for( int x = 0; x < 8; x++ )" is also very ugly.


It looks gcc has potential bugs on loop vectorization. Any idea?



Thanks,

-Jiangning


RE: A question about loop ivopt

2012-05-22 Thread Jiangning Liu


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, May 22, 2012 6:36 PM
> To: Jiangning Liu
> Cc: Zdenek Dvorak; Jiangning Liu; gcc@gcc.gnu.org
> Subject: Re: A question about loop ivopt
> 
> On Tue, May 22, 2012 at 11:19 AM, Jiangning Liu 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> >> Sent: Tuesday, May 15, 2012 10:17 PM
> >> To: Zdenek Dvorak
> >> Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu
> >> Subject: Re: A question about loop ivopt
> >>
> >> On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak
> >>  wrote:
> >> > Hi,
> >> >
> >> >> > > > Why can't we replace function force_expr_to_var_cost
> directly
> >> with
> >> >> function
> >> >> > > > computation_cost in tree-ssa-loop-ivopt.c?
> >> >> > > >
> >> >> > > > Actually I think it is inaccurate for the current recursive
> >> algorithm
> >> >> in
> >> >> > > > force_expr_to_var_cost to estimate expr cost. Instead
> >> >> computation_cost can
> >> >> > > > count some back-end factors and make the estimation more
> >> accurate.
> >> >> > > >
> >> >> > > > For example, using computation_cost, we may check whether
> >> back-ends
> >> >> can tie
> >> >> > > > some modes transformation expressed by SUBREG or not. If we
> >> use
> >> >> > > > force_expr_to_var_cost, some more computations around type
> >> >> > > > promotion/demotion would increase the cost estimated.
> >> >> > > >
> >> >> > > > Looking at the algorithm in force_expr_to_var_cost, it is
> just
> >> to
> >> >> analyze
> >> >> > > > the operator in the expression and give estimation. Should
> it
> >> be the
> >> >> same as
> >> >> > > > expanding EXPR to RTX and give estimation like in
> >> computation_cost?
> >> >> > > >
> >> >> > > > Any thoughts?
> >> >> > >
> >> >> > > I suppose Zdenek may remember.
> >> >> >
> >> >> > computation_cost actually expands the expression to RTL.  Since
> >> cost
> >> >> estimates
> >> >> > are computed a lot in ivopts, using it directly would require a
> >> huge
> >> >> amount of memory,
> >> >>
> >> >> Zdenek,
> >> >>
> >> >> Do you know how huge is it? Any data on this? For GCC, is this
> >> "huge"
> >> >> memory consumption is critical enough, and there aren't any other
> >> else
> >> >> consuming even more memory?
> >> >
> >> > no, not really (I haven't worked on this for a few years now),
> >> although
> >> > of course I did some measurements when ivopts were created.  Feel
> >> free
> >> > to experiment with that, if it turned out that the memory
> consumption
> >> > and extra time spent by it is negligible, it would be a nice
> cleanup.
> >>
> >> Well, I don't think we should feed arbitrary expressions to expand
> at
> >> IVOPTs time.  What probably matters most is address costs, no?
> >> At least that is where expand probably makes the most difference.
> >> So why not improve force_expr_to_var_cost instead?
> >
> > OK, yes, the thing that matter most is just address cost, so I can
> improve
> > force_expr_to_var_cost.
> >
> > Would it sound OK if I expose MODES_TIEABLE_P to middle-end by
> defining a
> > new target hook? I need this function to strip some operations and
> make the
> > cost estimate more accurate. If I don't expand to RTL, I would need a
> method
> > to check the modes conversion in middle end, anyway. Any idea?
> 
> You are already in the middle-end and thus can use MODES_TIABLE_P
> directly.  Modes are also available on gimple variables via
> DECL/TYPE_MODE.

Richard,

But MODES_TIEABLE_P is a macro hook and isn't exposed to TREE level, so I
would have to modify xxx-protos.h for all back-ends.

An alternative way is I define a new function hook. This way I needn't to
change all back-ends, but support several back-ends required first.

Which solution is usually preferred?

Thanks,
-Jiangning

> 
> Richard.
> 
> > Thanks,
> > -Jiangning
> >
> >>
> >> Richard.
> >>
> >>
> >> > Zdenek
> >
> >
> >
> >






RE: A question about loop ivopt

2012-05-22 Thread Jiangning Liu


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, May 15, 2012 10:17 PM
> To: Zdenek Dvorak
> Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu
> Subject: Re: A question about loop ivopt
> 
> On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak
>  wrote:
> > Hi,
> >
> >> > > > Why can't we replace function force_expr_to_var_cost directly
> with
> >> function
> >> > > > computation_cost in tree-ssa-loop-ivopt.c?
> >> > > >
> >> > > > Actually I think it is inaccurate for the current recursive
> algorithm
> >> in
> >> > > > force_expr_to_var_cost to estimate expr cost. Instead
> >> computation_cost can
> >> > > > count some back-end factors and make the estimation more
> accurate.
> >> > > >
> >> > > > For example, using computation_cost, we may check whether
> back-ends
> >> can tie
> >> > > > some modes transformation expressed by SUBREG or not. If we
> use
> >> > > > force_expr_to_var_cost, some more computations around type
> >> > > > promotion/demotion would increase the cost estimated.
> >> > > >
> >> > > > Looking at the algorithm in force_expr_to_var_cost, it is just
> to
> >> analyze
> >> > > > the operator in the expression and give estimation. Should it
> be the
> >> same as
> >> > > > expanding EXPR to RTX and give estimation like in
> computation_cost?
> >> > > >
> >> > > > Any thoughts?
> >> > >
> >> > > I suppose Zdenek may remember.
> >> >
> >> > computation_cost actually expands the expression to RTL.  Since
> cost
> >> estimates
> >> > are computed a lot in ivopts, using it directly would require a
> huge
> >> amount of memory,
> >>
> >> Zdenek,
> >>
> >> Do you know how huge is it? Any data on this? For GCC, is this
> "huge"
> >> memory consumption is critical enough, and there aren't any other
> else
> >> consuming even more memory?
> >
> > no, not really (I haven't worked on this for a few years now),
> although
> > of course I did some measurements when ivopts were created.  Feel
> free
> > to experiment with that, if it turned out that the memory consumption
> > and extra time spent by it is negligible, it would be a nice cleanup.
> 
> Well, I don't think we should feed arbitrary expressions to expand at
> IVOPTs time.  What probably matters most is address costs, no?
> At least that is where expand probably makes the most difference.
> So why not improve force_expr_to_var_cost instead?

OK, yes, the thing that matter most is just address cost, so I can improve
force_expr_to_var_cost.

Would it sound OK if I expose MODES_TIEABLE_P to middle-end by defining a
new target hook? I need this function to strip some operations and make the
cost estimate more accurate. If I don't expand to RTL, I would need a method
to check the modes conversion in middle end, anyway. Any idea?

Thanks,
-Jiangning

> 
> Richard.
> 
> 
> > Zdenek






A question about loop ivopt

2012-05-14 Thread Jiangning Liu
Hi,

Why can't we replace function force_expr_to_var_cost directly with function
computation_cost in tree-ssa-loop-ivopt.c?

Actually I think it is inaccurate for the current recursive algorithm in
force_expr_to_var_cost to estimate expr cost. Instead computation_cost can
count some back-end factors and make the estimation more accurate.

For example, using computation_cost, we may check whether back-ends can tie
some modes transformation expressed by SUBREG or not. If we use
force_expr_to_var_cost, some more computations around type
promotion/demotion would increase the cost estimated.

Looking at the algorithm in force_expr_to_var_cost, it is just to analyze
the operator in the expression and give estimation. Should it be the same as
expanding EXPR to RTX and give estimation like in computation_cost?

Any thoughts?

Thanks,
-Jiangning





Why can't copy renaming capture this assignment?

2012-03-30 Thread Jiangning Liu
Hi,

For this small case, 

char garr[100];
void f(void)
{
unsigned short h, s;

s = 20;

for (h = 1; h < (s-1); h++)
{
garr[h] = 0;
}
}

After copyrename3, we have the following dump,

f ()
{
  short unsigned int h;
  int D.4066;

:
  D.4066_14 = 1;
  if (D.4066_14 <= 18)
goto ;
  else
goto ;

:
  # h_15 = PHI 
  # D.4066_16 = PHI 
  garr[D.4066_16] = 0;
  h_8 = h_15 + 1;
  D.4066_4 = (int) h_8;
  if (D.4066_4 <= 18)
goto ;
  else
goto ;

:
  return;

}

copy renaming fails to capture the assignment statement "D.4066_4 = (int)
h_8;" to trigger renaming partition coalesce. 

I find gimple_assign_single_p invoked by gimple_assign_ssa_name_copy_p
always returns false, because for this statement " gs->gsbase.subcode" is
NOP_EXPR rather than GIMPLE_SINGLE_RHS.

Should subcode be correctly initialized anywhere to fix this problem?

BTW, my expectation after copy renaming is like below, 

f ()
{
  int D.4679;

:
  D.4679_7 = 1;
  if (D.4679_7 != 19)
goto ;
  else
goto ;

:
  # D.4679_15 = PHI 
  # D.4679_17 = PHI 
  garr[D.4679_15] = 0;
  D.4679_14 = D.4679_17 + 1;
  D.4679_4 = D.4679_14;
  if (D.4679_4 != 19)
goto ;
  else
goto ;

:
  return;

}

and then PRE can finally remove that redundancy for symbol D. away.

Thanks,
-Jiangning







RE: A question about redundant PHI expression stmt

2012-02-28 Thread Jiangning Liu


> -Original Message-
> From: Jiangning Liu
> Sent: Tuesday, February 28, 2012 4:07 PM
> To: Jiangning Liu; 'William J. Schmidt'
> Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
> Subject: RE: A question about redundant PHI expression stmt
> 
> 
> 
> > -Original Message-
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
> Of
> > Jiangning Liu
> > Sent: Tuesday, February 28, 2012 11:19 AM
> > To: 'William J. Schmidt'
> > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
> > Subject: RE: A question about redundant PHI expression stmt
> >
> >
> >
> > > -Original Message-
> > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
> Behalf
> > Of
> > > William J. Schmidt
> > > Sent: Monday, February 27, 2012 11:32 PM
> > > To: Jiangning Liu
> > > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
> > > Subject: Re: A question about redundant PHI expression stmt
> > >
> > > On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:
> > > > Hi,
> > > >
> > > > For the small case below, there are some redundant PHI expression
> > > stmt
> > > > generated, and finally cause back-end generates redundant copy
> > > instructions
> > > > due to some reasons around IRA.
> > > >
> > > > int *l, *r, *g;
> > > > void test_func(int n)
> > > > {
> > > > int i;
> > > > static int j;
> > > > static int pos, direction, direction_pre;
> > > >
> > > > pos = 0;
> > > > direction = 1;
> > > >
> > > > for ( i = 0; i < n; i++ )
> > > > {
> > > > direction_pre = direction;
> > > >
> > > > for ( j = 0; j <= 400; j++ )
> > > > {
> > > >
> > > > if ( g[pos] == 200 )
> > > > break;
> > > >
> > > > if ( direction == 0 )
> > > > pos = l[pos];
> > > > else
> > > > pos = r[pos];
> > > >
> > > > if ( pos == -1 )
> > > > {
> > > > if ( direction_pre != direction )
> > > > break;
> > > >
> > > > pos = 0;
> > > > direction = !direction;
> > > > }
> > > >
> > > > }
> > > >
> > > > f(g[pos]);
> > > > }
> > > > }
> > > >
> > > > I know PR39976 has something to do with this case, and check-in
> > > r182140
> > > > caused big degradation on the real benchmark, but I'm still
> > confusing
> > > around
> > > > this issue.
> > > >
> > > > The procedure is like this,
> > > >
> > > > Loop store motion generates code below,
> > > >
> > > > :
> > > >   D.4084_17 = l.4_13 + D.4077_70;
> > > >   pos.5_18 = *D.4084_17;
> > > >   pos_lsm.34_103 = pos.5_18;
> > > >   goto ;
> > > >
> > > > :
> > > >   D.4088_23 = r.6_19 + D.4077_70;
> > > >   pos.7_24 = *D.4088_23;
> > > >   pos_lsm.34_104 = pos.7_24;
> > > >
> > > > :
> > > >   # prephitmp.29_89 = PHI 
> > > >   # pos_lsm.34_53 = PHI 
> > > >   pos.2_25 = prephitmp.29_89;
> > > >   if (pos.2_25 == -1)
> > > > goto ;
> > > >   else
> > > > goto ;
> > > >
> > > > And then, copy propagation transforms block 8 it into
> > > >
> > > > :
> > > >   # prephitmp.29_89 = PHI 
> > > >   # pos_lsm.34_53 = PHI 
> > > >   ...
> > > >
> > > > And then, these two duplicated PHI stmts have been kept until
> > expand
> > > pass.
> > > >
> > > > In dom2, a stmt like below
> > > >
> > > >   # pos_lsm.34_54 = PHI 
> > > >
> > > > is transformed into
> > > >
> > &

RE: A question about redundant PHI expression stmt

2012-02-28 Thread Jiangning Liu


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Jiangning Liu
> Sent: Tuesday, February 28, 2012 11:19 AM
> To: 'William J. Schmidt'
> Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
> Subject: RE: A question about redundant PHI expression stmt
> 
> 
> 
> > -Original Message-
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
> Of
> > William J. Schmidt
> > Sent: Monday, February 27, 2012 11:32 PM
> > To: Jiangning Liu
> > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
> > Subject: Re: A question about redundant PHI expression stmt
> >
> > On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:
> > > Hi,
> > >
> > > For the small case below, there are some redundant PHI expression
> > stmt
> > > generated, and finally cause back-end generates redundant copy
> > instructions
> > > due to some reasons around IRA.
> > >
> > > int *l, *r, *g;
> > > void test_func(int n)
> > > {
> > >   int i;
> > >   static int j;
> > >   static int pos, direction, direction_pre;
> > >
> > >   pos = 0;
> > >   direction = 1;
> > >
> > >   for ( i = 0; i < n; i++ )
> > >   {
> > >   direction_pre = direction;
> > >
> > >   for ( j = 0; j <= 400; j++ )
> > >   {
> > >
> > >   if ( g[pos] == 200 )
> > >   break;
> > >
> > >   if ( direction == 0 )
> > >   pos = l[pos];
> > >   else
> > >   pos = r[pos];
> > >
> > >   if ( pos == -1 )
> > >   {
> > >   if ( direction_pre != direction )
> > >   break;
> > >
> > >   pos = 0;
> > >   direction = !direction;
> > >   }
> > >
> > >   }
> > >
> > >   f(g[pos]);
> > >   }
> > > }
> > >
> > > I know PR39976 has something to do with this case, and check-in
> > r182140
> > > caused big degradation on the real benchmark, but I'm still
> confusing
> > around
> > > this issue.
> > >
> > > The procedure is like this,
> > >
> > > Loop store motion generates code below,
> > >
> > > :
> > >   D.4084_17 = l.4_13 + D.4077_70;
> > >   pos.5_18 = *D.4084_17;
> > >   pos_lsm.34_103 = pos.5_18;
> > >   goto ;
> > >
> > > :
> > >   D.4088_23 = r.6_19 + D.4077_70;
> > >   pos.7_24 = *D.4088_23;
> > >   pos_lsm.34_104 = pos.7_24;
> > >
> > > :
> > >   # prephitmp.29_89 = PHI 
> > >   # pos_lsm.34_53 = PHI 
> > >   pos.2_25 = prephitmp.29_89;
> > >   if (pos.2_25 == -1)
> > > goto ;
> > >   else
> > > goto ;
> > >
> > > And then, copy propagation transforms block 8 it into
> > >
> > > :
> > >   # prephitmp.29_89 = PHI 
> > >   # pos_lsm.34_53 = PHI 
> > >   ...
> > >
> > > And then, these two duplicated PHI stmts have been kept until
> expand
> > pass.
> > >
> > > In dom2, a stmt like below
> > >
> > >   # pos_lsm.34_54 = PHI 
> > >
> > > is transformed into
> > >
> > >   # pos_lsm.34_54 = PHI 
> > >
> > > just because they are equal.
> > >
> > > Unfortunately, this transformation changed back-end behavior to
> > generate
> > > redundant copy instructions and hurt performance. This case is from
> a
> > real
> > > benchmark and hurt performance a lot.
> > >
> > > Does the fix in r182140 intend to totally clean up this kind of
> > redundancy?
> > > Where should we get it fixed?
> > >
> >
> > Hi, sorry not to have responded sooner -- I just now got some time to
> > look at this.
> >
> > I compiled your code with -O3 for revisions 182139 and 182140 of GCC
> > trunk, and found no difference between the code produced by the
> middle
> > end for the two versions.  So something else has apparently come
> along
> > since then that helped produce the problematic code generation for
> you.
> > Either that or I need to

RE: A question about redundant PHI expression stmt

2012-02-27 Thread Jiangning Liu


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> William J. Schmidt
> Sent: Monday, February 27, 2012 11:32 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
> Subject: Re: A question about redundant PHI expression stmt
> 
> On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:
> > Hi,
> >
> > For the small case below, there are some redundant PHI expression
> stmt
> > generated, and finally cause back-end generates redundant copy
> instructions
> > due to some reasons around IRA.
> >
> > int *l, *r, *g;
> > void test_func(int n)
> > {
> > int i;
> > static int j;
> > static int pos, direction, direction_pre;
> >
> > pos = 0;
> > direction = 1;
> >
> > for ( i = 0; i < n; i++ )
> > {
> > direction_pre = direction;
> >
> > for ( j = 0; j <= 400; j++ )
> > {
> >
> > if ( g[pos] == 200 )
> > break;
> >
> > if ( direction == 0 )
> > pos = l[pos];
> > else
> > pos = r[pos];
> >
> > if ( pos == -1 )
> > {
> > if ( direction_pre != direction )
> > break;
> >
> > pos = 0;
> > direction = !direction;
> > }
> >
> > }
> >
> > f(g[pos]);
> > }
> > }
> >
> > I know PR39976 has something to do with this case, and check-in
> r182140
> > caused big degradation on the real benchmark, but I'm still confusing
> around
> > this issue.
> >
> > The procedure is like this,
> >
> > Loop store motion generates code below,
> >
> > :
> >   D.4084_17 = l.4_13 + D.4077_70;
> >   pos.5_18 = *D.4084_17;
> >   pos_lsm.34_103 = pos.5_18;
> >   goto ;
> >
> > :
> >   D.4088_23 = r.6_19 + D.4077_70;
> >   pos.7_24 = *D.4088_23;
> >   pos_lsm.34_104 = pos.7_24;
> >
> > :
> >   # prephitmp.29_89 = PHI 
> >   # pos_lsm.34_53 = PHI 
> >   pos.2_25 = prephitmp.29_89;
> >   if (pos.2_25 == -1)
> > goto ;
> >   else
> > goto ;
> >
> > And then, copy propagation transforms block 8 it into
> >
> > :
> >   # prephitmp.29_89 = PHI 
> >   # pos_lsm.34_53 = PHI 
> >   ...
> >
> > And then, these two duplicated PHI stmts have been kept until expand
> pass.
> >
> > In dom2, a stmt like below
> >
> >   # pos_lsm.34_54 = PHI 
> >
> > is transformed into
> >
> >   # pos_lsm.34_54 = PHI 
> >
> > just because they are equal.
> >
> > Unfortunately, this transformation changed back-end behavior to
> generate
> > redundant copy instructions and hurt performance. This case is from a
> real
> > benchmark and hurt performance a lot.
> >
> > Does the fix in r182140 intend to totally clean up this kind of
> redundancy?
> > Where should we get it fixed?
> >
> 
> Hi, sorry not to have responded sooner -- I just now got some time to
> look at this.
> 
> I compiled your code with -O3 for revisions 182139 and 182140 of GCC
> trunk, and found no difference between the code produced by the middle
> end for the two versions.  So something else has apparently come along
> since then that helped produce the problematic code generation for you.
> Either that or I need to use different compile flags; you didn't
> specify
> what you used.
> 
> The fix in r182140 does just what you saw in dom2:  identifies
> duplicate
> PHIs in the same block and ensures only one of them is used.  This
> actually avoids inserting extra blocks during expand in certain loop
> cases.  I am not sure why you are getting redundant copies as a result,
> but it sounds from your comments like IRA didn't coalesce a register
> copy or something like that.  You may want to bisect revisions on the
> trunk to see where your bad code generation started to occur to get a
> better handle on what happened.
> 
> As Richard said, the dom pass is likely to be removed someday, whenever
> someone can get around to it.  My redundant-phi band-aid in dom would
> go
> away then as well, but presumably an extra pass of PRE would replace it,
> and redundant PHIs would still be removed.
> 

Bill,

Thanks a lot for your explanation!

The bug could be exposed with -O2 if you apply the check-in r184435 made by 
Richard.

BTW, at present is there anybody working on the NEW pass replacing dom? If no, 
in short term, it would be good to still get it fixed just because it is 
hurting benchmark. If Yes, may I know what that NEW pass will be focusing on?

Thanks,
-Jiangning

> Thanks,
> Bill
> 
> > Thanks,
> > -Jiangning
> >
> >
> >
> 






A question about redundant PHI expression stmt

2012-02-24 Thread Jiangning Liu
Hi,

For the small case below, there are some redundant PHI expression stmt
generated, and finally cause back-end generates redundant copy instructions
due to some reasons around IRA.

int *l, *r, *g;
void test_func(int n)
{
int i;
static int j;
static int pos, direction, direction_pre;

pos = 0;
direction = 1;

for ( i = 0; i < n; i++ )
{
direction_pre = direction;

for ( j = 0; j <= 400; j++ )
{

if ( g[pos] == 200 )
break;

if ( direction == 0 )
pos = l[pos];
else
pos = r[pos];

if ( pos == -1 )
{
if ( direction_pre != direction )
break;

pos = 0;
direction = !direction;
}

}

f(g[pos]);
}
}

I know PR39976 has something to do with this case, and check-in r182140
caused big degradation on the real benchmark, but I'm still confusing around
this issue.

The procedure is like this,

Loop store motion generates code below,

:
  D.4084_17 = l.4_13 + D.4077_70;
  pos.5_18 = *D.4084_17;
  pos_lsm.34_103 = pos.5_18;
  goto ;

:
  D.4088_23 = r.6_19 + D.4077_70;
  pos.7_24 = *D.4088_23;
  pos_lsm.34_104 = pos.7_24;

:
  # prephitmp.29_89 = PHI 
  # pos_lsm.34_53 = PHI 
  pos.2_25 = prephitmp.29_89;
  if (pos.2_25 == -1)
goto ;
  else
goto ;

And then, copy propagation transforms block 8 it into 

:
  # prephitmp.29_89 = PHI 
  # pos_lsm.34_53 = PHI 
  ...

And then, these two duplicated PHI stmts have been kept until expand pass. 

In dom2, a stmt like below

  # pos_lsm.34_54 = PHI 

is transformed into

  # pos_lsm.34_54 = PHI 

just because they are equal.

Unfortunately, this transformation changed back-end behavior to generate
redundant copy instructions and hurt performance. This case is from a real
benchmark and hurt performance a lot.

Does the fix in r182140 intend to totally clean up this kind of redundancy?
Where should we get it fixed?

Thanks,
-Jiangning





RE: A problem about loop store motion

2012-02-21 Thread Jiangning Liu


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Richard Guenther
> Sent: Tuesday, February 21, 2012 6:19 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A problem about loop store motion
> 
> On Tue, Feb 21, 2012 at 10:57 AM, Jiangning Liu 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> >> Sent: Tuesday, February 21, 2012 5:40 PM
> >> To: Jiangning Liu
> >> Cc: gcc@gcc.gnu.org
> >> Subject: Re: A problem about loop store motion
> >>
> >> On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu
> 
> >> wrote:
> >> >> The MEM form is more "canonical", so the loop SM machinery to
> detect
> >> >> equality should be adjusted accordingly.  Alternatively you can
> >> teach
> >> >> PRE insertion to strip off the MEM if possible (though
> >> >> fold_stmt_inplace should
> >> >> arelady do this if possible).
> >> >
> >> > Richard,
> >> >
> >> > Thank you! You are right. I noticed on latest trunk the problem in
> >> PRE was
> >> > already fixed by invoking fold_stmt_inplace.
> >> >
> >> > Unfortunately for this small case, the latest trunk code still
> can't
> >> do SM
> >> > for variable pos, because refs_may_alias_p(*D.4074_10, pos) is
> true,
> >> that
> >> > is, pos has alias with l[pos].
> >> >
> >> > I think alias analysis should be able to know they don't have
> alias
> >> with
> >> > each other, unless there is an assignment statement like "l=&pos;".
> >> >
> >> > Can alias analysis fix the problem?
> >>
> >> The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think
> >> its address got taken.  'l' points to any global possibly address-
> taken
> >> variable.  It get's the TREE_ADDRESSABLE flag via PRE insertion
> which
> >> re-gimplifies the tree it creates which marks the variable as
> >> addressable.
> >>
> >> I'll have a look.
> >
> > Terrific! :-) Could you please let me know when you have a patch to
> fix
> > this, because I want to double check the big case, and there might be
> other
> > hidden problems?
> 
> PR52324, I am testing the following patch (in reality the gimplifier
> should not
> mark things addressable unless it itself makes them so, but frontends
> are
> broken, so we do that ... ugh).
> 

Richard,

Now trunk works for the big case as well. Thanks a lot for your quick fix.

BTW, why can't we fix front-end directly?

Thanks,
-Jiangning

> Index: gcc/gimplify.c
> ===
> --- gcc/gimplify.c  (revision 184428)
> +++ gcc/gimplify.c  (working copy)
> @@ -7061,15 +7061,20 @@ gimplify_expr (tree *expr_p, gimple_seq
>   ret = GS_OK;
>   break;
> }
> - ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p,
> post_p,
> -  is_gimple_mem_ref_addr, fb_rvalue);
> - if (ret == GS_ERROR)
> -   break;
> + /* Avoid re-gimplifying the address operand if it is already
> +in suitable form.  */
> + if (!is_gimple_mem_ref_addr (TREE_OPERAND (*expr_p, 0)))
> +   {
> + ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p,
> post_p,
> +  is_gimple_mem_ref_addr, fb_rvalue);
> + if (ret == GS_ERROR)
> +   break;
> +   }
>   recalculate_side_effects (*expr_p);
>   ret = GS_ALL_DONE;
>   break;
> 
> - /* Constants need not be gimplified.  */
> +   /* Constants need not be gimplified.  */
> case INTEGER_CST:
> case REAL_CST:
> case FIXED_CST:






RE: A problem about loop store motion

2012-02-21 Thread Jiangning Liu


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, February 21, 2012 5:40 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A problem about loop store motion
> 
> On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu 
> wrote:
> >> The MEM form is more "canonical", so the loop SM machinery to detect
> >> equality should be adjusted accordingly.  Alternatively you can
> teach
> >> PRE insertion to strip off the MEM if possible (though
> >> fold_stmt_inplace should
> >> arelady do this if possible).
> >
> > Richard,
> >
> > Thank you! You are right. I noticed on latest trunk the problem in
> PRE was
> > already fixed by invoking fold_stmt_inplace.
> >
> > Unfortunately for this small case, the latest trunk code still can't
> do SM
> > for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true,
> that
> > is, pos has alias with l[pos].
> >
> > I think alias analysis should be able to know they don't have alias
> with
> > each other, unless there is an assignment statement like "l=&pos;".
> >
> > Can alias analysis fix the problem?
> 
> The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think
> its address got taken.  'l' points to any global possibly address-taken
> variable.  It get's the TREE_ADDRESSABLE flag via PRE insertion which
> re-gimplifies the tree it creates which marks the variable as
> addressable.
> 
> I'll have a look.

Terrific! :-) Could you please let me know when you have a patch to fix
this, because I want to double check the big case, and there might be other
hidden problems?

Thanks,
-Jiangning

> 
> Richard.
> 
> 
> 
> > Thanks,
> > -Jiangning
> >
> >>
> >> Richard.
> >>
> >
> >
> >






RE: A problem about loop store motion

2012-02-21 Thread Jiangning Liu
> The MEM form is more "canonical", so the loop SM machinery to detect
> equality should be adjusted accordingly.  Alternatively you can teach
> PRE insertion to strip off the MEM if possible (though
> fold_stmt_inplace should
> arelady do this if possible).

Richard,

Thank you! You are right. I noticed on latest trunk the problem in PRE was
already fixed by invoking fold_stmt_inplace.

Unfortunately for this small case, the latest trunk code still can't do SM
for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true, that
is, pos has alias with l[pos].

I think alias analysis should be able to know they don't have alias with
each other, unless there is an assignment statement like "l=&pos;". 

Can alias analysis fix the problem?

Thanks,
-Jiangning

> 
> Richard.
> 





RE: A problem about loop store motion

2012-02-20 Thread Jiangning Liu


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Jiangning Liu
> Sent: Friday, February 17, 2012 5:53 PM
> To: gcc@gcc.gnu.org
> Subject: A problem about loop store motion
> 
> Hi,
> 
> For this small test case,
> 
> int *l, *r;
> int test_func(void)
> {
>   int i;
>   int direction;
>   static int pos;
> 
>   pos = 0;
>   direction = 1;
> 
>   for ( i = 0; i <= 400; i++ )
>   {
>   if ( direction == 0 )
>   pos = l[pos];
>   else
>   pos = r[pos];
> 
>   if ( pos == -1 )
>   {
>   pos = 0;
>   direction = !direction;
>   }
>   }
>   return i;
> }
> 
> In middle end, I don't see pos is sunk out of loop by loop store motion.
> Any
> idea?
> 
> The dump after lim is like below, and I expect a SSA symbole xxx_lsm
> could
> be created with this pass.
> 
> ;; Function test_func (test_func, funcdef_no=0, decl_uid=4057,
> cgraph_uid=0)
> 
> 
> Symbols to be put in SSA form
> { .MEM }
> Incremental SSA update started at block: 0
> Number of blocks in CFG: 12
> Number of blocks to update: 11 ( 92%)
> 
> 
> test_func ()
> {
>   int pretmp.14;
>   unsigned int pretmp.13;
>   int prephitmp.12;
>   int pretmp.11;
>   unsigned int pretmp.10;
>   int pretmp.9;
>   int D.4088;
>   static int pos;
>   int direction;
>   int i;
>   _Bool D.4082;
>   int pos.5;
>   int * D.4078;
>   int * r.4;
>   int pos.3;
>   int * D.4074;
>   unsigned int D.4073;
>   unsigned int pos.2;
>   int pos.1;
>   int * l.0;
> 
> :
>   pos = 0;
>   l.0_6 = l;
>   r.4_12 = r;
> 
> :
>   # i_32 = PHI 
>   # direction_37 = PHI 
>   # prephitmp.12_35 = PHI 
>   if (direction_37 == 0)
> goto ;
>   else
> goto ;
> 
> :
>   pos.1_7 = prephitmp.12_35;
>   pos.2_8 = (unsigned int) pos.1_7;
>   D.4073_9 = pos.2_8 * 4;
>   D.4074_10 = l.0_6 + D.4073_9;
>   pos.3_11 = *D.4074_10;
>   pos = pos.3_11;
>   goto ;
> 
> :
>   pos.1_13 = prephitmp.12_35;
>   pos.2_14 = (unsigned int) pos.1_13;
>   D.4073_15 = pos.2_14 * 4;
>   D.4078_16 = r.4_12 + D.4073_15;
>   pos.5_17 = *D.4078_16;
>   pos = pos.5_17;
> 
> :
>   # prephitmp.12_31 = PHI 
>   pos.1_18 = prephitmp.12_31;
>   if (pos.1_18 == -1)
> goto ;
>   else
> goto ;
> 
> :
>   goto ;
> 
> :
>   pos = 0;
>   D.4088_36 = direction_37 ^ 1;
>   direction_20 = D.4088_36 & 1;
> 
> :
>   # direction_2 = PHI 
>   i_21 = i_32 + 1;
>   if (i_21 != 401)
> goto ;
>   else
> goto ;
> 
> :
>   pretmp.11_1 = pos;

Hi,

pos in RHS seems to be transformed to MEM[&pos] by the code in
tree-ssa-sccvn.c like below,

case VAR_DECL:
  if (DECL_HARD_REGISTER (ref))
{
  temp.op0 = ref;
  break;
}
  /* Fallthru.  */
case PARM_DECL:
case CONST_DECL:
case RESULT_DECL:
  /* Canonicalize decls to MEM[&decl] which is what we end up with
 when valueizing MEM[ptr] with ptr = &decl.  */
  temp.opcode = MEM_REF;
  temp.op0 = build_int_cst (build_pointer_type (TREE_TYPE (ref)),
0);
  temp.off = 0;
  VEC_safe_push (vn_reference_op_s, heap, *result, &temp);
  temp.opcode = ADDR_EXPR;
  temp.op0 = build_fold_addr_expr (ref);
  temp.type = TREE_TYPE (temp.op0);
  temp.off = -1;
  break;

But the LHS doesn't have this kind of transformation on gimple,

So the code below in gather_mem_refs_stmt of tree-ssa-loop-im.c can't find
MEM[&pos] is just pos,

  hash = iterative_hash_expr (*mem, 0);
  slot = htab_find_slot_with_hash (memory_accesses.refs, *mem, hash,
INSERT);

Finally pos is set to be dependent on pretmp.11_1 at code below in
ref_indep_loop_p_1, and then loop store motion fails to find pos.

  EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
{
  aref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
  if (!refs_independent_p (ref, aref))
{
  ret = false;
  record_indep_loop (loop, aref, false);
  break;
}
}

Should we fix this problem by enhancing iterative_hash_expr, or we should
use MEM[&pos] to describe pos at all?

Any hints? I need to understand why we have to use MEM[&pos] to describe
pos? This looks strange to me.

BTW, this bug caused significant regression for an important benchmark.

Thanks,
-Jiangning

>   goto ;
> 
> :
>   return 401;
> 
> }
> 
> Thanks,
> -Jiangning
> 
> 
> 






A problem about loop store motion

2012-02-17 Thread Jiangning Liu
Hi,

For this small test case,

int *l, *r;
int test_func(void)
{
int i;
int direction;
static int pos;

pos = 0;
direction = 1;

for ( i = 0; i <= 400; i++ )
{
if ( direction == 0 )
pos = l[pos];
else
pos = r[pos];

if ( pos == -1 )
{
pos = 0;
direction = !direction;
}
}
return i;
}

In middle end, I don't see pos is sunk out of loop by loop store motion. Any
idea?

The dump after lim is like below, and I expect a SSA symbole xxx_lsm could
be created with this pass.

;; Function test_func (test_func, funcdef_no=0, decl_uid=4057, cgraph_uid=0)


Symbols to be put in SSA form
{ .MEM }
Incremental SSA update started at block: 0
Number of blocks in CFG: 12
Number of blocks to update: 11 ( 92%)


test_func ()
{
  int pretmp.14;
  unsigned int pretmp.13;
  int prephitmp.12;
  int pretmp.11;
  unsigned int pretmp.10;
  int pretmp.9;
  int D.4088;
  static int pos;
  int direction;
  int i;
  _Bool D.4082;
  int pos.5;
  int * D.4078;
  int * r.4;
  int pos.3;
  int * D.4074;
  unsigned int D.4073;
  unsigned int pos.2;
  int pos.1;
  int * l.0;

:
  pos = 0;
  l.0_6 = l;
  r.4_12 = r;

:
  # i_32 = PHI 
  # direction_37 = PHI 
  # prephitmp.12_35 = PHI 
  if (direction_37 == 0)
goto ;
  else
goto ;

:
  pos.1_7 = prephitmp.12_35;
  pos.2_8 = (unsigned int) pos.1_7;
  D.4073_9 = pos.2_8 * 4;
  D.4074_10 = l.0_6 + D.4073_9;
  pos.3_11 = *D.4074_10;
  pos = pos.3_11;
  goto ;

:
  pos.1_13 = prephitmp.12_35;
  pos.2_14 = (unsigned int) pos.1_13;
  D.4073_15 = pos.2_14 * 4;
  D.4078_16 = r.4_12 + D.4073_15;
  pos.5_17 = *D.4078_16;
  pos = pos.5_17;

:
  # prephitmp.12_31 = PHI 
  pos.1_18 = prephitmp.12_31;
  if (pos.1_18 == -1)
goto ;
  else
goto ;

:
  goto ;

:
  pos = 0;
  D.4088_36 = direction_37 ^ 1;
  direction_20 = D.4088_36 & 1;

:
  # direction_2 = PHI 
  i_21 = i_32 + 1;
  if (i_21 != 401)
goto ;
  else
goto ;

:
  pretmp.11_1 = pos;
  goto ;

:
  return 401;

}

Thanks,
-Jiangning





RE: A case exposing code sink issue

2012-01-05 Thread Jiangning Liu


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Thursday, January 05, 2012 5:32 PM
> To: Jiangning Liu
> Cc: Michael Matz; gcc@gcc.gnu.org
> Subject: Re: A case exposing code sink issue
> 
> On Thu, Jan 5, 2012 at 9:34 AM, Jiangning Liu 
> wrote:
> >> >> > In final value replacement, expression "&a + D." can be
> >> figured
> >> >> out,
> >> >> > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we
> >> should
> >> >> > lower
> >> >> > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC
> >> intends
> >> >> to
> >> >> > keep
> >> >> > &a[i_xxx] until cfgexpand pass.
> >
> > Richard,
> >
> > Thanks a lot for your explanation!
> >
> > Any comments for this question as well? Does GCC intends to keep &a[i]
> until
> > expand pass?
> 
> It's kept as it suits - it's shorter and easier to analyze (we know the
> implicit
> multiplication won't overflow)
> 
> > Any special reason? If I change CHREC algorithm, I see ivopt
> > would have done this lowering, so we wouldn't see &a[i] in expand
> pass. Any
> > hurt?
> 
> No, I think changing the CHREC algorithm is ok (I didn't look closely
> at
> your patch).  This is stage1 material though.

Actually I didn't send it out at all. :-) And I just send out a RFC here
http://gcc.gnu.org/ml/gcc-patches/2012-01/msg00236.html, can you have a
look?

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > -Jiangning
> >
> >
> >
> >






RE: A case exposing code sink issue

2012-01-05 Thread Jiangning Liu
> >> > In final value replacement, expression "&a + D." can be
> figured
> >> out,
> >> > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we
> should
> >> > lower
> >> > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC
> intends
> >> to
> >> > keep
> >> > &a[i_xxx] until cfgexpand pass. 

Richard,

Thanks a lot for your explanation!

Any comments for this question as well? Does GCC intends to keep &a[i] until
expand pass? Any special reason? If I change CHREC algorithm, I see ivopt
would have done this lowering, so we wouldn't see &a[i] in expand pass. Any
hurt?

Thanks,
-Jiangning






RE: A case exposing code sink issue

2011-12-29 Thread Jiangning Liu


> -Original Message-
> From: Jiangning Liu
> Sent: Wednesday, December 28, 2011 5:38 PM
> To: Jiangning Liu; 'Richard Guenther'
> Cc: Michael Matz; gcc@gcc.gnu.org
> Subject: RE: A case exposing code sink issue
> 
> 
> 
> > -Original Message-
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
> Of
> > Jiangning Liu
> > Sent: Tuesday, December 27, 2011 5:10 PM
> > To: 'Richard Guenther'
> > Cc: Michael Matz; gcc@gcc.gnu.org
> > Subject: RE: A case exposing code sink issue
> >
> > >
> > > The job to do this is final value replacement, not sinking (we do
> not
> > > sink non-invariant expressions - you'd have to translate them
> through
> > > the loop-closed SSA exit PHI node, certainly doable, patches
> > > welcome ;)).
> > >
> >
> > Richard,
> >
> > In final value replacement, expression "&a + D." can be figured
> out,
> > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we should
> > lower
> > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC intends
> to
> > keep
> > &a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC
> > algorithm to get it calculated?
> >
> > Appreciate your kindly help in advance!
> >
> 
> Richard,
> 
> Now I have a patch working for the case of step "i++", by directly
> modifying
> scalar evolution algorithm. the following code would be generated after
> SCCP,
> l
>   # i_13 = PHI 
>   a_p.0_4 = &a[i_13];
>   MEM[(int *)&a][i_13] = 100;
>   i_6 = i_13 + 1;
>   if (i_6 <= 999)
> goto ;
>   else
> goto ;
> 
> :
>   a_p_lsm.5_11 = &MEM[(void *)&a + 3996B];
>   a_p = a_p_lsm.5_11;
>   goto ;
> 
> It looks good, but I still have problem when the case has step "i+=k".
> 
> For this case the value of variable i exiting loop isn't invariant, the
> algorithm below in scalar evolution doesn't work on it,
> 
> compute_overall_effect_of_inner_loop()
> {
>   ...
> tree nb_iter = number_of_latch_executions (inner_loop);
> 
> if (nb_iter == chrec_dont_know)
>   return chrec_dont_know;
> else
>   {
> tree res;
> 
> /* evolution_fn is the evolution function in LOOP.  Get
>its value in the nb_iter-th iteration.  */
> res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);
> 
> if (chrec_contains_symbols_defined_in_loop (res, loop->num))
>   res = instantiate_parameters (loop, res);
> 
> /* Continue the computation until ending on a parent of LOOP.
> */
> return compute_overall_effect_of_inner_loop (loop, res);
>   }
> }
> 
> In theory, we can still have the transformation like below even if the
> step
> is "i+=k",
> 
>   # i_13 = PHI 
>   i_14 = i_13,
>   a_p.0_4 = &a[i_13];
>   MEM[(int *)&a][i_13] = 100;
>   i_6 = i_13 + k_2(D); // i+=k
>   if (i_6 <= 999)
> goto ;
>   else
> goto ;
> 
> :
>   a_p_lsm.5_11 = &a[i_14];
>   a_p = a_p_lsm.5_11;
>   goto ;
> 
> But I realize this is not a loop closed SSA form at all, because i_14
> is
> being used out of the loop. Where could we extend the liverange of
> variable
> i in GCC infrastructure and finally solve this problem?
> 

It seems many people are still in the happy of the upcoming 2012 New Year.
:-)

Following my story, I find the following code in tree-ssa-copy.c

  /* Avoid copy propagation from an inner into an outer loop.
 Otherwise, this may move loop variant variables outside of
 their loops and prevent coalescing opportunities.  If the
 value was loop invariant, it will be hoisted by LICM and
 exposed for copy propagation.
 ???  The value will be always loop invariant.
 In loop-closed SSA form do not copy-propagate through
 PHI nodes in blocks with a loop exit edge predecessor.  */
  if (current_loops
  && TREE_CODE (arg_value) == SSA_NAME
  && (loop_depth_of_name (arg_value) > loop_depth_of_name (lhs)
  || (loops_state_satisfies_p (LOOP_CLOSED_SSA)
  && loop_exit_edge_p (e->src->loop_father, e
{
  phi_val.value = lhs;
  break;
}

Here http://gcc.gnu.org/ml/gcc-patches/2006-12/msg00066.html, Dan said 

"The original check was not because of coalescing, but because we would
copy prop in-loop variables outside the loop, causing *more*
invariantness in nested loops."

Can anybody give me a concrete example to explain this statement?

Anyway, for my simple case, I don't see bad thing would happen when
propagate &a[i] out of the loop. Also, after this propagation, "a_p.0_4 =
&a[i_13];" within the loop would be dead code and removed in the passes
afterwards. In my opinion, that way the computation would be reduced in the
loop. Did I make any mistake?

> > Thanks,
> > -Jiangning
> >
> >
> >
> 
> 






RE: A case exposing code sink issue

2011-12-28 Thread Jiangning Liu


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Jiangning Liu
> Sent: Tuesday, December 27, 2011 5:10 PM
> To: 'Richard Guenther'
> Cc: Michael Matz; gcc@gcc.gnu.org
> Subject: RE: A case exposing code sink issue
> 
> >
> > The job to do this is final value replacement, not sinking (we do not
> > sink non-invariant expressions - you'd have to translate them through
> > the loop-closed SSA exit PHI node, certainly doable, patches
> > welcome ;)).
> >
> 
> Richard,
> 
> In final value replacement, expression "&a + D." can be figured out,
> while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we should
> lower
> &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC intends to
> keep
> &a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC
> algorithm to get it calculated?
> 
> Appreciate your kindly help in advance!
> 

Richard,

Now I have a patch working for the case of step "i++", by directly modifying
scalar evolution algorithm. the following code would be generated after
SCCP,
l
  # i_13 = PHI 
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + 1;
  if (i_6 <= 999)
goto ;
  else
goto ;

:
  a_p_lsm.5_11 = &MEM[(void *)&a + 3996B];
  a_p = a_p_lsm.5_11;
  goto ;

It looks good, but I still have problem when the case has step "i+=k". 

For this case the value of variable i exiting loop isn't invariant, the
algorithm below in scalar evolution doesn't work on it,

compute_overall_effect_of_inner_loop()
{
  ...
  tree nb_iter = number_of_latch_executions (inner_loop);

  if (nb_iter == chrec_dont_know)
return chrec_dont_know;
  else
{
  tree res;

  /* evolution_fn is the evolution function in LOOP.  Get
 its value in the nb_iter-th iteration.  */
  res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);

  if (chrec_contains_symbols_defined_in_loop (res, loop->num))
res = instantiate_parameters (loop, res);

  /* Continue the computation until ending on a parent of LOOP.
*/
  return compute_overall_effect_of_inner_loop (loop, res);
}
}

In theory, we can still have the transformation like below even if the step
is "i+=k",

  # i_13 = PHI 
  i_14 = i_13,
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + k_2(D); // i+=k
  if (i_6 <= 999)
goto ;
  else
goto ;

:
  a_p_lsm.5_11 = &a[i_14];
  a_p = a_p_lsm.5_11;
  goto ;

But I realize this is not a loop closed SSA form at all, because i_14 is
being used out of the loop. Where could we extend the liverange of variable
i in GCC infrastructure and finally solve this problem?

> Thanks,
> -Jiangning
> 
> 
> 






RE: A case exposing code sink issue

2011-12-27 Thread Jiangning Liu
> 
> The job to do this is final value replacement, not sinking (we do not
> sink non-invariant expressions - you'd have to translate them through
> the loop-closed SSA exit PHI node, certainly doable, patches
> welcome ;)).
> 

Richard,

In final value replacement, expression "&a + D." can be figured out,
while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we should lower
&a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC intends to keep
&a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC
algorithm to get it calculated?

Appreciate your kindly help in advance!

Thanks,
-Jiangning





RE: A case exposing code sink issue

2011-12-22 Thread Jiangning Liu
> Yes, the number of iterations of the i loop simply is too difficult for
> our loop iteration calculator to comprehend:
> 
>   for (i=k; i<500; i+=k)
> 
> iterates for roundup((500-k)/k) time.  In particular if the step is
> non-constant our nr-of-iteration calculator gives up.
> 

I'm trying to give an even smaller case,

int a[512] ;
int *a_p ;

int f(int k)
{
int i ;

for(i=0; i:
  # i_13 = PHI 
  # ivtmp.10_9 = PHI 
  a_p_lsm.6_4 = &a[i_13];
  ivtmp.10_1 = ivtmp.10_9 + 4;
  D.4085_16 = (void *) ivtmp.10_1;
  MEM[base: D.4085_16, offset: 0B] = 7;
  i_6 = i_13 + 1;
  if (i_6 != k_3(D))
goto ;
  else
goto ;

:
  # a_p_lsm.6_11 = PHI 
  a_p = a_p_lsm.6_11;
  goto ;

Why can't we still sunk &a[i_13] out of loop? For example, I expect to
generate the code like below,

:
  # i_13 = PHI 
  # ivtmp.10_9 = PHI 
  i_14 = i_13;
  ivtmp.10_1 = ivtmp.10_9 + 4;
  D.4085_16 = (void *) ivtmp.10_1;
  MEM[base: D.4085_16, offset: 0B] = 7;
  i_6 = i_13 + 1;
  if (i_6 != k_3(D))
goto ;
  else
goto ;

:
  # a_p_lsm.6_11 = PHI 
  a_p_lsm.6_4 = &a[i_14];
  a_p = a_p_lsm.6_11;
  goto ;

This way the computation of &a[i] would be saved within the loop. 

Any idea?

Thanks,
-Jiangning





RE: A case exposing code sink issue

2011-12-01 Thread Jiangning Liu


> -Original Message-
> From: Michael Matz [mailto:m...@suse.de]
> Sent: Monday, November 28, 2011 9:07 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: RE: A case exposing code sink issue
> 
> Hi,
> 
> On Mon, 28 Nov 2011, Jiangning Liu wrote:
> 
> > > > One more question...
> > > >
> > > > Can " i = i.6_18;" be sinked out of loop, because it doesn't have
> > > memory
> > > > dependence with others?
> > >
> > > With current trunk the stores to i, a_p, b_p and k are sunken after
> the
> > > loop.  (There are no aliasing problems because the decls can't
> > > conflict).
> > >
> > > What isn't sunken is the calculation of the &a[D.2248_7] expression.
> > > First, the number of iterations of the inner loop can't be
> determined
> > > by
> > > current code (replacing i+=k with e.g. i++ could be handled for
> > > instance).
> >
> > Hi Michael,
> >
> > Do you know what the essential problem is in the case of loop
> iteration
> > uncertainty?
> 
> Yes, the number of iterations of the i loop simply is too difficult for
> our loop iteration calculator to comprehend:
> 
>   for (i=k; i<500; i+=k)
> 
> iterates for roundup((500-k)/k) time.  In particular if the step is
> non-constant our nr-of-iteration calculator gives up.

So do you think this can be improved somewhere?

For this case, looking at the result in middle end, "a_p.2_8 =
&a[D.2248_7];" should be able to sunken out of loop. That way the
computation of &a[D.2248_7] would be saved in loop, although the consequence
is the liverange of D.2248_7 is longer and it needs to live out of loop. But
anyway the register pressure would be decreased within the loop, and we
would less possibly have spill/fill code. This is what I want.

I think we can simply use loop induction variable analysis to solve this
problem. Do you think so?

Thanks,
-Jiangning

> 
> > I thought it was still an aliasing problem.
> 
> No.  All accesses are resolved to final objects (i.e. no pointers), and
> hence can be trivially disambiguated.
> 
> 
> Ciao,
> Michael.






RE: A case exposing code sink issue

2011-11-27 Thread Jiangning Liu


> -Original Message-
> From: Michael Matz [mailto:m...@suse.de]
> Sent: Friday, November 25, 2011 11:23 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: RE: A case exposing code sink issue
> 
> Hi,
> 
> On Thu, 24 Nov 2011, Jiangning Liu wrote:
> 
> > One more question...
> >
> > Can " i = i.6_18;" be sinked out of loop, because it doesn't have
> memory
> > dependence with others?
> 
> With current trunk the stores to i, a_p, b_p and k are sunken after the
> loop.  (There are no aliasing problems because the decls can't
> conflict).
> 
> What isn't sunken is the calculation of the &a[D.2248_7] expression.
> First, the number of iterations of the inner loop can't be determined
> by
> current code (replacing i+=k with e.g. i++ could be handled for
> instance).

Hi Michael,

Do you know what the essential problem is in the case of loop iteration
uncertainty? I thought it was still an aliasing problem.

Thanks,
-Jiangning

> Then this code could be handled by final value replacement, but isn't
> because interpret_rhs_expr doesn't deal with ADDR_EXPR of ARRAY_REFs.
> 
> 
> Ciao,
> Michael.






RE: A case exposing code sink issue

2011-11-23 Thread Jiangning Liu
One more question...

Can " i = i.6_18;" be sinked out of loop, because it doesn't have memory
dependence with others?

Thanks,
-Jiangning

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Jiangning Liu
> Sent: Thursday, November 24, 2011 2:57 PM
> To: gcc@gcc.gnu.org
> Subject: RE: A case exposing code sink issue
> 
> Sorry, I realize we can't do that optimization because a_p may have
> dependence upon other memory accesses like MEM[(int *)&a][D.2248_7].
> 
> For example, if it happens a_p equals &a_p, that optimization would be
> wrong.
> 
> But can alias analysis solve the problem if we can guarantee (i+(1< is
> less than the upbound of array a's definition?
> 
> Or is there any GCC command line switch assuming no array bound
> overflow?
> That way we can do more aggressive optimizations, right?
> 
> Thanks,
> -Jiangning
> 
> > -Original Message-
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf
> Of
> > Jiangning Liu
> > Sent: Thursday, November 24, 2011 12:05 PM
> > To: gcc@gcc.gnu.org
> > Subject: A case exposing code sink issue
> >
> > Hi,
> >
> > For this small test case,
> >
> > int a[512] ;
> > int b[512] ;
> > int *a_p ;
> > int *b_p ;
> > int i ;
> > int k ;
> >
> > int f(void)
> > {
> > for( k = 1 ; k <= 9 ; k++)
> > {
> > for( i = k ; i < 512 ; i += k)
> > {
> > a_p = &a[i + (1< > b_p = &b[i + (1< > *a_p = 7 ;
> > *b_p = 7 ;
> > }
> > }
> > }
> >
> > Before sink pass we have,
> >
> > f ()
> > {
> >   int pretmp.11;
> >   int k.7;
> >   int i.6;
> >   int * b_p.3;
> >   int * a_p.2;
> >   int D.2248;
> >   int i.1;
> >   int D.2246;
> >   int k.0;
> >
> > :
> >   k = 1;
> >
> > :
> >   # k.0_9 = PHI 
> >   i = k.0_9;
> >   if (k.0_9 <= 511)
> > goto ;
> >   else
> > goto ;
> >
> > :
> > Invalid sum of incoming frequencies 900, should be 81
> >   goto ;
> >
> > :
> >   pretmp.11_19 = 1 << k.0_9;
> >
> > :
> >   # i.1_34 = PHI 
> >   D.2246_5 = pretmp.11_19;
> >   D.2248_7 = i.1_34 + D.2246_5;
> >   a_p.2_8 = &a[D.2248_7];
> >   a_p = a_p.2_8;
> >   b_p.3_13 = &b[D.2248_7];
> >   b_p = b_p.3_13;
> >   MEM[(int *)&a][D.2248_7] = 7;
> >   MEM[(int *)&b][D.2248_7] = 7;
> >   i.6_18 = k.0_9 + i.1_34;
> >   i = i.6_18;
> >   if (i.6_18 <= 511)
> > goto ;
> >   else
> > goto ;
> >
> > :
> >   goto ;
> >
> > :
> > Invalid sum of incoming frequencies 81, should be 900
> >   k.7_20 = k.0_9 + 1;
> >   k = k.7_20;
> >   if (k.7_20 <= 9)
> > goto ;
> >   else
> > goto ;
> >
> > :
> >   goto ;
> >
> > :
> >   return;
> >
> > }
> >
> > Can the following statements be sinked out of loop? I don't see this
> > optimization happen in trunk. The consequence is register pressure
> > increased
> > and a spill/fill occurs in RA.
> >
> >   a_p.2_8 = &a[D.2248_7];
> >   a_p = a_p.2_8;
> >   b_p.3_13 = &b[D.2248_7];
> >   b_p = b_p.3_13;
> >
> > I know the sink would happen in sink pass if a_p and b_p are local
> > variables.
> >
> > If this is the root cause, which optimization pass in GCC take the
> role
> > to
> > sink them out of loop? How should we get it fixed?
> >
> > Thanks,
> > -Jiangning
> >
> >
> >
> 
> 
> 
> 






RE: A case exposing code sink issue

2011-11-23 Thread Jiangning Liu
Sorry, I realize we can't do that optimization because a_p may have
dependence upon other memory accesses like MEM[(int *)&a][D.2248_7].

For example, if it happens a_p equals &a_p, that optimization would be
wrong.

But can alias analysis solve the problem if we can guarantee (i+(1< -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Jiangning Liu
> Sent: Thursday, November 24, 2011 12:05 PM
> To: gcc@gcc.gnu.org
> Subject: A case exposing code sink issue
> 
> Hi,
> 
> For this small test case,
> 
> int a[512] ;
> int b[512] ;
> int *a_p ;
> int *b_p ;
> int i ;
> int k ;
> 
> int f(void)
> {
> for( k = 1 ; k <= 9 ; k++)
> {
> for( i = k ; i < 512 ; i += k)
> {
> a_p = &a[i + (1< b_p = &b[i + (1< *a_p = 7 ;
> *b_p = 7 ;
> }
> }
> }
> 
> Before sink pass we have,
> 
> f ()
> {
>   int pretmp.11;
>   int k.7;
>   int i.6;
>   int * b_p.3;
>   int * a_p.2;
>   int D.2248;
>   int i.1;
>   int D.2246;
>   int k.0;
> 
> :
>   k = 1;
> 
> :
>   # k.0_9 = PHI 
>   i = k.0_9;
>   if (k.0_9 <= 511)
> goto ;
>   else
> goto ;
> 
> :
> Invalid sum of incoming frequencies 900, should be 81
>   goto ;
> 
> :
>   pretmp.11_19 = 1 << k.0_9;
> 
> :
>   # i.1_34 = PHI 
>   D.2246_5 = pretmp.11_19;
>   D.2248_7 = i.1_34 + D.2246_5;
>   a_p.2_8 = &a[D.2248_7];
>   a_p = a_p.2_8;
>   b_p.3_13 = &b[D.2248_7];
>   b_p = b_p.3_13;
>   MEM[(int *)&a][D.2248_7] = 7;
>   MEM[(int *)&b][D.2248_7] = 7;
>   i.6_18 = k.0_9 + i.1_34;
>   i = i.6_18;
>   if (i.6_18 <= 511)
> goto ;
>   else
> goto ;
> 
> :
>   goto ;
> 
> :
> Invalid sum of incoming frequencies 81, should be 900
>   k.7_20 = k.0_9 + 1;
>   k = k.7_20;
>   if (k.7_20 <= 9)
> goto ;
>   else
> goto ;
> 
> :
>   goto ;
> 
> :
>   return;
> 
> }
> 
> Can the following statements be sinked out of loop? I don't see this
> optimization happen in trunk. The consequence is register pressure
> increased
> and a spill/fill occurs in RA.
> 
>   a_p.2_8 = &a[D.2248_7];
>   a_p = a_p.2_8;
>   b_p.3_13 = &b[D.2248_7];
>   b_p = b_p.3_13;
> 
> I know the sink would happen in sink pass if a_p and b_p are local
> variables.
> 
> If this is the root cause, which optimization pass in GCC take the role
> to
> sink them out of loop? How should we get it fixed?
> 
> Thanks,
> -Jiangning
> 
> 
> 






RE: A case exposing code sink issue

2011-11-23 Thread Jiangning Liu


> -Original Message-
> From: Andrew Pinski [mailto:pins...@gmail.com]
> Sent: Thursday, November 24, 2011 12:15 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A case exposing code sink issue
> 
> On Wed, Nov 23, 2011 at 8:05 PM, Jiangning Liu 
> wrote:
> > If this is the root cause, which optimization pass in GCC take the
> role to
> > sink them out of loop? How should we get it fixed?
> 
> lim1 handles the case just fine for me.  lim1 is the first loop pass.
> 
> After lim1 I get:
> 
> :
>   # i.1_34 = PHI 
>   D.2934_5 = pretmp.11_33;
>   D.2936_7 = i.1_34 + D.2934_5;
>   a_p.2_8 = &a[D.2936_7];
>   a_p_lsm.13_37 = a_p.2_8;
>   b_p.3_13 = &b[D.2936_7];
>   b_p_lsm.14_38 = b_p.3_13;
>   MEM[(int *)&a][D.2936_7] = 7;
>   MEM[(int *)&b][D.2936_7] = 7;
>   i.6_18 = k.0_9 + i.1_34;
>   i_lsm.12_39 = i.6_18;
>   if (i.6_18 <= 511)
> goto ;
>   else
> goto ;
> 
> :
>   goto ;
> 

&a[D.2936_7] and &b[D.2936_7] are not loop invariants, so it seems lim1 
shouldn't be able to sink them, right? Do I misunderstand this optimization?

Thanks,
-Jiangning

> 
> 
> 
> Thanks,
> Andrew Pinski






A case exposing code sink issue

2011-11-23 Thread Jiangning Liu
Hi,

For this small test case,

int a[512] ;
int b[512] ;
int *a_p ;
int *b_p ;
int i ;
int k ;

int f(void)
{
for( k = 1 ; k <= 9 ; k++)
{
for( i = k ; i < 512 ; i += k)
{
a_p = &a[i + (1<:
  k = 1;

:
  # k.0_9 = PHI 
  i = k.0_9;
  if (k.0_9 <= 511)
goto ;
  else
goto ;

:
Invalid sum of incoming frequencies 900, should be 81
  goto ;

:
  pretmp.11_19 = 1 << k.0_9;

:
  # i.1_34 = PHI 
  D.2246_5 = pretmp.11_19;
  D.2248_7 = i.1_34 + D.2246_5;
  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;
  MEM[(int *)&a][D.2248_7] = 7;
  MEM[(int *)&b][D.2248_7] = 7;
  i.6_18 = k.0_9 + i.1_34;
  i = i.6_18;
  if (i.6_18 <= 511)
goto ;
  else
goto ;

:
  goto ;

:
Invalid sum of incoming frequencies 81, should be 900
  k.7_20 = k.0_9 + 1;
  k = k.7_20;
  if (k.7_20 <= 9)
goto ;
  else
goto ;

:
  goto ;

:
  return;

}

Can the following statements be sinked out of loop? I don't see this
optimization happen in trunk. The consequence is register pressure increased
and a spill/fill occurs in RA.

  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;

I know the sink would happen in sink pass if a_p and b_p are local
variables. 

If this is the root cause, which optimization pass in GCC take the role to
sink them out of loop? How should we get it fixed?

Thanks,
-Jiangning





RE: Is VRP is too conservative to identify boolean value 0 and 1?

2011-11-20 Thread Jiangning Liu


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Friday, September 02, 2011 5:07 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is VRP is too conservative to identify boolean value 0 and
> 1?
> 
> On Fri, Sep 2, 2011 at 7:58 AM, Jiangning Liu 
> wrote:
> > Hi,
> >
> > For the following small case,
> >
> > int f(int i, int j)
> > {
> >        if (i==1 && j==2)
> >                return i;
> >        else
> >                return j;
> > }
> >
> > with -O2 option, GCC has vrp2 dump like below,
> >
> > ==
> >
> > Value ranges after VRP:
> >
> > i_1: VARYING
> > i_2(D): VARYING
> > D.1249_3: [0, +INF]
> > j_4(D): VARYING
> > D.1250_5: [0, +INF]
> > D.1251_6: [0, +INF]
> > j_10: [2, 2]  EQUIVALENCES: { j_4(D) } (1 elements)
> >
> >
> > Removing basic block 3
> > f (int i, int j)
> > {
> >  _Bool D.1251;
> >  _Bool D.1250;
> >  _Bool D.1249;
> >
> > :
> >  D.1249_3 = i_2(D) == 1;
> >  D.1250_5 = j_4(D) == 2;
> >  D.1251_6 = D.1250_5 & D.1249_3;
> >  if (D.1251_6 != 0)
> >    goto ;
> >  else
> >    goto ;
> >
> > :
> >
> > :
> >  # i_1 = PHI <1(3), j_4(D)(2)>
> >  return i_1;
> >
> > }
> >
> > 
> >
> > Variable D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so
> the
> > their value ranges should be
> >
> > D.1249_3: [0, 1]
> > D.1250_5: [0, 1]
> > D.1251_6: [0, 1]
> >
> > So why current VRP can't find out this value range?
> 
> It does - it just prints it as [0, +INF], they are bools with
> TYPE_MAX_VALUE
> == 1 after all.

Richard,

May I use REG_EXPR(rtx of D.1249_3) in xxx.md file to detect whether
D.1249_3 is a bool or not? 

Some comments in GCC says REG_EXPR may be lost in back-end. True?

If we do have REG_EXPR info for some cases in back-end, is it guaranteed to
be correct? May I implementing back-end peephole optimization depending on
REG_EXPR?

Thanks,
-Jiangning

> 
> Richard.
> 
> >
> > I'm asking this question because the optimizations in back-end need
> this
> > info to do advanced optimization.
> >
> > Thanks,
> > -Jiangning
> >
> >
> >






A question about redudant load elimination

2011-11-14 Thread Jiangning Liu
Hi,

For this test case,

int x;
extern void f(void);

void g(int *a)
{
a[x] = 1;
if (x == 100)
f();
a[x] = 2;
}

For trunk, the x86 assembly code is like below,

movlx, %eax
movl16(%esp), %ebx
movl$1, (%ebx,%eax,4)
movlx, %eax   // Is this a redundant one?
cmpl$100, %eax
je  .L4
movl$2, (%ebx,%eax,4)
addl$8, %esp
.cfi_remember_state
.cfi_def_cfa_offset 8
popl%ebx
.cfi_restore 3
.cfi_def_cfa_offset 4
ret
.p2align 4,,7
.p2align 3
.L4:
.cfi_restore_state
callf
movlx, %eax
movl$2, (%ebx,%eax,4)
addl$8, %esp
.cfi_def_cfa_offset 8
popl%ebx
.cfi_restore 3
.cfi_def_cfa_offset 4
Ret

Is the 2nd "movl x, %eax" is a redundant one for single thread programming
model? If yes, can this be optimized away?

Thanks,
-Jiangning





RE: A case that PRE optimization hurts performance

2011-09-26 Thread Jiangning Liu


> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Tuesday, September 27, 2011 12:43 AM
> To: Richard Guenther
> Cc: Jiangning Liu; gcc@gcc.gnu.org
> Subject: Re: A case that PRE optimization hurts performance
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 09/26/11 05:00, Richard Guenther wrote:
> > On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu
> >  wrote:
> >>>> * Without PRE,
> >>>>
> >>>> Path1: movl$2, %eax cmpl$1, %eax je  .L3
> >>>>
> >>>> Path2: movb$3, %al cmpl$1, %eax je  .L3
> >>>>
> >>>> Path3: cmpl$1, %eax jne .L14
> >>>>
> >>>> * With PRE,
> >>>>
> >>>> Path1: movl$1, %ebx movl$2, %eax testb   %bl, %bl je
> >>>> .L3
> >>>>
> >>>> Path2: movl$1, %ebx movb$3, %al testb   %bl, %bl je
> >>>> .L3
> >>>>
> >>>> Path3: cmpl$1, %eax setne   %bl testb   %bl, %bl jne
> >>>> .L14
> >>>>
> >>>> Do you have any more thoughts?
> >>>
> >>> It seems to me that with PRE all the testb %bl, %bl should be
> >>> evaluated at compile-time considering the preceeding movl $1,
> >>> %ebx.  Am I missing something?
> >>>
> >>
> >> Yes. Can this be done by PRE or any other optimizations in middle
> >> end?
> >
> > Hm, the paths as you quote them are obfuscated by missed
> > jump-threading. On the tree level we have
> >
> > # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)> # prephitmp.6_1 = PHI
> > <1(5), 1(4), 1(6), prephitmp.6_3(7)> : t_14 = t_24 + 1;
> > D.2729_6 = MEM[base: t_14, offset: 0B]; D.2732_7 = D.2729_6 != 0;
> > D.2734_9 = prephitmp.6_1 & D.2732_7; if (D.2734_9 != 0)
> >
> > where we could thread the cases with prephitmp.6_1 == 1,
> > ultimately removing the & and forwarding the D.2729_6 != 0 test.
> > Which would of course cause some code duplication.
> >
> > Jeff, you recently looked at tree jump-threading, can you see if
> > we can improve things on this particular testcase?
> There's nothing threading can do here because it doesn't know anything
> about the value MEM[t14].
> 

Jeff, 

Could you please explain more about this? What information does jump
threading want to know on MEM[t14]? Do you mean it's hard to duplicate that
basic block due to some reasons?

Thanks,
-Jiangning

> 
> Jeff
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iQEcBAEBAgAGBQJOgKuLAAoJEBRtltQi2kC75aIH/iikuOQXrMrQJFbQw0COXznB
> OGq8iXdGwTJGH13vxdItTE0upJp7RgUVLzuhdqj1elTLHv/ujYygMsQRNGKcc8tb
> GMLECmWDhZqQTFXcTJCgJNZiv7MH1PNELXSdIkkSnxY+pwyn9AX5D3+HcTSjGU6B
> 51AdUNVph/VSaVboAgcrFpu9S0pX9HVTqFy4JI83Lh613zDVSmPo14DDy7vjBvE9
> 2Srlvlw0srYup97bGmRqN8wT4ZLLlyYSB2rjEFc6jmgXVncxiteQYIUZpy0lcC0M
> q3j80aXjZ57/iWyAbqDr1jI5tbVKDBkRa9LL1jvn9534adiG4GrnSMPhoog0ibA=
> =azr5
> -END PGP SIGNATURE-






RE: A case that PRE optimization hurts performance

2011-09-26 Thread Jiangning Liu
> > * Without PRE,
> >
> > Path1:
> >        movl    $2, %eax
> >        cmpl    $1, %eax
> >        je      .L3
> >
> > Path2:
> >        movb    $3, %al
> >        cmpl    $1, %eax
> >        je      .L3
> >
> > Path3:
> >        cmpl    $1, %eax
> >        jne     .L14
> >
> > * With PRE,
> >
> > Path1:
> >        movl    $1, %ebx
> >        movl    $2, %eax
> >        testb   %bl, %bl
> >        je      .L3
> >
> > Path2:
> >        movl    $1, %ebx
> >        movb    $3, %al
> >        testb   %bl, %bl
> >        je      .L3
> >
> > Path3:
> >        cmpl    $1, %eax
> >        setne   %bl
> >        testb   %bl, %bl
> >        jne     .L14
> >
> > Do you have any more thoughts?
> 
> It seems to me that with PRE all the testb %bl, %bl
> should be evaluated at compile-time considering the
> preceeding movl $1, %ebx.  Am I missing something?
> 

Yes. Can this be done by PRE or any other optimizations in middle end?

Thanks,
-Jiangning

> Richard.
> 






RE: A question about detecting array bounds for case Warray-bounds-3.c

2011-09-26 Thread Jiangning Liu
PING...

> -Original Message-
> From: Jiangning Liu [mailto:jiangning@arm.com]
> Sent: Thursday, September 22, 2011 10:19 AM
> To: gcc@gcc.gnu.org
> Cc: 'ja...@gcc.gnu.org'; 'muel...@gcc.gnu.org'; 'rgue...@gcc.gnu.org';
> Matthew Gretton-Dann
> Subject: A question about detecting array bounds for case Warray-
> bounds-3.c
> 
> Hi,
> 
> For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, obviously it is an
> invalid C program, because the last iterations of all the loops cause
> the access of arrays is beyond the max size of corresponding array
> declarations. The condition of checking upper bound should be "<"
> rather than "<=".
> 
> Right now, GCC compiler doesn't report any warning messages for this
> case, should it be a bug in both test case and compiler?
> 
> But looking at http://gcc.gnu.org/PR31227 , it seems this test case is
> designed to be like this on purpose. Anybody can explain about this?
> 
> The case is like below,
> 
> /* { dg-do compile } */
> /* { dg-options "-O2 -Warray-bounds" } */
> /* based on PR 31227 */
> 
> struct S
> {
>   const char *abday[7];
>   const char *day[7];
>   const char *abmon[12];
>   const char *mon[12];
>   const char *am_pm[2];
> };
> 
> ...
> 
>   for (cnt = 0; cnt <= 7; ++cnt)
> {
>   iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: "");
>   iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
> }
> 
>   for (; cnt <= 14; ++cnt)
> {
>   iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: "");
>   iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
> }
> 
>   for (; cnt <= 26; ++cnt)
> {
>   iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: "");
>   iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
> }
> 
>   for (; cnt <= 38; ++cnt)
> {
>   iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: "");
>   iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
> }
> 
>   for (; cnt <= 40; ++cnt)
> {
>   iov[2 + cnt].iov_base =  (void *) (time->am_pm[cnt - 38] ?: "");
>   iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
> }
> 
> Thanks,
> -Jiangning





A question about detecting array bounds for case Warray-bounds-3.c

2011-09-21 Thread Jiangning Liu
Hi,

For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, obviously it is an invalid
C program, because the last iterations of all the loops cause the access of
arrays is beyond the max size of corresponding array declarations. The
condition of checking upper bound should be "<" rather than "<=". 

Right now, GCC compiler doesn't report any warning messages for this case,
should it be a bug in both test case and compiler?

But looking at http://gcc.gnu.org/PR31227 , it seems this test case is
designed to be like this on purpose. Anybody can explain about this?

The case is like below,

/* { dg-do compile } */
/* { dg-options "-O2 -Warray-bounds" } */
/* based on PR 31227 */

struct S
{
  const char *abday[7];
  const char *day[7];
  const char *abmon[12];
  const char *mon[12];
  const char *am_pm[2];
};

...

  for (cnt = 0; cnt <= 7; ++cnt)
{
  iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: "");
  iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
}

  for (; cnt <= 14; ++cnt)
{
  iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: "");
  iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
}

  for (; cnt <= 26; ++cnt)
{
  iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: "");
  iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
}

  for (; cnt <= 38; ++cnt)
{
  iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: "");
  iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
}

  for (; cnt <= 40; ++cnt)
{
  iov[2 + cnt].iov_base =  (void *) (time->am_pm[cnt - 38] ?: "");
  iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1;
}

Thanks,
-Jiangning





RE: A case that PRE optimization hurts performance

2011-09-15 Thread Jiangning Liu
Hi Richard,

I slightly changed the case to be like below,

int f(char *t) {
int s=0;

while (*t && s != 1) {
switch (s) {
case 0:   /* path 1 */
s = 2;
break;
case 2:   /* path 2 */
s = 3; /* changed */
break;
default:  /* path 3 */
if (*t == '-') 
s = 2;
break;
}
t++;
}

return s;
}

"-O2" is still worse than "-O2 -fno-tree-pre". 

"-O2 -fno-tree-pre" result is 

f:
pushl   %ebp
xorl%eax, %eax
movl%esp, %ebp
movl8(%ebp), %edx
movzbl  (%edx), %ecx
jmp .L14
.p2align 4,,7
.p2align 3
.L5:
movl$2, %eax
.L7:
addl$1, %edx
cmpl$1, %eax
movzbl  (%edx), %ecx
je  .L3
.L14:
testb   %cl, %cl
je  .L3
testl   %eax, %eax
je  .L5
cmpl$2, %eax
.p2align 4,,5
je  .L17
cmpb$45, %cl
.p2align 4,,5
je  .L5
addl$1, %edx
cmpl$1, %eax
movzbl  (%edx), %ecx
jne .L14
.p2align 4,,7
.p2align 3
.L3:
popl%ebp
.p2align 4,,2
ret
.p2align 4,,7
.p2align 3
.L17:
movb$3, %al
.p2align 4,,3
jmp .L7

While "-O2" result is 

f:
pushl   %ebp
xorl%eax, %eax
movl%esp, %ebp
movl8(%ebp), %edx
pushl   %ebx
movzbl  (%edx), %ecx
jmp .L14
.p2align 4,,7
.p2align 3
.L5:
movl$1, %ebx
movl$2, %eax
.L7:
addl$1, %edx
testb   %bl, %bl
movzbl  (%edx), %ecx
je  .L3
.L14:
testb   %cl, %cl
je  .L3
testl   %eax, %eax
je  .L5
cmpl$2, %eax
.p2align 4,,5
je  .L16
cmpb$45, %cl
.p2align 4,,5
je  .L5
cmpl$1, %eax
setne   %bl
addl$1, %edx
testb   %bl, %bl
movzbl  (%edx), %ecx
jne .L14
.p2align 4,,7
.p2align 3
.L3:
popl%ebx
popl%ebp
ret
.p2align 4,,7
.p2align 3
.L16:
movl$1, %ebx
movb$3, %al
jmp .L7

You may notice that register ebx is introduced, and some more instructions
around ebx are generated as well. i.e.

setne   %bl
testb   %bl, %bl

I agree with you that in theory PRE does the right thing to minimize the
computation cost on gimple level. However, the problem is the cost of
converting comparison result to a bool value is not considered, so it
actually makes binary code worse. For this case, as I summarized below, to
complete the same functionality "With PRE" is worse than "Without PRE" for
all three paths,

* Without PRE,

Path1:
movl$2, %eax
cmpl$1, %eax
je  .L3

Path2:
movb$3, %al
cmpl$1, %eax
je  .L3

Path3:
cmpl$1, %eax
jne .L14

* With PRE,

Path1:
movl$1, %ebx
movl$2, %eax
testb   %bl, %bl
je  .L3

Path2:
movl$1, %ebx
movb$3, %al
testb   %bl, %bl
je  .L3

Path3:
cmpl$1, %eax
setne   %bl
testb   %bl, %bl
jne .L14

Do you have any more thoughts?

Thanks,
-Jiangning

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, August 02, 2011 5:23 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A case that PRE optimization hurts performance
> 
> On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu 
> wrote:
> > Hi,
> >
> > For the following simple test case, PRE optimization hoists
> computation
> > (s!=1) into the default branch of the switch statement, and finally
> causes
> > very poor code generation. This problem occurs in both X86 and ARM,
> and I
> > believe it is also a problem for other targets.
> >
> > int f(char *t) {
> >    int s=0;
> >
> >    while (*t && s != 1) {
> >        switch (s) {
> >        case 0:
> >            s = 2;
> >            break;
> >        case 2:
> >            s = 1;
> >            break;
> >        default:
> >            if (*t == '-')
> >                s = 1;
> >            break;
> >        }
> >        t++;
> >    }
> >
> >    return s;
> > }
> >
> > Taking X86 as an example, with option "-O2" you may find 52
> instructions
> > generated like below,
> >
> >  :
> 

RE: Is VRP is too conservative to identify boolean value 0 and 1?

2011-09-02 Thread Jiangning Liu
Andrew,

I realize I needn't back-end solution for my case at all, and in middle end I 
can directly use the _Bool type info! Appreciate your reply!

Thanks,
-Jiangning


> -Original Message-
> From: Andrew Pinski [mailto:pins...@gmail.com]
> Sent: Friday, September 02, 2011 2:27 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is VRP is too conservative to identify boolean value 0 and
> 1?
> 
> On Thu, Sep 1, 2011 at 10:58 PM, Jiangning Liu 
> wrote:
> > D.1249_3: [0, 1]
> > D.1250_5: [0, 1]
> > D.1251_6: [0, 1]
> 
> Those are equivalent to [0, MAX] as _Bool only has two different
> values, 0 and 1 (MAX).  Can you explain more about the optimization
> which you are working on that needs the ranges as (int)[0,1] rather
> than (_Bool)[0,MAX] ?
> 
> Thanks,
> Andrew Pinski





Is VRP is too conservative to identify boolean value 0 and 1?

2011-09-01 Thread Jiangning Liu
Hi,

For the following small case,

int f(int i, int j)
{
if (i==1 && j==2)
return i;
else
return j;
}

with -O2 option, GCC has vrp2 dump like below,

==

Value ranges after VRP:

i_1: VARYING
i_2(D): VARYING
D.1249_3: [0, +INF]
j_4(D): VARYING
D.1250_5: [0, +INF]
D.1251_6: [0, +INF]
j_10: [2, 2]  EQUIVALENCES: { j_4(D) } (1 elements)


Removing basic block 3
f (int i, int j)
{
  _Bool D.1251;
  _Bool D.1250;
  _Bool D.1249;

:
  D.1249_3 = i_2(D) == 1;
  D.1250_5 = j_4(D) == 2;
  D.1251_6 = D.1250_5 & D.1249_3;
  if (D.1251_6 != 0)
goto ;
  else
goto ;

:

:
  # i_1 = PHI <1(3), j_4(D)(2)>
  return i_1;

}



Variable D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so the
their value ranges should be

D.1249_3: [0, 1]
D.1250_5: [0, 1]
D.1251_6: [0, 1]

So why current VRP can't find out this value range?

I'm asking this question because the optimizations in back-end need this
info to do advanced optimization.

Thanks,
-Jiangning




RE: [RFC] Add middle end hook for stack red zone size

2011-08-01 Thread Jiangning Liu
Hi Jakub,

Appreciate for your valuable comments!

I think SPARC V9 ABI doesn't have red zone defined, right? So
stack_red_zone_size should be defined as zero by default, the scheduler
would block moving memory accesses across stack adjustment no matter what
the offset is. I don't see any risk here. Also, in my patch function *abs*
is being used to avoid the opposite stack direction issue as you mentioned.

Some people like you insist on the ABI diversity, and actually I agree with
you on this. But part of the ABI definition is general for all targets. The
point here is memory access beyond stack red zone should be avoided, which
is the general part of ABI that compiler should guarantee. For this general
part, middle end should take the responsibility.

Thanks,
-Jiangning

> -Original Message-
> From: Jakub Jelinek [mailto:ja...@redhat.com]
> Sent: Monday, August 01, 2011 6:31 PM
> To: Jiangning Liu
> Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org;
> vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana
> Radhakrishnan; 'Ramana Radhakrishnan'
> Subject: Re: [RFC] Add middle end hook for stack red zone size
> 
> On Mon, Aug 01, 2011 at 06:14:27PM +0800, Jiangning Liu wrote:
> > ARM. You are right, they were all fixed in back-ends in the past, but
> we
> > should
> > fix the bug in a general way to make GCC infrastructure stronger,
> rather
> > than fixing the problem target-by-target and case-by-case! If you
> further
> > look into the back-end fixes in x86 and PowerPC, you may find they
> looks
> > quite similar in back-ends.
> >
> 
> Red zone is only one difficulty, your patch is e.g. completely ignoring
> existence of biased stack pointers (e.g. SPARC -m64 has them).
> Some targets have stack growing in opposite direction, etc.
> We have really a huge amount of very diverse ABIs and making the
> middle-end
> grok what is an invalid stack access is difficult.
> 
>   Jakub






A case that PRE optimization hurts performance

2011-08-01 Thread Jiangning Liu
Hi,

For the following simple test case, PRE optimization hoists computation
(s!=1) into the default branch of the switch statement, and finally causes
very poor code generation. This problem occurs in both X86 and ARM, and I
believe it is also a problem for other targets. 

int f(char *t) {
int s=0;

while (*t && s != 1) {
switch (s) {
case 0:
s = 2;
break;
case 2:
s = 1;
break;
default:
if (*t == '-') 
s = 1;
break;
}
t++;
}

return s;
}

Taking X86 as an example, with option "-O2" you may find 52 instructions
generated like below,

 :
   0:   55  push   %ebp
   1:   31 c0   xor%eax,%eax
   3:   89 e5   mov%esp,%ebp
   5:   57  push   %edi
   6:   56  push   %esi
   7:   53  push   %ebx
   8:   8b 55 08mov0x8(%ebp),%edx
   b:   0f b6 0amovzbl (%edx),%ecx
   e:   84 c9   test   %cl,%cl
  10:   74 50   je 62 
  12:   83 c2 01add$0x1,%edx
  15:   85 c0   test   %eax,%eax
  17:   75 23   jne3c 
  19:   8d b4 26 00 00 00 00lea0x0(%esi,%eiz,1),%esi
  20:   0f b6 0amovzbl (%edx),%ecx
  23:   84 c9   test   %cl,%cl
  25:   0f 95 c0setne  %al
  28:   89 c7   mov%eax,%edi
  2a:   b8 02 00 00 00  mov$0x2,%eax
  2f:   89 fb   mov%edi,%ebx
  31:   83 c2 01add$0x1,%edx
  34:   84 db   test   %bl,%bl
  36:   74 2a   je 62 
  38:   85 c0   test   %eax,%eax
  3a:   74 e4   je 20 
  3c:   83 f8 02cmp$0x2,%eax
  3f:   74 1f   je 60 
  41:   80 f9 2dcmp$0x2d,%cl
  44:   74 22   je 68 
  46:   0f b6 0amovzbl (%edx),%ecx
  49:   83 f8 01cmp$0x1,%eax
  4c:   0f 95 c3setne  %bl
  4f:   89 df   mov%ebx,%edi
  51:   84 c9   test   %cl,%cl
  53:   0f 95 c3setne  %bl
  56:   89 de   mov%ebx,%esi
  58:   21 f7   and%esi,%edi
  5a:   eb d3   jmp2f 
  5c:   8d 74 26 00 lea0x0(%esi,%eiz,1),%esi
  60:   b0 01   mov$0x1,%al
  62:   5b  pop%ebx
  63:   5e  pop%esi
  64:   5f  pop%edi
  65:   5d  pop%ebp
  66:   c3  ret
  67:   90  nop
  68:   b8 01 00 00 00  mov$0x1,%eax
  6d:   5b  pop%ebx
  6e:   5e  pop%esi
  6f:   5f  pop%edi
  70:   5d  pop%ebp
  71:   c3  ret

But with command line option "-O2 -fno-tree-pre", there are only 12
instructions generated, and the code would be very clean like below,

 :
   0:   55  push   %ebp
   1:   31 c0   xor%eax,%eax
   3:   89 e5   mov%esp,%ebp
   5:   8b 55 08mov0x8(%ebp),%edx
   8:   80 3a 00cmpb   $0x0,(%edx)
   b:   74 0e   je 1b 
   d:   80 7a 01 00 cmpb   $0x0,0x1(%edx)
  11:   b0 02   mov$0x2,%al
  13:   ba 01 00 00 00  mov$0x1,%edx
  18:   0f 45 c2cmovne %edx,%eax
  1b:   5d  pop%ebp
  1c:   c3  ret

Do you have any idea about this?

Thanks,
-Jiangning





RE: [RFC] Add middle end hook for stack red zone size

2011-08-01 Thread Jiangning Liu
The answer is ARM can. However, if you look into the bugs PR30282 and 
PR38644, PR44199, you may find in history, there are several different cases

in different ports reporting the similar failures, covering x86, PowerPC and

ARM. You are right, they were all fixed in back-ends in the past, but we
should 
fix the bug in a general way to make GCC infrastructure stronger, rather 
than fixing the problem target-by-target and case-by-case! If you further 
look into the back-end fixes in x86 and PowerPC, you may find they looks 
quite similar in back-ends. 

Thanks,
-Jiangning

> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
> On Behalf Of Jakub Jelinek
> Sent: Monday, August 01, 2011 5:12 PM
> To: Jiangning Liu
> Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org;
> vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana
> Radhakrishnan; 'Ramana Radhakrishnan'
> Subject: Re: [RFC] Add middle end hook for stack red zone size
> 
> On Mon, Aug 01, 2011 at 11:44:04AM +0800, Jiangning Liu wrote:
> > It's quite necessary to solve the general problem in middle-end rather
than in
> back-end.
> 
> That's what we disagree on.  All back-ends but ARM are able to handle it
> right, why can't ARM too?  The ABI rules for stack handling in the
epilogues
> are simply too diverse and complex to be handled easily in the scheduler.
> 
>   Jakub






RE: [RFC] Add middle end hook for stack red zone size

2011-07-31 Thread Jiangning Liu
Joern,

Thanks for your valuable feedback! This is only a RFC, and I will send out 
formal patch along with ChangLog later on. 

Basically, my patch is only to add new dependence in scheduler, and it only 
blocks some instruction movements, so it is NO RISK to compiler correctness. 
For whatever stack pointer changes you gave in different scenarios, the current 
code base should already work. My patch intends neither to replace old 
dependences, nor maximize the scheduler capability due to the existence of red 
zone in stack. It is only to block the memory access moving over stack pointer 
adjustment if distance is beyond red zone size, which is an OS requirement due 
to interruption existence. 

Stack adjustment in epilogue is a very general usage in stack frame. It's quite 
necessary to solve the general problem in middle-end rather than in back-end. 
Also, that old patch you attached is to solve the data dependence between two 
memory accesses, but stack pointer doesn't really have data dependence with 
memory access without using stack pointer, so they have different stories. 
Alternative solution of without adding blunt scheduling barrier is we insert an 
independent pass before scheduler to create RTL barrier by using the same 
interface stack_red_zone_size, but it would really be an over-design, if we add 
a new pass only for this *small* functionality.

In my patch, *abs* of offset is being used, so you are right that it's possible 
to get false positive to be too conservative, but there won't exist false 
negative, because my code would only add new dependences. 

Since the compilation is based on function, it would be OK if red zone size 
varies due to different ABI. Could you please tell me exactly on what system 
being supported by GCC red zone size can be different for incoming and 
outgoing? And also how scheduler guarantee the correctness in current code 
base? Anyway, I don't think my patch will break the original solution.

Thanks,
-Jiangning

> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
> On Behalf Of Joern Rennecke
> Sent: Tuesday, July 26, 2011 10:33 AM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; vmaka...@redhat.com;
> dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana
> Radhakrishnan'
> Subject: RE: [RFC] Add middle end hook for stack red zone size
> 
> Quoting Jiangning Liu :
> 
> > Hi,
> >
> > One month ago, I sent out this RFC to *gcc-patches* mail list, but I
> >  didn't receive any response yet. So I'm forwarding this mail to
> > *gcc* mail list. Can anybody here really give feedback to me?
> 
> Well, I couldn't approve any patch, but I can point out some issues with your 
> patch.
> 
> First, it's missing a ChangeLog, and you don't state how you have tested it.
> And regarding the code in sched_analyze_1, I think you'll get false positives 
> with
> alloca, and false negatives when registers are involved to compute offsets or 
> to
> restore the stack pointer from.
> 
> FWIW, I think generally blunt scheduling barriers should be avoided, and 
> instead
> the dependencies made visible to the scheduler.
> E.g., I've been working with another architecture with a redzone, where at 
> -fno-
> omit-frame-pointer, the prologue can put pretend_args into the redzone, then 
> after
> stack adjustment and frame allocation, these arguments are accessed via the 
> frame
> pointer.
> 
> With the attached patchlet, alias analysis works for this situation too, so 
> no blunt
> scheduling block is required.
> 
> Likewise, with stack adjustments, they should not affect scheduling in 
> general, but
> be considered to clobber the part of the frame that is being exposed to 
> interrupt
> writes either before or after the adjustment.
> At the moment, each port that wants to have such selective scheduling 
> blockages
> has to define a stack_adjust pattern with a memory clobber in a parallel, 
> with a
> memref that shows the high water mark of possible interrupt stack writes.
> Prima facia it would seem convenient if you only had to tell the middle-end 
> about
> the redzone size, and it could figure out the implicit clobbers when the 
> stack is
> changed.  However, when a big stack adjustment is being made, or the stack is
> realigned, or restored from the frame pointer / another register where it was
> saved due to realignment, the adjustment is not so obvious.  I'm not sure if 
> you can
> actually create an robust interface that's simpler to use than putting the 
> right
> memory clobber in the stack adjust pattern.  Note also that the redzone size 
> can
> vary from function to function depending on ABI-altering attributes, in 
> particular
> for interrupt functions, which can also have different incoming and outgoing
> redzone sizes.  Plus, you can have an NMI / reset handler which can use the 
> stack
> like an ordinary address register.





RE: [RFC] Add middle end hook for stack red zone size

2011-07-25 Thread Jiangning Liu
Hi,

One month ago, I sent out this RFC to *gcc-patches* mail list, but I didn't 
receive any response yet. So I'm forwarding this mail to *gcc* mail list. Can 
anybody here really give feedback to me?

Appreciate your help in advance!

-Jiangning

-Original Message-
From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org] 
Sent: Tuesday, July 19, 2011 6:18 PM
To: Jiangning Liu
Cc: gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard 
Henderson; Ramana Radhakrishnan
Subject: Re: [RFC] Add middle end hook for stack red zone size

2011/7/19 Jiangning Liu :
>
> I see a lot of feedbacks on other posts, but mine is still with ZERO
> response in the past 3 weeks, so I'm wondering if I made any mistake in my
> process? Who can help me?

It would be worth CC'ing the other relevant target maintainers as well
to get some feedback since the patch touches ARM, x86 and Powerpc.
I've added the maintainers for i386 and PPC to the CC list using the
email addresses from the MAINTAINERS file.

Thanks,
Ramana

>
> Thanks,
> -Jiangning
>
> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
> On Behalf Of Jiangning Liu
> Sent: Tuesday, July 05, 2011 8:32 AM
> To: gcc-patc...@gcc.gnu.org; rgue...@gcc.gnu.org
> Subject: RE: [RFC] Add middle end hook for stack red zone size
>
> PING...
>
> I just merged with the latest code base and generated new patch as attached.
>
> Thanks,
> -Jiangning
>
>> -Original Message-
>> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
>> ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
>> Sent: 2011年6月28日 4:38 PM
>> To: gcc-patc...@gcc.gnu.org
>> Subject: [RFC] Add middle end hook for stack red zone size
>>
>> This patch is to fix PR38644, which is a bug with long history about
>> stack red zone access, and PR30282 is correlated.
>>
>> Originally red zone concept is not exposed to middle-end, and back-end
>> uses special logic to add extra memory barrier RTL and help the
>> correct dependence in middle-end. This way different back-ends must
>> handle red zone problem by themselves. For example, X86 target
>> introduced function
>> ix86_using_red_zone() to judge red zone access, while POWER introduced
>> offset_below_red_zone_p() to judge it. Note that they have different
>> semantics, but the logic in caller sites of back-end uses them to
>> decide whether adding memory barrier RTL or not. If back-end
>> incorrectly handles this, bug would be introduced.
>>
>> Therefore, the correct method should be middle-end handles red zone
>> related things to avoid the burden in different back-ends. To be
>> specific for PR38644, this middle-end problem causes incorrect
>> behavior for ARM target.
>> This patch exposes red zone concept to middle-end by introducing a
>> middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE defined in
>> target.def, and by default its value is 0. Back-end may redefine this
>> function to provide concrete red zone size according to specific ABI
>> requirements.
>>
>> In middle end, scheduling dependence is modified by using this hook
>> plus checking stack frame pointer adjustment instruction to decide
>> whether memory references need to be all flushed out or not. In
>> theory, if TARGET_STACK_RED_ZONE_SIZE is defined correctly, back-end
>> would not be required to specially handle this scheduling dependence
>> issue by introducing extra memory barrier RTL.
>>
>> In back-end, the following changes are made to define the hook,
>> 1) For X86, TARGET_STACK_RED_ZONE_SIZE is redefined to be
>> ix86_stack_red_zone_size() in i386.c, which is an newly introduced
>> function.
>> 2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to be
>> rs6000_stack_red_zone_size() in rs6000.c, which is also a newly
>> defined function.
>> 3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to be
>> default_stack_red_zone_size in targhooks.c, and this function returns
>> 0, which means ARM eabi and others don't support red zone access at all.
>>
>> In summary, the relationship between ABI and red zone access is like
>> below,
>>
>> -
>> |   ARCH   |  ARM  |   X86 |POWER  | others |
>> |--|---|---|---||
>> |ABI   | EABI  | MS_64 | other |   AIX  |  V4  ||
>> |--|---|---|---||--||
>> | RED ZONE |  No   |  YES  |  No   |  YES   |  No  |   No   |
>> |--|---|---|---||--||
>> | RED ZONE SIZE|   0   |  128  |   0   |220/288 |   0  |0   |
>> -
>>
>> Thanks,
>> -Jiangning
>
>
>
>