An issue on loop optimization/vectorization
For the case below, the code generated by “gcc -O3” is very ugly. char g_d[1024], g_s1[1024], g_s2[1024]; void test_loop(void) { char *d = g_d, *s1 = g_s1, *s2 = g_s2; for( int y = 0; y < 128; y++ ) { for( int x = 0; x < 16; x++ ) d[x] = s1[x] + s2[x]; d += 16; } } If we change “for( int x = 0; x < 16; x++ )” to be like “for( int x = 0; x < 32; x++ )”, very beautiful vectorization code would be generated, test_loop: .LFB0: .cfi_startproc adrpx2, g_s1 adrpx3, g_s2 add x2, x2, :lo12:g_s1 add x3, x3, :lo12:g_s2 adrpx0, g_d adrpx1, g_d+2048 add x0, x0, :lo12:g_d add x1, x1, :lo12:g_d+2048 ldp q1, q2, [x2] ldp q3, q0, [x3] add v1.16b, v1.16b, v3.16b add v0.16b, v0.16b, v2.16b .p2align 3,,7 .L2: str q1, [x0] str q0, [x0, 16]! cmp x0, x1 bne .L2 ret The code generated for " for( int x = 0; x < 8; x++ )" is also very ugly. It looks gcc has potential bugs on loop vectorization. Any idea? Thanks, -Jiangning
RE: A question about loop ivopt
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Tuesday, May 22, 2012 6:36 PM > To: Jiangning Liu > Cc: Zdenek Dvorak; Jiangning Liu; gcc@gcc.gnu.org > Subject: Re: A question about loop ivopt > > On Tue, May 22, 2012 at 11:19 AM, Jiangning Liu > wrote: > > > > > >> -Original Message- > >> From: Richard Guenther [mailto:richard.guent...@gmail.com] > >> Sent: Tuesday, May 15, 2012 10:17 PM > >> To: Zdenek Dvorak > >> Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu > >> Subject: Re: A question about loop ivopt > >> > >> On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak > >> wrote: > >> > Hi, > >> > > >> >> > > > Why can't we replace function force_expr_to_var_cost > directly > >> with > >> >> function > >> >> > > > computation_cost in tree-ssa-loop-ivopt.c? > >> >> > > > > >> >> > > > Actually I think it is inaccurate for the current recursive > >> algorithm > >> >> in > >> >> > > > force_expr_to_var_cost to estimate expr cost. Instead > >> >> computation_cost can > >> >> > > > count some back-end factors and make the estimation more > >> accurate. > >> >> > > > > >> >> > > > For example, using computation_cost, we may check whether > >> back-ends > >> >> can tie > >> >> > > > some modes transformation expressed by SUBREG or not. If we > >> use > >> >> > > > force_expr_to_var_cost, some more computations around type > >> >> > > > promotion/demotion would increase the cost estimated. > >> >> > > > > >> >> > > > Looking at the algorithm in force_expr_to_var_cost, it is > just > >> to > >> >> analyze > >> >> > > > the operator in the expression and give estimation. Should > it > >> be the > >> >> same as > >> >> > > > expanding EXPR to RTX and give estimation like in > >> computation_cost? > >> >> > > > > >> >> > > > Any thoughts? > >> >> > > > >> >> > > I suppose Zdenek may remember. > >> >> > > >> >> > computation_cost actually expands the expression to RTL. Since > >> cost > >> >> estimates > >> >> > are computed a lot in ivopts, using it directly would require a > >> huge > >> >> amount of memory, > >> >> > >> >> Zdenek, > >> >> > >> >> Do you know how huge is it? Any data on this? For GCC, is this > >> "huge" > >> >> memory consumption is critical enough, and there aren't any other > >> else > >> >> consuming even more memory? > >> > > >> > no, not really (I haven't worked on this for a few years now), > >> although > >> > of course I did some measurements when ivopts were created. Feel > >> free > >> > to experiment with that, if it turned out that the memory > consumption > >> > and extra time spent by it is negligible, it would be a nice > cleanup. > >> > >> Well, I don't think we should feed arbitrary expressions to expand > at > >> IVOPTs time. What probably matters most is address costs, no? > >> At least that is where expand probably makes the most difference. > >> So why not improve force_expr_to_var_cost instead? > > > > OK, yes, the thing that matter most is just address cost, so I can > improve > > force_expr_to_var_cost. > > > > Would it sound OK if I expose MODES_TIEABLE_P to middle-end by > defining a > > new target hook? I need this function to strip some operations and > make the > > cost estimate more accurate. If I don't expand to RTL, I would need a > method > > to check the modes conversion in middle end, anyway. Any idea? > > You are already in the middle-end and thus can use MODES_TIABLE_P > directly. Modes are also available on gimple variables via > DECL/TYPE_MODE. Richard, But MODES_TIEABLE_P is a macro hook and isn't exposed to TREE level, so I would have to modify xxx-protos.h for all back-ends. An alternative way is I define a new function hook. This way I needn't to change all back-ends, but support several back-ends required first. Which solution is usually preferred? Thanks, -Jiangning > > Richard. > > > Thanks, > > -Jiangning > > > >> > >> Richard. > >> > >> > >> > Zdenek > > > > > > > >
RE: A question about loop ivopt
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Tuesday, May 15, 2012 10:17 PM > To: Zdenek Dvorak > Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu > Subject: Re: A question about loop ivopt > > On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak > wrote: > > Hi, > > > >> > > > Why can't we replace function force_expr_to_var_cost directly > with > >> function > >> > > > computation_cost in tree-ssa-loop-ivopt.c? > >> > > > > >> > > > Actually I think it is inaccurate for the current recursive > algorithm > >> in > >> > > > force_expr_to_var_cost to estimate expr cost. Instead > >> computation_cost can > >> > > > count some back-end factors and make the estimation more > accurate. > >> > > > > >> > > > For example, using computation_cost, we may check whether > back-ends > >> can tie > >> > > > some modes transformation expressed by SUBREG or not. If we > use > >> > > > force_expr_to_var_cost, some more computations around type > >> > > > promotion/demotion would increase the cost estimated. > >> > > > > >> > > > Looking at the algorithm in force_expr_to_var_cost, it is just > to > >> analyze > >> > > > the operator in the expression and give estimation. Should it > be the > >> same as > >> > > > expanding EXPR to RTX and give estimation like in > computation_cost? > >> > > > > >> > > > Any thoughts? > >> > > > >> > > I suppose Zdenek may remember. > >> > > >> > computation_cost actually expands the expression to RTL. Since > cost > >> estimates > >> > are computed a lot in ivopts, using it directly would require a > huge > >> amount of memory, > >> > >> Zdenek, > >> > >> Do you know how huge is it? Any data on this? For GCC, is this > "huge" > >> memory consumption is critical enough, and there aren't any other > else > >> consuming even more memory? > > > > no, not really (I haven't worked on this for a few years now), > although > > of course I did some measurements when ivopts were created. Feel > free > > to experiment with that, if it turned out that the memory consumption > > and extra time spent by it is negligible, it would be a nice cleanup. > > Well, I don't think we should feed arbitrary expressions to expand at > IVOPTs time. What probably matters most is address costs, no? > At least that is where expand probably makes the most difference. > So why not improve force_expr_to_var_cost instead? OK, yes, the thing that matter most is just address cost, so I can improve force_expr_to_var_cost. Would it sound OK if I expose MODES_TIEABLE_P to middle-end by defining a new target hook? I need this function to strip some operations and make the cost estimate more accurate. If I don't expand to RTL, I would need a method to check the modes conversion in middle end, anyway. Any idea? Thanks, -Jiangning > > Richard. > > > > Zdenek
A question about loop ivopt
Hi, Why can't we replace function force_expr_to_var_cost directly with function computation_cost in tree-ssa-loop-ivopt.c? Actually I think it is inaccurate for the current recursive algorithm in force_expr_to_var_cost to estimate expr cost. Instead computation_cost can count some back-end factors and make the estimation more accurate. For example, using computation_cost, we may check whether back-ends can tie some modes transformation expressed by SUBREG or not. If we use force_expr_to_var_cost, some more computations around type promotion/demotion would increase the cost estimated. Looking at the algorithm in force_expr_to_var_cost, it is just to analyze the operator in the expression and give estimation. Should it be the same as expanding EXPR to RTX and give estimation like in computation_cost? Any thoughts? Thanks, -Jiangning
Why can't copy renaming capture this assignment?
Hi, For this small case, char garr[100]; void f(void) { unsigned short h, s; s = 20; for (h = 1; h < (s-1); h++) { garr[h] = 0; } } After copyrename3, we have the following dump, f () { short unsigned int h; int D.4066; : D.4066_14 = 1; if (D.4066_14 <= 18) goto ; else goto ; : # h_15 = PHI # D.4066_16 = PHI garr[D.4066_16] = 0; h_8 = h_15 + 1; D.4066_4 = (int) h_8; if (D.4066_4 <= 18) goto ; else goto ; : return; } copy renaming fails to capture the assignment statement "D.4066_4 = (int) h_8;" to trigger renaming partition coalesce. I find gimple_assign_single_p invoked by gimple_assign_ssa_name_copy_p always returns false, because for this statement " gs->gsbase.subcode" is NOP_EXPR rather than GIMPLE_SINGLE_RHS. Should subcode be correctly initialized anywhere to fix this problem? BTW, my expectation after copy renaming is like below, f () { int D.4679; : D.4679_7 = 1; if (D.4679_7 != 19) goto ; else goto ; : # D.4679_15 = PHI # D.4679_17 = PHI garr[D.4679_15] = 0; D.4679_14 = D.4679_17 + 1; D.4679_4 = D.4679_14; if (D.4679_4 != 19) goto ; else goto ; : return; } and then PRE can finally remove that redundancy for symbol D. away. Thanks, -Jiangning
RE: A question about redundant PHI expression stmt
> -Original Message- > From: Jiangning Liu > Sent: Tuesday, February 28, 2012 4:07 PM > To: Jiangning Liu; 'William J. Schmidt' > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org > Subject: RE: A question about redundant PHI expression stmt > > > > > -Original Message- > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of > > Jiangning Liu > > Sent: Tuesday, February 28, 2012 11:19 AM > > To: 'William J. Schmidt' > > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org > > Subject: RE: A question about redundant PHI expression stmt > > > > > > > > > -Original Message- > > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf > > Of > > > William J. Schmidt > > > Sent: Monday, February 27, 2012 11:32 PM > > > To: Jiangning Liu > > > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org > > > Subject: Re: A question about redundant PHI expression stmt > > > > > > On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote: > > > > Hi, > > > > > > > > For the small case below, there are some redundant PHI expression > > > stmt > > > > generated, and finally cause back-end generates redundant copy > > > instructions > > > > due to some reasons around IRA. > > > > > > > > int *l, *r, *g; > > > > void test_func(int n) > > > > { > > > > int i; > > > > static int j; > > > > static int pos, direction, direction_pre; > > > > > > > > pos = 0; > > > > direction = 1; > > > > > > > > for ( i = 0; i < n; i++ ) > > > > { > > > > direction_pre = direction; > > > > > > > > for ( j = 0; j <= 400; j++ ) > > > > { > > > > > > > > if ( g[pos] == 200 ) > > > > break; > > > > > > > > if ( direction == 0 ) > > > > pos = l[pos]; > > > > else > > > > pos = r[pos]; > > > > > > > > if ( pos == -1 ) > > > > { > > > > if ( direction_pre != direction ) > > > > break; > > > > > > > > pos = 0; > > > > direction = !direction; > > > > } > > > > > > > > } > > > > > > > > f(g[pos]); > > > > } > > > > } > > > > > > > > I know PR39976 has something to do with this case, and check-in > > > r182140 > > > > caused big degradation on the real benchmark, but I'm still > > confusing > > > around > > > > this issue. > > > > > > > > The procedure is like this, > > > > > > > > Loop store motion generates code below, > > > > > > > > : > > > > D.4084_17 = l.4_13 + D.4077_70; > > > > pos.5_18 = *D.4084_17; > > > > pos_lsm.34_103 = pos.5_18; > > > > goto ; > > > > > > > > : > > > > D.4088_23 = r.6_19 + D.4077_70; > > > > pos.7_24 = *D.4088_23; > > > > pos_lsm.34_104 = pos.7_24; > > > > > > > > : > > > > # prephitmp.29_89 = PHI > > > > # pos_lsm.34_53 = PHI > > > > pos.2_25 = prephitmp.29_89; > > > > if (pos.2_25 == -1) > > > > goto ; > > > > else > > > > goto ; > > > > > > > > And then, copy propagation transforms block 8 it into > > > > > > > > : > > > > # prephitmp.29_89 = PHI > > > > # pos_lsm.34_53 = PHI > > > > ... > > > > > > > > And then, these two duplicated PHI stmts have been kept until > > expand > > > pass. > > > > > > > > In dom2, a stmt like below > > > > > > > > # pos_lsm.34_54 = PHI > > > > > > > > is transformed into > > > > > > &
RE: A question about redundant PHI expression stmt
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Jiangning Liu > Sent: Tuesday, February 28, 2012 11:19 AM > To: 'William J. Schmidt' > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org > Subject: RE: A question about redundant PHI expression stmt > > > > > -Original Message- > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of > > William J. Schmidt > > Sent: Monday, February 27, 2012 11:32 PM > > To: Jiangning Liu > > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org > > Subject: Re: A question about redundant PHI expression stmt > > > > On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote: > > > Hi, > > > > > > For the small case below, there are some redundant PHI expression > > stmt > > > generated, and finally cause back-end generates redundant copy > > instructions > > > due to some reasons around IRA. > > > > > > int *l, *r, *g; > > > void test_func(int n) > > > { > > > int i; > > > static int j; > > > static int pos, direction, direction_pre; > > > > > > pos = 0; > > > direction = 1; > > > > > > for ( i = 0; i < n; i++ ) > > > { > > > direction_pre = direction; > > > > > > for ( j = 0; j <= 400; j++ ) > > > { > > > > > > if ( g[pos] == 200 ) > > > break; > > > > > > if ( direction == 0 ) > > > pos = l[pos]; > > > else > > > pos = r[pos]; > > > > > > if ( pos == -1 ) > > > { > > > if ( direction_pre != direction ) > > > break; > > > > > > pos = 0; > > > direction = !direction; > > > } > > > > > > } > > > > > > f(g[pos]); > > > } > > > } > > > > > > I know PR39976 has something to do with this case, and check-in > > r182140 > > > caused big degradation on the real benchmark, but I'm still > confusing > > around > > > this issue. > > > > > > The procedure is like this, > > > > > > Loop store motion generates code below, > > > > > > : > > > D.4084_17 = l.4_13 + D.4077_70; > > > pos.5_18 = *D.4084_17; > > > pos_lsm.34_103 = pos.5_18; > > > goto ; > > > > > > : > > > D.4088_23 = r.6_19 + D.4077_70; > > > pos.7_24 = *D.4088_23; > > > pos_lsm.34_104 = pos.7_24; > > > > > > : > > > # prephitmp.29_89 = PHI > > > # pos_lsm.34_53 = PHI > > > pos.2_25 = prephitmp.29_89; > > > if (pos.2_25 == -1) > > > goto ; > > > else > > > goto ; > > > > > > And then, copy propagation transforms block 8 it into > > > > > > : > > > # prephitmp.29_89 = PHI > > > # pos_lsm.34_53 = PHI > > > ... > > > > > > And then, these two duplicated PHI stmts have been kept until > expand > > pass. > > > > > > In dom2, a stmt like below > > > > > > # pos_lsm.34_54 = PHI > > > > > > is transformed into > > > > > > # pos_lsm.34_54 = PHI > > > > > > just because they are equal. > > > > > > Unfortunately, this transformation changed back-end behavior to > > generate > > > redundant copy instructions and hurt performance. This case is from > a > > real > > > benchmark and hurt performance a lot. > > > > > > Does the fix in r182140 intend to totally clean up this kind of > > redundancy? > > > Where should we get it fixed? > > > > > > > Hi, sorry not to have responded sooner -- I just now got some time to > > look at this. > > > > I compiled your code with -O3 for revisions 182139 and 182140 of GCC > > trunk, and found no difference between the code produced by the > middle > > end for the two versions. So something else has apparently come > along > > since then that helped produce the problematic code generation for > you. > > Either that or I need to
RE: A question about redundant PHI expression stmt
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > William J. Schmidt > Sent: Monday, February 27, 2012 11:32 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org > Subject: Re: A question about redundant PHI expression stmt > > On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote: > > Hi, > > > > For the small case below, there are some redundant PHI expression > stmt > > generated, and finally cause back-end generates redundant copy > instructions > > due to some reasons around IRA. > > > > int *l, *r, *g; > > void test_func(int n) > > { > > int i; > > static int j; > > static int pos, direction, direction_pre; > > > > pos = 0; > > direction = 1; > > > > for ( i = 0; i < n; i++ ) > > { > > direction_pre = direction; > > > > for ( j = 0; j <= 400; j++ ) > > { > > > > if ( g[pos] == 200 ) > > break; > > > > if ( direction == 0 ) > > pos = l[pos]; > > else > > pos = r[pos]; > > > > if ( pos == -1 ) > > { > > if ( direction_pre != direction ) > > break; > > > > pos = 0; > > direction = !direction; > > } > > > > } > > > > f(g[pos]); > > } > > } > > > > I know PR39976 has something to do with this case, and check-in > r182140 > > caused big degradation on the real benchmark, but I'm still confusing > around > > this issue. > > > > The procedure is like this, > > > > Loop store motion generates code below, > > > > : > > D.4084_17 = l.4_13 + D.4077_70; > > pos.5_18 = *D.4084_17; > > pos_lsm.34_103 = pos.5_18; > > goto ; > > > > : > > D.4088_23 = r.6_19 + D.4077_70; > > pos.7_24 = *D.4088_23; > > pos_lsm.34_104 = pos.7_24; > > > > : > > # prephitmp.29_89 = PHI > > # pos_lsm.34_53 = PHI > > pos.2_25 = prephitmp.29_89; > > if (pos.2_25 == -1) > > goto ; > > else > > goto ; > > > > And then, copy propagation transforms block 8 it into > > > > : > > # prephitmp.29_89 = PHI > > # pos_lsm.34_53 = PHI > > ... > > > > And then, these two duplicated PHI stmts have been kept until expand > pass. > > > > In dom2, a stmt like below > > > > # pos_lsm.34_54 = PHI > > > > is transformed into > > > > # pos_lsm.34_54 = PHI > > > > just because they are equal. > > > > Unfortunately, this transformation changed back-end behavior to > generate > > redundant copy instructions and hurt performance. This case is from a > real > > benchmark and hurt performance a lot. > > > > Does the fix in r182140 intend to totally clean up this kind of > redundancy? > > Where should we get it fixed? > > > > Hi, sorry not to have responded sooner -- I just now got some time to > look at this. > > I compiled your code with -O3 for revisions 182139 and 182140 of GCC > trunk, and found no difference between the code produced by the middle > end for the two versions. So something else has apparently come along > since then that helped produce the problematic code generation for you. > Either that or I need to use different compile flags; you didn't > specify > what you used. > > The fix in r182140 does just what you saw in dom2: identifies > duplicate > PHIs in the same block and ensures only one of them is used. This > actually avoids inserting extra blocks during expand in certain loop > cases. I am not sure why you are getting redundant copies as a result, > but it sounds from your comments like IRA didn't coalesce a register > copy or something like that. You may want to bisect revisions on the > trunk to see where your bad code generation started to occur to get a > better handle on what happened. > > As Richard said, the dom pass is likely to be removed someday, whenever > someone can get around to it. My redundant-phi band-aid in dom would > go > away then as well, but presumably an extra pass of PRE would replace it, > and redundant PHIs would still be removed. > Bill, Thanks a lot for your explanation! The bug could be exposed with -O2 if you apply the check-in r184435 made by Richard. BTW, at present is there anybody working on the NEW pass replacing dom? If no, in short term, it would be good to still get it fixed just because it is hurting benchmark. If Yes, may I know what that NEW pass will be focusing on? Thanks, -Jiangning > Thanks, > Bill > > > Thanks, > > -Jiangning > > > > > > >
A question about redundant PHI expression stmt
Hi, For the small case below, there are some redundant PHI expression stmt generated, and finally cause back-end generates redundant copy instructions due to some reasons around IRA. int *l, *r, *g; void test_func(int n) { int i; static int j; static int pos, direction, direction_pre; pos = 0; direction = 1; for ( i = 0; i < n; i++ ) { direction_pre = direction; for ( j = 0; j <= 400; j++ ) { if ( g[pos] == 200 ) break; if ( direction == 0 ) pos = l[pos]; else pos = r[pos]; if ( pos == -1 ) { if ( direction_pre != direction ) break; pos = 0; direction = !direction; } } f(g[pos]); } } I know PR39976 has something to do with this case, and check-in r182140 caused big degradation on the real benchmark, but I'm still confusing around this issue. The procedure is like this, Loop store motion generates code below, : D.4084_17 = l.4_13 + D.4077_70; pos.5_18 = *D.4084_17; pos_lsm.34_103 = pos.5_18; goto ; : D.4088_23 = r.6_19 + D.4077_70; pos.7_24 = *D.4088_23; pos_lsm.34_104 = pos.7_24; : # prephitmp.29_89 = PHI # pos_lsm.34_53 = PHI pos.2_25 = prephitmp.29_89; if (pos.2_25 == -1) goto ; else goto ; And then, copy propagation transforms block 8 it into : # prephitmp.29_89 = PHI # pos_lsm.34_53 = PHI ... And then, these two duplicated PHI stmts have been kept until expand pass. In dom2, a stmt like below # pos_lsm.34_54 = PHI is transformed into # pos_lsm.34_54 = PHI just because they are equal. Unfortunately, this transformation changed back-end behavior to generate redundant copy instructions and hurt performance. This case is from a real benchmark and hurt performance a lot. Does the fix in r182140 intend to totally clean up this kind of redundancy? Where should we get it fixed? Thanks, -Jiangning
RE: A problem about loop store motion
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Richard Guenther > Sent: Tuesday, February 21, 2012 6:19 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: A problem about loop store motion > > On Tue, Feb 21, 2012 at 10:57 AM, Jiangning Liu > wrote: > > > > > >> -Original Message- > >> From: Richard Guenther [mailto:richard.guent...@gmail.com] > >> Sent: Tuesday, February 21, 2012 5:40 PM > >> To: Jiangning Liu > >> Cc: gcc@gcc.gnu.org > >> Subject: Re: A problem about loop store motion > >> > >> On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu > > >> wrote: > >> >> The MEM form is more "canonical", so the loop SM machinery to > detect > >> >> equality should be adjusted accordingly. Alternatively you can > >> teach > >> >> PRE insertion to strip off the MEM if possible (though > >> >> fold_stmt_inplace should > >> >> arelady do this if possible). > >> > > >> > Richard, > >> > > >> > Thank you! You are right. I noticed on latest trunk the problem in > >> PRE was > >> > already fixed by invoking fold_stmt_inplace. > >> > > >> > Unfortunately for this small case, the latest trunk code still > can't > >> do SM > >> > for variable pos, because refs_may_alias_p(*D.4074_10, pos) is > true, > >> that > >> > is, pos has alias with l[pos]. > >> > > >> > I think alias analysis should be able to know they don't have > alias > >> with > >> > each other, unless there is an assignment statement like "l=&pos;". > >> > > >> > Can alias analysis fix the problem? > >> > >> The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think > >> its address got taken. 'l' points to any global possibly address- > taken > >> variable. It get's the TREE_ADDRESSABLE flag via PRE insertion > which > >> re-gimplifies the tree it creates which marks the variable as > >> addressable. > >> > >> I'll have a look. > > > > Terrific! :-) Could you please let me know when you have a patch to > fix > > this, because I want to double check the big case, and there might be > other > > hidden problems? > > PR52324, I am testing the following patch (in reality the gimplifier > should not > mark things addressable unless it itself makes them so, but frontends > are > broken, so we do that ... ugh). > Richard, Now trunk works for the big case as well. Thanks a lot for your quick fix. BTW, why can't we fix front-end directly? Thanks, -Jiangning > Index: gcc/gimplify.c > === > --- gcc/gimplify.c (revision 184428) > +++ gcc/gimplify.c (working copy) > @@ -7061,15 +7061,20 @@ gimplify_expr (tree *expr_p, gimple_seq > ret = GS_OK; > break; > } > - ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p, > post_p, > - is_gimple_mem_ref_addr, fb_rvalue); > - if (ret == GS_ERROR) > - break; > + /* Avoid re-gimplifying the address operand if it is already > +in suitable form. */ > + if (!is_gimple_mem_ref_addr (TREE_OPERAND (*expr_p, 0))) > + { > + ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p, > post_p, > + is_gimple_mem_ref_addr, fb_rvalue); > + if (ret == GS_ERROR) > + break; > + } > recalculate_side_effects (*expr_p); > ret = GS_ALL_DONE; > break; > > - /* Constants need not be gimplified. */ > + /* Constants need not be gimplified. */ > case INTEGER_CST: > case REAL_CST: > case FIXED_CST:
RE: A problem about loop store motion
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Tuesday, February 21, 2012 5:40 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: A problem about loop store motion > > On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu > wrote: > >> The MEM form is more "canonical", so the loop SM machinery to detect > >> equality should be adjusted accordingly. Alternatively you can > teach > >> PRE insertion to strip off the MEM if possible (though > >> fold_stmt_inplace should > >> arelady do this if possible). > > > > Richard, > > > > Thank you! You are right. I noticed on latest trunk the problem in > PRE was > > already fixed by invoking fold_stmt_inplace. > > > > Unfortunately for this small case, the latest trunk code still can't > do SM > > for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true, > that > > is, pos has alias with l[pos]. > > > > I think alias analysis should be able to know they don't have alias > with > > each other, unless there is an assignment statement like "l=&pos;". > > > > Can alias analysis fix the problem? > > The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think > its address got taken. 'l' points to any global possibly address-taken > variable. It get's the TREE_ADDRESSABLE flag via PRE insertion which > re-gimplifies the tree it creates which marks the variable as > addressable. > > I'll have a look. Terrific! :-) Could you please let me know when you have a patch to fix this, because I want to double check the big case, and there might be other hidden problems? Thanks, -Jiangning > > Richard. > > > > > Thanks, > > -Jiangning > > > >> > >> Richard. > >> > > > > > >
RE: A problem about loop store motion
> The MEM form is more "canonical", so the loop SM machinery to detect > equality should be adjusted accordingly. Alternatively you can teach > PRE insertion to strip off the MEM if possible (though > fold_stmt_inplace should > arelady do this if possible). Richard, Thank you! You are right. I noticed on latest trunk the problem in PRE was already fixed by invoking fold_stmt_inplace. Unfortunately for this small case, the latest trunk code still can't do SM for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true, that is, pos has alias with l[pos]. I think alias analysis should be able to know they don't have alias with each other, unless there is an assignment statement like "l=&pos;". Can alias analysis fix the problem? Thanks, -Jiangning > > Richard. >
RE: A problem about loop store motion
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Jiangning Liu > Sent: Friday, February 17, 2012 5:53 PM > To: gcc@gcc.gnu.org > Subject: A problem about loop store motion > > Hi, > > For this small test case, > > int *l, *r; > int test_func(void) > { > int i; > int direction; > static int pos; > > pos = 0; > direction = 1; > > for ( i = 0; i <= 400; i++ ) > { > if ( direction == 0 ) > pos = l[pos]; > else > pos = r[pos]; > > if ( pos == -1 ) > { > pos = 0; > direction = !direction; > } > } > return i; > } > > In middle end, I don't see pos is sunk out of loop by loop store motion. > Any > idea? > > The dump after lim is like below, and I expect a SSA symbole xxx_lsm > could > be created with this pass. > > ;; Function test_func (test_func, funcdef_no=0, decl_uid=4057, > cgraph_uid=0) > > > Symbols to be put in SSA form > { .MEM } > Incremental SSA update started at block: 0 > Number of blocks in CFG: 12 > Number of blocks to update: 11 ( 92%) > > > test_func () > { > int pretmp.14; > unsigned int pretmp.13; > int prephitmp.12; > int pretmp.11; > unsigned int pretmp.10; > int pretmp.9; > int D.4088; > static int pos; > int direction; > int i; > _Bool D.4082; > int pos.5; > int * D.4078; > int * r.4; > int pos.3; > int * D.4074; > unsigned int D.4073; > unsigned int pos.2; > int pos.1; > int * l.0; > > : > pos = 0; > l.0_6 = l; > r.4_12 = r; > > : > # i_32 = PHI > # direction_37 = PHI > # prephitmp.12_35 = PHI > if (direction_37 == 0) > goto ; > else > goto ; > > : > pos.1_7 = prephitmp.12_35; > pos.2_8 = (unsigned int) pos.1_7; > D.4073_9 = pos.2_8 * 4; > D.4074_10 = l.0_6 + D.4073_9; > pos.3_11 = *D.4074_10; > pos = pos.3_11; > goto ; > > : > pos.1_13 = prephitmp.12_35; > pos.2_14 = (unsigned int) pos.1_13; > D.4073_15 = pos.2_14 * 4; > D.4078_16 = r.4_12 + D.4073_15; > pos.5_17 = *D.4078_16; > pos = pos.5_17; > > : > # prephitmp.12_31 = PHI > pos.1_18 = prephitmp.12_31; > if (pos.1_18 == -1) > goto ; > else > goto ; > > : > goto ; > > : > pos = 0; > D.4088_36 = direction_37 ^ 1; > direction_20 = D.4088_36 & 1; > > : > # direction_2 = PHI > i_21 = i_32 + 1; > if (i_21 != 401) > goto ; > else > goto ; > > : > pretmp.11_1 = pos; Hi, pos in RHS seems to be transformed to MEM[&pos] by the code in tree-ssa-sccvn.c like below, case VAR_DECL: if (DECL_HARD_REGISTER (ref)) { temp.op0 = ref; break; } /* Fallthru. */ case PARM_DECL: case CONST_DECL: case RESULT_DECL: /* Canonicalize decls to MEM[&decl] which is what we end up with when valueizing MEM[ptr] with ptr = &decl. */ temp.opcode = MEM_REF; temp.op0 = build_int_cst (build_pointer_type (TREE_TYPE (ref)), 0); temp.off = 0; VEC_safe_push (vn_reference_op_s, heap, *result, &temp); temp.opcode = ADDR_EXPR; temp.op0 = build_fold_addr_expr (ref); temp.type = TREE_TYPE (temp.op0); temp.off = -1; break; But the LHS doesn't have this kind of transformation on gimple, So the code below in gather_mem_refs_stmt of tree-ssa-loop-im.c can't find MEM[&pos] is just pos, hash = iterative_hash_expr (*mem, 0); slot = htab_find_slot_with_hash (memory_accesses.refs, *mem, hash, INSERT); Finally pos is set to be dependent on pretmp.11_1 at code below in ref_indep_loop_p_1, and then loop store motion fails to find pos. EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi) { aref = VEC_index (mem_ref_p, memory_accesses.refs_list, i); if (!refs_independent_p (ref, aref)) { ret = false; record_indep_loop (loop, aref, false); break; } } Should we fix this problem by enhancing iterative_hash_expr, or we should use MEM[&pos] to describe pos at all? Any hints? I need to understand why we have to use MEM[&pos] to describe pos? This looks strange to me. BTW, this bug caused significant regression for an important benchmark. Thanks, -Jiangning > goto ; > > : > return 401; > > } > > Thanks, > -Jiangning > > >
A problem about loop store motion
Hi, For this small test case, int *l, *r; int test_func(void) { int i; int direction; static int pos; pos = 0; direction = 1; for ( i = 0; i <= 400; i++ ) { if ( direction == 0 ) pos = l[pos]; else pos = r[pos]; if ( pos == -1 ) { pos = 0; direction = !direction; } } return i; } In middle end, I don't see pos is sunk out of loop by loop store motion. Any idea? The dump after lim is like below, and I expect a SSA symbole xxx_lsm could be created with this pass. ;; Function test_func (test_func, funcdef_no=0, decl_uid=4057, cgraph_uid=0) Symbols to be put in SSA form { .MEM } Incremental SSA update started at block: 0 Number of blocks in CFG: 12 Number of blocks to update: 11 ( 92%) test_func () { int pretmp.14; unsigned int pretmp.13; int prephitmp.12; int pretmp.11; unsigned int pretmp.10; int pretmp.9; int D.4088; static int pos; int direction; int i; _Bool D.4082; int pos.5; int * D.4078; int * r.4; int pos.3; int * D.4074; unsigned int D.4073; unsigned int pos.2; int pos.1; int * l.0; : pos = 0; l.0_6 = l; r.4_12 = r; : # i_32 = PHI # direction_37 = PHI # prephitmp.12_35 = PHI if (direction_37 == 0) goto ; else goto ; : pos.1_7 = prephitmp.12_35; pos.2_8 = (unsigned int) pos.1_7; D.4073_9 = pos.2_8 * 4; D.4074_10 = l.0_6 + D.4073_9; pos.3_11 = *D.4074_10; pos = pos.3_11; goto ; : pos.1_13 = prephitmp.12_35; pos.2_14 = (unsigned int) pos.1_13; D.4073_15 = pos.2_14 * 4; D.4078_16 = r.4_12 + D.4073_15; pos.5_17 = *D.4078_16; pos = pos.5_17; : # prephitmp.12_31 = PHI pos.1_18 = prephitmp.12_31; if (pos.1_18 == -1) goto ; else goto ; : goto ; : pos = 0; D.4088_36 = direction_37 ^ 1; direction_20 = D.4088_36 & 1; : # direction_2 = PHI i_21 = i_32 + 1; if (i_21 != 401) goto ; else goto ; : pretmp.11_1 = pos; goto ; : return 401; } Thanks, -Jiangning
RE: A case exposing code sink issue
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Thursday, January 05, 2012 5:32 PM > To: Jiangning Liu > Cc: Michael Matz; gcc@gcc.gnu.org > Subject: Re: A case exposing code sink issue > > On Thu, Jan 5, 2012 at 9:34 AM, Jiangning Liu > wrote: > >> >> > In final value replacement, expression "&a + D." can be > >> figured > >> >> out, > >> >> > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we > >> should > >> >> > lower > >> >> > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC > >> intends > >> >> to > >> >> > keep > >> >> > &a[i_xxx] until cfgexpand pass. > > > > Richard, > > > > Thanks a lot for your explanation! > > > > Any comments for this question as well? Does GCC intends to keep &a[i] > until > > expand pass? > > It's kept as it suits - it's shorter and easier to analyze (we know the > implicit > multiplication won't overflow) > > > Any special reason? If I change CHREC algorithm, I see ivopt > > would have done this lowering, so we wouldn't see &a[i] in expand > pass. Any > > hurt? > > No, I think changing the CHREC algorithm is ok (I didn't look closely > at > your patch). This is stage1 material though. Actually I didn't send it out at all. :-) And I just send out a RFC here http://gcc.gnu.org/ml/gcc-patches/2012-01/msg00236.html, can you have a look? > > Thanks, > Richard. > > > Thanks, > > -Jiangning > > > > > > > >
RE: A case exposing code sink issue
> >> > In final value replacement, expression "&a + D." can be > figured > >> out, > >> > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we > should > >> > lower > >> > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC > intends > >> to > >> > keep > >> > &a[i_xxx] until cfgexpand pass. Richard, Thanks a lot for your explanation! Any comments for this question as well? Does GCC intends to keep &a[i] until expand pass? Any special reason? If I change CHREC algorithm, I see ivopt would have done this lowering, so we wouldn't see &a[i] in expand pass. Any hurt? Thanks, -Jiangning
RE: A case exposing code sink issue
> -Original Message- > From: Jiangning Liu > Sent: Wednesday, December 28, 2011 5:38 PM > To: Jiangning Liu; 'Richard Guenther' > Cc: Michael Matz; gcc@gcc.gnu.org > Subject: RE: A case exposing code sink issue > > > > > -Original Message- > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of > > Jiangning Liu > > Sent: Tuesday, December 27, 2011 5:10 PM > > To: 'Richard Guenther' > > Cc: Michael Matz; gcc@gcc.gnu.org > > Subject: RE: A case exposing code sink issue > > > > > > > > The job to do this is final value replacement, not sinking (we do > not > > > sink non-invariant expressions - you'd have to translate them > through > > > the loop-closed SSA exit PHI node, certainly doable, patches > > > welcome ;)). > > > > > > > Richard, > > > > In final value replacement, expression "&a + D." can be figured > out, > > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we should > > lower > > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC intends > to > > keep > > &a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC > > algorithm to get it calculated? > > > > Appreciate your kindly help in advance! > > > > Richard, > > Now I have a patch working for the case of step "i++", by directly > modifying > scalar evolution algorithm. the following code would be generated after > SCCP, > l > # i_13 = PHI > a_p.0_4 = &a[i_13]; > MEM[(int *)&a][i_13] = 100; > i_6 = i_13 + 1; > if (i_6 <= 999) > goto ; > else > goto ; > > : > a_p_lsm.5_11 = &MEM[(void *)&a + 3996B]; > a_p = a_p_lsm.5_11; > goto ; > > It looks good, but I still have problem when the case has step "i+=k". > > For this case the value of variable i exiting loop isn't invariant, the > algorithm below in scalar evolution doesn't work on it, > > compute_overall_effect_of_inner_loop() > { > ... > tree nb_iter = number_of_latch_executions (inner_loop); > > if (nb_iter == chrec_dont_know) > return chrec_dont_know; > else > { > tree res; > > /* evolution_fn is the evolution function in LOOP. Get >its value in the nb_iter-th iteration. */ > res = chrec_apply (inner_loop->num, evolution_fn, nb_iter); > > if (chrec_contains_symbols_defined_in_loop (res, loop->num)) > res = instantiate_parameters (loop, res); > > /* Continue the computation until ending on a parent of LOOP. > */ > return compute_overall_effect_of_inner_loop (loop, res); > } > } > > In theory, we can still have the transformation like below even if the > step > is "i+=k", > > # i_13 = PHI > i_14 = i_13, > a_p.0_4 = &a[i_13]; > MEM[(int *)&a][i_13] = 100; > i_6 = i_13 + k_2(D); // i+=k > if (i_6 <= 999) > goto ; > else > goto ; > > : > a_p_lsm.5_11 = &a[i_14]; > a_p = a_p_lsm.5_11; > goto ; > > But I realize this is not a loop closed SSA form at all, because i_14 > is > being used out of the loop. Where could we extend the liverange of > variable > i in GCC infrastructure and finally solve this problem? > It seems many people are still in the happy of the upcoming 2012 New Year. :-) Following my story, I find the following code in tree-ssa-copy.c /* Avoid copy propagation from an inner into an outer loop. Otherwise, this may move loop variant variables outside of their loops and prevent coalescing opportunities. If the value was loop invariant, it will be hoisted by LICM and exposed for copy propagation. ??? The value will be always loop invariant. In loop-closed SSA form do not copy-propagate through PHI nodes in blocks with a loop exit edge predecessor. */ if (current_loops && TREE_CODE (arg_value) == SSA_NAME && (loop_depth_of_name (arg_value) > loop_depth_of_name (lhs) || (loops_state_satisfies_p (LOOP_CLOSED_SSA) && loop_exit_edge_p (e->src->loop_father, e { phi_val.value = lhs; break; } Here http://gcc.gnu.org/ml/gcc-patches/2006-12/msg00066.html, Dan said "The original check was not because of coalescing, but because we would copy prop in-loop variables outside the loop, causing *more* invariantness in nested loops." Can anybody give me a concrete example to explain this statement? Anyway, for my simple case, I don't see bad thing would happen when propagate &a[i] out of the loop. Also, after this propagation, "a_p.0_4 = &a[i_13];" within the loop would be dead code and removed in the passes afterwards. In my opinion, that way the computation would be reduced in the loop. Did I make any mistake? > > Thanks, > > -Jiangning > > > > > > > >
RE: A case exposing code sink issue
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Jiangning Liu > Sent: Tuesday, December 27, 2011 5:10 PM > To: 'Richard Guenther' > Cc: Michael Matz; gcc@gcc.gnu.org > Subject: RE: A case exposing code sink issue > > > > > The job to do this is final value replacement, not sinking (we do not > > sink non-invariant expressions - you'd have to translate them through > > the loop-closed SSA exit PHI node, certainly doable, patches > > welcome ;)). > > > > Richard, > > In final value replacement, expression "&a + D." can be figured out, > while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we should > lower > &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC intends to > keep > &a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC > algorithm to get it calculated? > > Appreciate your kindly help in advance! > Richard, Now I have a patch working for the case of step "i++", by directly modifying scalar evolution algorithm. the following code would be generated after SCCP, l # i_13 = PHI a_p.0_4 = &a[i_13]; MEM[(int *)&a][i_13] = 100; i_6 = i_13 + 1; if (i_6 <= 999) goto ; else goto ; : a_p_lsm.5_11 = &MEM[(void *)&a + 3996B]; a_p = a_p_lsm.5_11; goto ; It looks good, but I still have problem when the case has step "i+=k". For this case the value of variable i exiting loop isn't invariant, the algorithm below in scalar evolution doesn't work on it, compute_overall_effect_of_inner_loop() { ... tree nb_iter = number_of_latch_executions (inner_loop); if (nb_iter == chrec_dont_know) return chrec_dont_know; else { tree res; /* evolution_fn is the evolution function in LOOP. Get its value in the nb_iter-th iteration. */ res = chrec_apply (inner_loop->num, evolution_fn, nb_iter); if (chrec_contains_symbols_defined_in_loop (res, loop->num)) res = instantiate_parameters (loop, res); /* Continue the computation until ending on a parent of LOOP. */ return compute_overall_effect_of_inner_loop (loop, res); } } In theory, we can still have the transformation like below even if the step is "i+=k", # i_13 = PHI i_14 = i_13, a_p.0_4 = &a[i_13]; MEM[(int *)&a][i_13] = 100; i_6 = i_13 + k_2(D); // i+=k if (i_6 <= 999) goto ; else goto ; : a_p_lsm.5_11 = &a[i_14]; a_p = a_p_lsm.5_11; goto ; But I realize this is not a loop closed SSA form at all, because i_14 is being used out of the loop. Where could we extend the liverange of variable i in GCC infrastructure and finally solve this problem? > Thanks, > -Jiangning > > >
RE: A case exposing code sink issue
> > The job to do this is final value replacement, not sinking (we do not > sink non-invariant expressions - you'd have to translate them through > the loop-closed SSA exit PHI node, certainly doable, patches > welcome ;)). > Richard, In final value replacement, expression "&a + D." can be figured out, while "&a[i_xxx]" failed to be CHRECed, so I'm wondering if we should lower &a[i_xxx] to "&a + unitsize(a) * i_xxx" first? It seems GCC intends to keep &a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC algorithm to get it calculated? Appreciate your kindly help in advance! Thanks, -Jiangning
RE: A case exposing code sink issue
> Yes, the number of iterations of the i loop simply is too difficult for > our loop iteration calculator to comprehend: > > for (i=k; i<500; i+=k) > > iterates for roundup((500-k)/k) time. In particular if the step is > non-constant our nr-of-iteration calculator gives up. > I'm trying to give an even smaller case, int a[512] ; int *a_p ; int f(int k) { int i ; for(i=0; i: # i_13 = PHI # ivtmp.10_9 = PHI a_p_lsm.6_4 = &a[i_13]; ivtmp.10_1 = ivtmp.10_9 + 4; D.4085_16 = (void *) ivtmp.10_1; MEM[base: D.4085_16, offset: 0B] = 7; i_6 = i_13 + 1; if (i_6 != k_3(D)) goto ; else goto ; : # a_p_lsm.6_11 = PHI a_p = a_p_lsm.6_11; goto ; Why can't we still sunk &a[i_13] out of loop? For example, I expect to generate the code like below, : # i_13 = PHI # ivtmp.10_9 = PHI i_14 = i_13; ivtmp.10_1 = ivtmp.10_9 + 4; D.4085_16 = (void *) ivtmp.10_1; MEM[base: D.4085_16, offset: 0B] = 7; i_6 = i_13 + 1; if (i_6 != k_3(D)) goto ; else goto ; : # a_p_lsm.6_11 = PHI a_p_lsm.6_4 = &a[i_14]; a_p = a_p_lsm.6_11; goto ; This way the computation of &a[i] would be saved within the loop. Any idea? Thanks, -Jiangning
RE: A case exposing code sink issue
> -Original Message- > From: Michael Matz [mailto:m...@suse.de] > Sent: Monday, November 28, 2011 9:07 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: RE: A case exposing code sink issue > > Hi, > > On Mon, 28 Nov 2011, Jiangning Liu wrote: > > > > > One more question... > > > > > > > > Can " i = i.6_18;" be sinked out of loop, because it doesn't have > > > memory > > > > dependence with others? > > > > > > With current trunk the stores to i, a_p, b_p and k are sunken after > the > > > loop. (There are no aliasing problems because the decls can't > > > conflict). > > > > > > What isn't sunken is the calculation of the &a[D.2248_7] expression. > > > First, the number of iterations of the inner loop can't be > determined > > > by > > > current code (replacing i+=k with e.g. i++ could be handled for > > > instance). > > > > Hi Michael, > > > > Do you know what the essential problem is in the case of loop > iteration > > uncertainty? > > Yes, the number of iterations of the i loop simply is too difficult for > our loop iteration calculator to comprehend: > > for (i=k; i<500; i+=k) > > iterates for roundup((500-k)/k) time. In particular if the step is > non-constant our nr-of-iteration calculator gives up. So do you think this can be improved somewhere? For this case, looking at the result in middle end, "a_p.2_8 = &a[D.2248_7];" should be able to sunken out of loop. That way the computation of &a[D.2248_7] would be saved in loop, although the consequence is the liverange of D.2248_7 is longer and it needs to live out of loop. But anyway the register pressure would be decreased within the loop, and we would less possibly have spill/fill code. This is what I want. I think we can simply use loop induction variable analysis to solve this problem. Do you think so? Thanks, -Jiangning > > > I thought it was still an aliasing problem. > > No. All accesses are resolved to final objects (i.e. no pointers), and > hence can be trivially disambiguated. > > > Ciao, > Michael.
RE: A case exposing code sink issue
> -Original Message- > From: Michael Matz [mailto:m...@suse.de] > Sent: Friday, November 25, 2011 11:23 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: RE: A case exposing code sink issue > > Hi, > > On Thu, 24 Nov 2011, Jiangning Liu wrote: > > > One more question... > > > > Can " i = i.6_18;" be sinked out of loop, because it doesn't have > memory > > dependence with others? > > With current trunk the stores to i, a_p, b_p and k are sunken after the > loop. (There are no aliasing problems because the decls can't > conflict). > > What isn't sunken is the calculation of the &a[D.2248_7] expression. > First, the number of iterations of the inner loop can't be determined > by > current code (replacing i+=k with e.g. i++ could be handled for > instance). Hi Michael, Do you know what the essential problem is in the case of loop iteration uncertainty? I thought it was still an aliasing problem. Thanks, -Jiangning > Then this code could be handled by final value replacement, but isn't > because interpret_rhs_expr doesn't deal with ADDR_EXPR of ARRAY_REFs. > > > Ciao, > Michael.
RE: A case exposing code sink issue
One more question... Can " i = i.6_18;" be sinked out of loop, because it doesn't have memory dependence with others? Thanks, -Jiangning > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Jiangning Liu > Sent: Thursday, November 24, 2011 2:57 PM > To: gcc@gcc.gnu.org > Subject: RE: A case exposing code sink issue > > Sorry, I realize we can't do that optimization because a_p may have > dependence upon other memory accesses like MEM[(int *)&a][D.2248_7]. > > For example, if it happens a_p equals &a_p, that optimization would be > wrong. > > But can alias analysis solve the problem if we can guarantee (i+(1< is > less than the upbound of array a's definition? > > Or is there any GCC command line switch assuming no array bound > overflow? > That way we can do more aggressive optimizations, right? > > Thanks, > -Jiangning > > > -Original Message- > > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of > > Jiangning Liu > > Sent: Thursday, November 24, 2011 12:05 PM > > To: gcc@gcc.gnu.org > > Subject: A case exposing code sink issue > > > > Hi, > > > > For this small test case, > > > > int a[512] ; > > int b[512] ; > > int *a_p ; > > int *b_p ; > > int i ; > > int k ; > > > > int f(void) > > { > > for( k = 1 ; k <= 9 ; k++) > > { > > for( i = k ; i < 512 ; i += k) > > { > > a_p = &a[i + (1< > b_p = &b[i + (1< > *a_p = 7 ; > > *b_p = 7 ; > > } > > } > > } > > > > Before sink pass we have, > > > > f () > > { > > int pretmp.11; > > int k.7; > > int i.6; > > int * b_p.3; > > int * a_p.2; > > int D.2248; > > int i.1; > > int D.2246; > > int k.0; > > > > : > > k = 1; > > > > : > > # k.0_9 = PHI > > i = k.0_9; > > if (k.0_9 <= 511) > > goto ; > > else > > goto ; > > > > : > > Invalid sum of incoming frequencies 900, should be 81 > > goto ; > > > > : > > pretmp.11_19 = 1 << k.0_9; > > > > : > > # i.1_34 = PHI > > D.2246_5 = pretmp.11_19; > > D.2248_7 = i.1_34 + D.2246_5; > > a_p.2_8 = &a[D.2248_7]; > > a_p = a_p.2_8; > > b_p.3_13 = &b[D.2248_7]; > > b_p = b_p.3_13; > > MEM[(int *)&a][D.2248_7] = 7; > > MEM[(int *)&b][D.2248_7] = 7; > > i.6_18 = k.0_9 + i.1_34; > > i = i.6_18; > > if (i.6_18 <= 511) > > goto ; > > else > > goto ; > > > > : > > goto ; > > > > : > > Invalid sum of incoming frequencies 81, should be 900 > > k.7_20 = k.0_9 + 1; > > k = k.7_20; > > if (k.7_20 <= 9) > > goto ; > > else > > goto ; > > > > : > > goto ; > > > > : > > return; > > > > } > > > > Can the following statements be sinked out of loop? I don't see this > > optimization happen in trunk. The consequence is register pressure > > increased > > and a spill/fill occurs in RA. > > > > a_p.2_8 = &a[D.2248_7]; > > a_p = a_p.2_8; > > b_p.3_13 = &b[D.2248_7]; > > b_p = b_p.3_13; > > > > I know the sink would happen in sink pass if a_p and b_p are local > > variables. > > > > If this is the root cause, which optimization pass in GCC take the > role > > to > > sink them out of loop? How should we get it fixed? > > > > Thanks, > > -Jiangning > > > > > > > > > >
RE: A case exposing code sink issue
Sorry, I realize we can't do that optimization because a_p may have dependence upon other memory accesses like MEM[(int *)&a][D.2248_7]. For example, if it happens a_p equals &a_p, that optimization would be wrong. But can alias analysis solve the problem if we can guarantee (i+(1< -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Jiangning Liu > Sent: Thursday, November 24, 2011 12:05 PM > To: gcc@gcc.gnu.org > Subject: A case exposing code sink issue > > Hi, > > For this small test case, > > int a[512] ; > int b[512] ; > int *a_p ; > int *b_p ; > int i ; > int k ; > > int f(void) > { > for( k = 1 ; k <= 9 ; k++) > { > for( i = k ; i < 512 ; i += k) > { > a_p = &a[i + (1< b_p = &b[i + (1< *a_p = 7 ; > *b_p = 7 ; > } > } > } > > Before sink pass we have, > > f () > { > int pretmp.11; > int k.7; > int i.6; > int * b_p.3; > int * a_p.2; > int D.2248; > int i.1; > int D.2246; > int k.0; > > : > k = 1; > > : > # k.0_9 = PHI > i = k.0_9; > if (k.0_9 <= 511) > goto ; > else > goto ; > > : > Invalid sum of incoming frequencies 900, should be 81 > goto ; > > : > pretmp.11_19 = 1 << k.0_9; > > : > # i.1_34 = PHI > D.2246_5 = pretmp.11_19; > D.2248_7 = i.1_34 + D.2246_5; > a_p.2_8 = &a[D.2248_7]; > a_p = a_p.2_8; > b_p.3_13 = &b[D.2248_7]; > b_p = b_p.3_13; > MEM[(int *)&a][D.2248_7] = 7; > MEM[(int *)&b][D.2248_7] = 7; > i.6_18 = k.0_9 + i.1_34; > i = i.6_18; > if (i.6_18 <= 511) > goto ; > else > goto ; > > : > goto ; > > : > Invalid sum of incoming frequencies 81, should be 900 > k.7_20 = k.0_9 + 1; > k = k.7_20; > if (k.7_20 <= 9) > goto ; > else > goto ; > > : > goto ; > > : > return; > > } > > Can the following statements be sinked out of loop? I don't see this > optimization happen in trunk. The consequence is register pressure > increased > and a spill/fill occurs in RA. > > a_p.2_8 = &a[D.2248_7]; > a_p = a_p.2_8; > b_p.3_13 = &b[D.2248_7]; > b_p = b_p.3_13; > > I know the sink would happen in sink pass if a_p and b_p are local > variables. > > If this is the root cause, which optimization pass in GCC take the role > to > sink them out of loop? How should we get it fixed? > > Thanks, > -Jiangning > > >
RE: A case exposing code sink issue
> -Original Message- > From: Andrew Pinski [mailto:pins...@gmail.com] > Sent: Thursday, November 24, 2011 12:15 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: A case exposing code sink issue > > On Wed, Nov 23, 2011 at 8:05 PM, Jiangning Liu > wrote: > > If this is the root cause, which optimization pass in GCC take the > role to > > sink them out of loop? How should we get it fixed? > > lim1 handles the case just fine for me. lim1 is the first loop pass. > > After lim1 I get: > > : > # i.1_34 = PHI > D.2934_5 = pretmp.11_33; > D.2936_7 = i.1_34 + D.2934_5; > a_p.2_8 = &a[D.2936_7]; > a_p_lsm.13_37 = a_p.2_8; > b_p.3_13 = &b[D.2936_7]; > b_p_lsm.14_38 = b_p.3_13; > MEM[(int *)&a][D.2936_7] = 7; > MEM[(int *)&b][D.2936_7] = 7; > i.6_18 = k.0_9 + i.1_34; > i_lsm.12_39 = i.6_18; > if (i.6_18 <= 511) > goto ; > else > goto ; > > : > goto ; > &a[D.2936_7] and &b[D.2936_7] are not loop invariants, so it seems lim1 shouldn't be able to sink them, right? Do I misunderstand this optimization? Thanks, -Jiangning > > > > Thanks, > Andrew Pinski
A case exposing code sink issue
Hi, For this small test case, int a[512] ; int b[512] ; int *a_p ; int *b_p ; int i ; int k ; int f(void) { for( k = 1 ; k <= 9 ; k++) { for( i = k ; i < 512 ; i += k) { a_p = &a[i + (1<: k = 1; : # k.0_9 = PHI i = k.0_9; if (k.0_9 <= 511) goto ; else goto ; : Invalid sum of incoming frequencies 900, should be 81 goto ; : pretmp.11_19 = 1 << k.0_9; : # i.1_34 = PHI D.2246_5 = pretmp.11_19; D.2248_7 = i.1_34 + D.2246_5; a_p.2_8 = &a[D.2248_7]; a_p = a_p.2_8; b_p.3_13 = &b[D.2248_7]; b_p = b_p.3_13; MEM[(int *)&a][D.2248_7] = 7; MEM[(int *)&b][D.2248_7] = 7; i.6_18 = k.0_9 + i.1_34; i = i.6_18; if (i.6_18 <= 511) goto ; else goto ; : goto ; : Invalid sum of incoming frequencies 81, should be 900 k.7_20 = k.0_9 + 1; k = k.7_20; if (k.7_20 <= 9) goto ; else goto ; : goto ; : return; } Can the following statements be sinked out of loop? I don't see this optimization happen in trunk. The consequence is register pressure increased and a spill/fill occurs in RA. a_p.2_8 = &a[D.2248_7]; a_p = a_p.2_8; b_p.3_13 = &b[D.2248_7]; b_p = b_p.3_13; I know the sink would happen in sink pass if a_p and b_p are local variables. If this is the root cause, which optimization pass in GCC take the role to sink them out of loop? How should we get it fixed? Thanks, -Jiangning
RE: Is VRP is too conservative to identify boolean value 0 and 1?
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Friday, September 02, 2011 5:07 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: Is VRP is too conservative to identify boolean value 0 and > 1? > > On Fri, Sep 2, 2011 at 7:58 AM, Jiangning Liu > wrote: > > Hi, > > > > For the following small case, > > > > int f(int i, int j) > > { > > if (i==1 && j==2) > > return i; > > else > > return j; > > } > > > > with -O2 option, GCC has vrp2 dump like below, > > > > == > > > > Value ranges after VRP: > > > > i_1: VARYING > > i_2(D): VARYING > > D.1249_3: [0, +INF] > > j_4(D): VARYING > > D.1250_5: [0, +INF] > > D.1251_6: [0, +INF] > > j_10: [2, 2] EQUIVALENCES: { j_4(D) } (1 elements) > > > > > > Removing basic block 3 > > f (int i, int j) > > { > > _Bool D.1251; > > _Bool D.1250; > > _Bool D.1249; > > > > : > > D.1249_3 = i_2(D) == 1; > > D.1250_5 = j_4(D) == 2; > > D.1251_6 = D.1250_5 & D.1249_3; > > if (D.1251_6 != 0) > > goto ; > > else > > goto ; > > > > : > > > > : > > # i_1 = PHI <1(3), j_4(D)(2)> > > return i_1; > > > > } > > > > > > > > Variable D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so > the > > their value ranges should be > > > > D.1249_3: [0, 1] > > D.1250_5: [0, 1] > > D.1251_6: [0, 1] > > > > So why current VRP can't find out this value range? > > It does - it just prints it as [0, +INF], they are bools with > TYPE_MAX_VALUE > == 1 after all. Richard, May I use REG_EXPR(rtx of D.1249_3) in xxx.md file to detect whether D.1249_3 is a bool or not? Some comments in GCC says REG_EXPR may be lost in back-end. True? If we do have REG_EXPR info for some cases in back-end, is it guaranteed to be correct? May I implementing back-end peephole optimization depending on REG_EXPR? Thanks, -Jiangning > > Richard. > > > > > I'm asking this question because the optimizations in back-end need > this > > info to do advanced optimization. > > > > Thanks, > > -Jiangning > > > > > >
A question about redudant load elimination
Hi, For this test case, int x; extern void f(void); void g(int *a) { a[x] = 1; if (x == 100) f(); a[x] = 2; } For trunk, the x86 assembly code is like below, movlx, %eax movl16(%esp), %ebx movl$1, (%ebx,%eax,4) movlx, %eax // Is this a redundant one? cmpl$100, %eax je .L4 movl$2, (%ebx,%eax,4) addl$8, %esp .cfi_remember_state .cfi_def_cfa_offset 8 popl%ebx .cfi_restore 3 .cfi_def_cfa_offset 4 ret .p2align 4,,7 .p2align 3 .L4: .cfi_restore_state callf movlx, %eax movl$2, (%ebx,%eax,4) addl$8, %esp .cfi_def_cfa_offset 8 popl%ebx .cfi_restore 3 .cfi_def_cfa_offset 4 Ret Is the 2nd "movl x, %eax" is a redundant one for single thread programming model? If yes, can this be optimized away? Thanks, -Jiangning
RE: A case that PRE optimization hurts performance
> -Original Message- > From: Jeff Law [mailto:l...@redhat.com] > Sent: Tuesday, September 27, 2011 12:43 AM > To: Richard Guenther > Cc: Jiangning Liu; gcc@gcc.gnu.org > Subject: Re: A case that PRE optimization hurts performance > > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 09/26/11 05:00, Richard Guenther wrote: > > On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu > > wrote: > >>>> * Without PRE, > >>>> > >>>> Path1: movl$2, %eax cmpl$1, %eax je .L3 > >>>> > >>>> Path2: movb$3, %al cmpl$1, %eax je .L3 > >>>> > >>>> Path3: cmpl$1, %eax jne .L14 > >>>> > >>>> * With PRE, > >>>> > >>>> Path1: movl$1, %ebx movl$2, %eax testb %bl, %bl je > >>>> .L3 > >>>> > >>>> Path2: movl$1, %ebx movb$3, %al testb %bl, %bl je > >>>> .L3 > >>>> > >>>> Path3: cmpl$1, %eax setne %bl testb %bl, %bl jne > >>>> .L14 > >>>> > >>>> Do you have any more thoughts? > >>> > >>> It seems to me that with PRE all the testb %bl, %bl should be > >>> evaluated at compile-time considering the preceeding movl $1, > >>> %ebx. Am I missing something? > >>> > >> > >> Yes. Can this be done by PRE or any other optimizations in middle > >> end? > > > > Hm, the paths as you quote them are obfuscated by missed > > jump-threading. On the tree level we have > > > > # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)> # prephitmp.6_1 = PHI > > <1(5), 1(4), 1(6), prephitmp.6_3(7)> : t_14 = t_24 + 1; > > D.2729_6 = MEM[base: t_14, offset: 0B]; D.2732_7 = D.2729_6 != 0; > > D.2734_9 = prephitmp.6_1 & D.2732_7; if (D.2734_9 != 0) > > > > where we could thread the cases with prephitmp.6_1 == 1, > > ultimately removing the & and forwarding the D.2729_6 != 0 test. > > Which would of course cause some code duplication. > > > > Jeff, you recently looked at tree jump-threading, can you see if > > we can improve things on this particular testcase? > There's nothing threading can do here because it doesn't know anything > about the value MEM[t14]. > Jeff, Could you please explain more about this? What information does jump threading want to know on MEM[t14]? Do you mean it's hard to duplicate that basic block due to some reasons? Thanks, -Jiangning > > Jeff > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJOgKuLAAoJEBRtltQi2kC75aIH/iikuOQXrMrQJFbQw0COXznB > OGq8iXdGwTJGH13vxdItTE0upJp7RgUVLzuhdqj1elTLHv/ujYygMsQRNGKcc8tb > GMLECmWDhZqQTFXcTJCgJNZiv7MH1PNELXSdIkkSnxY+pwyn9AX5D3+HcTSjGU6B > 51AdUNVph/VSaVboAgcrFpu9S0pX9HVTqFy4JI83Lh613zDVSmPo14DDy7vjBvE9 > 2Srlvlw0srYup97bGmRqN8wT4ZLLlyYSB2rjEFc6jmgXVncxiteQYIUZpy0lcC0M > q3j80aXjZ57/iWyAbqDr1jI5tbVKDBkRa9LL1jvn9534adiG4GrnSMPhoog0ibA= > =azr5 > -END PGP SIGNATURE-
RE: A case that PRE optimization hurts performance
> > * Without PRE, > > > > Path1: > > movl $2, %eax > > cmpl $1, %eax > > je .L3 > > > > Path2: > > movb $3, %al > > cmpl $1, %eax > > je .L3 > > > > Path3: > > cmpl $1, %eax > > jne .L14 > > > > * With PRE, > > > > Path1: > > movl $1, %ebx > > movl $2, %eax > > testb %bl, %bl > > je .L3 > > > > Path2: > > movl $1, %ebx > > movb $3, %al > > testb %bl, %bl > > je .L3 > > > > Path3: > > cmpl $1, %eax > > setne %bl > > testb %bl, %bl > > jne .L14 > > > > Do you have any more thoughts? > > It seems to me that with PRE all the testb %bl, %bl > should be evaluated at compile-time considering the > preceeding movl $1, %ebx. Am I missing something? > Yes. Can this be done by PRE or any other optimizations in middle end? Thanks, -Jiangning > Richard. >
RE: A question about detecting array bounds for case Warray-bounds-3.c
PING... > -Original Message- > From: Jiangning Liu [mailto:jiangning@arm.com] > Sent: Thursday, September 22, 2011 10:19 AM > To: gcc@gcc.gnu.org > Cc: 'ja...@gcc.gnu.org'; 'muel...@gcc.gnu.org'; 'rgue...@gcc.gnu.org'; > Matthew Gretton-Dann > Subject: A question about detecting array bounds for case Warray- > bounds-3.c > > Hi, > > For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, obviously it is an > invalid C program, because the last iterations of all the loops cause > the access of arrays is beyond the max size of corresponding array > declarations. The condition of checking upper bound should be "<" > rather than "<=". > > Right now, GCC compiler doesn't report any warning messages for this > case, should it be a bug in both test case and compiler? > > But looking at http://gcc.gnu.org/PR31227 , it seems this test case is > designed to be like this on purpose. Anybody can explain about this? > > The case is like below, > > /* { dg-do compile } */ > /* { dg-options "-O2 -Warray-bounds" } */ > /* based on PR 31227 */ > > struct S > { > const char *abday[7]; > const char *day[7]; > const char *abmon[12]; > const char *mon[12]; > const char *am_pm[2]; > }; > > ... > > for (cnt = 0; cnt <= 7; ++cnt) > { > iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: ""); > iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; > } > > for (; cnt <= 14; ++cnt) > { > iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: ""); > iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; > } > > for (; cnt <= 26; ++cnt) > { > iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: ""); > iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; > } > > for (; cnt <= 38; ++cnt) > { > iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: ""); > iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; > } > > for (; cnt <= 40; ++cnt) > { > iov[2 + cnt].iov_base = (void *) (time->am_pm[cnt - 38] ?: ""); > iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; > } > > Thanks, > -Jiangning
A question about detecting array bounds for case Warray-bounds-3.c
Hi, For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, obviously it is an invalid C program, because the last iterations of all the loops cause the access of arrays is beyond the max size of corresponding array declarations. The condition of checking upper bound should be "<" rather than "<=". Right now, GCC compiler doesn't report any warning messages for this case, should it be a bug in both test case and compiler? But looking at http://gcc.gnu.org/PR31227 , it seems this test case is designed to be like this on purpose. Anybody can explain about this? The case is like below, /* { dg-do compile } */ /* { dg-options "-O2 -Warray-bounds" } */ /* based on PR 31227 */ struct S { const char *abday[7]; const char *day[7]; const char *abmon[12]; const char *mon[12]; const char *am_pm[2]; }; ... for (cnt = 0; cnt <= 7; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 14; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 26; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 38; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 40; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->am_pm[cnt - 38] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } Thanks, -Jiangning
RE: A case that PRE optimization hurts performance
Hi Richard, I slightly changed the case to be like below, int f(char *t) { int s=0; while (*t && s != 1) { switch (s) { case 0: /* path 1 */ s = 2; break; case 2: /* path 2 */ s = 3; /* changed */ break; default: /* path 3 */ if (*t == '-') s = 2; break; } t++; } return s; } "-O2" is still worse than "-O2 -fno-tree-pre". "-O2 -fno-tree-pre" result is f: pushl %ebp xorl%eax, %eax movl%esp, %ebp movl8(%ebp), %edx movzbl (%edx), %ecx jmp .L14 .p2align 4,,7 .p2align 3 .L5: movl$2, %eax .L7: addl$1, %edx cmpl$1, %eax movzbl (%edx), %ecx je .L3 .L14: testb %cl, %cl je .L3 testl %eax, %eax je .L5 cmpl$2, %eax .p2align 4,,5 je .L17 cmpb$45, %cl .p2align 4,,5 je .L5 addl$1, %edx cmpl$1, %eax movzbl (%edx), %ecx jne .L14 .p2align 4,,7 .p2align 3 .L3: popl%ebp .p2align 4,,2 ret .p2align 4,,7 .p2align 3 .L17: movb$3, %al .p2align 4,,3 jmp .L7 While "-O2" result is f: pushl %ebp xorl%eax, %eax movl%esp, %ebp movl8(%ebp), %edx pushl %ebx movzbl (%edx), %ecx jmp .L14 .p2align 4,,7 .p2align 3 .L5: movl$1, %ebx movl$2, %eax .L7: addl$1, %edx testb %bl, %bl movzbl (%edx), %ecx je .L3 .L14: testb %cl, %cl je .L3 testl %eax, %eax je .L5 cmpl$2, %eax .p2align 4,,5 je .L16 cmpb$45, %cl .p2align 4,,5 je .L5 cmpl$1, %eax setne %bl addl$1, %edx testb %bl, %bl movzbl (%edx), %ecx jne .L14 .p2align 4,,7 .p2align 3 .L3: popl%ebx popl%ebp ret .p2align 4,,7 .p2align 3 .L16: movl$1, %ebx movb$3, %al jmp .L7 You may notice that register ebx is introduced, and some more instructions around ebx are generated as well. i.e. setne %bl testb %bl, %bl I agree with you that in theory PRE does the right thing to minimize the computation cost on gimple level. However, the problem is the cost of converting comparison result to a bool value is not considered, so it actually makes binary code worse. For this case, as I summarized below, to complete the same functionality "With PRE" is worse than "Without PRE" for all three paths, * Without PRE, Path1: movl$2, %eax cmpl$1, %eax je .L3 Path2: movb$3, %al cmpl$1, %eax je .L3 Path3: cmpl$1, %eax jne .L14 * With PRE, Path1: movl$1, %ebx movl$2, %eax testb %bl, %bl je .L3 Path2: movl$1, %ebx movb$3, %al testb %bl, %bl je .L3 Path3: cmpl$1, %eax setne %bl testb %bl, %bl jne .L14 Do you have any more thoughts? Thanks, -Jiangning > -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Tuesday, August 02, 2011 5:23 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: A case that PRE optimization hurts performance > > On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu > wrote: > > Hi, > > > > For the following simple test case, PRE optimization hoists > computation > > (s!=1) into the default branch of the switch statement, and finally > causes > > very poor code generation. This problem occurs in both X86 and ARM, > and I > > believe it is also a problem for other targets. > > > > int f(char *t) { > > int s=0; > > > > while (*t && s != 1) { > > switch (s) { > > case 0: > > s = 2; > > break; > > case 2: > > s = 1; > > break; > > default: > > if (*t == '-') > > s = 1; > > break; > > } > > t++; > > } > > > > return s; > > } > > > > Taking X86 as an example, with option "-O2" you may find 52 > instructions > > generated like below, > > > > : >
RE: Is VRP is too conservative to identify boolean value 0 and 1?
Andrew, I realize I needn't back-end solution for my case at all, and in middle end I can directly use the _Bool type info! Appreciate your reply! Thanks, -Jiangning > -Original Message- > From: Andrew Pinski [mailto:pins...@gmail.com] > Sent: Friday, September 02, 2011 2:27 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: Is VRP is too conservative to identify boolean value 0 and > 1? > > On Thu, Sep 1, 2011 at 10:58 PM, Jiangning Liu > wrote: > > D.1249_3: [0, 1] > > D.1250_5: [0, 1] > > D.1251_6: [0, 1] > > Those are equivalent to [0, MAX] as _Bool only has two different > values, 0 and 1 (MAX). Can you explain more about the optimization > which you are working on that needs the ranges as (int)[0,1] rather > than (_Bool)[0,MAX] ? > > Thanks, > Andrew Pinski
Is VRP is too conservative to identify boolean value 0 and 1?
Hi, For the following small case, int f(int i, int j) { if (i==1 && j==2) return i; else return j; } with -O2 option, GCC has vrp2 dump like below, == Value ranges after VRP: i_1: VARYING i_2(D): VARYING D.1249_3: [0, +INF] j_4(D): VARYING D.1250_5: [0, +INF] D.1251_6: [0, +INF] j_10: [2, 2] EQUIVALENCES: { j_4(D) } (1 elements) Removing basic block 3 f (int i, int j) { _Bool D.1251; _Bool D.1250; _Bool D.1249; : D.1249_3 = i_2(D) == 1; D.1250_5 = j_4(D) == 2; D.1251_6 = D.1250_5 & D.1249_3; if (D.1251_6 != 0) goto ; else goto ; : : # i_1 = PHI <1(3), j_4(D)(2)> return i_1; } Variable D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so the their value ranges should be D.1249_3: [0, 1] D.1250_5: [0, 1] D.1251_6: [0, 1] So why current VRP can't find out this value range? I'm asking this question because the optimizations in back-end need this info to do advanced optimization. Thanks, -Jiangning
RE: [RFC] Add middle end hook for stack red zone size
Hi Jakub, Appreciate for your valuable comments! I think SPARC V9 ABI doesn't have red zone defined, right? So stack_red_zone_size should be defined as zero by default, the scheduler would block moving memory accesses across stack adjustment no matter what the offset is. I don't see any risk here. Also, in my patch function *abs* is being used to avoid the opposite stack direction issue as you mentioned. Some people like you insist on the ABI diversity, and actually I agree with you on this. But part of the ABI definition is general for all targets. The point here is memory access beyond stack red zone should be avoided, which is the general part of ABI that compiler should guarantee. For this general part, middle end should take the responsibility. Thanks, -Jiangning > -Original Message- > From: Jakub Jelinek [mailto:ja...@redhat.com] > Sent: Monday, August 01, 2011 6:31 PM > To: Jiangning Liu > Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; > vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana > Radhakrishnan; 'Ramana Radhakrishnan' > Subject: Re: [RFC] Add middle end hook for stack red zone size > > On Mon, Aug 01, 2011 at 06:14:27PM +0800, Jiangning Liu wrote: > > ARM. You are right, they were all fixed in back-ends in the past, but > we > > should > > fix the bug in a general way to make GCC infrastructure stronger, > rather > > than fixing the problem target-by-target and case-by-case! If you > further > > look into the back-end fixes in x86 and PowerPC, you may find they > looks > > quite similar in back-ends. > > > > Red zone is only one difficulty, your patch is e.g. completely ignoring > existence of biased stack pointers (e.g. SPARC -m64 has them). > Some targets have stack growing in opposite direction, etc. > We have really a huge amount of very diverse ABIs and making the > middle-end > grok what is an invalid stack access is difficult. > > Jakub
A case that PRE optimization hurts performance
Hi, For the following simple test case, PRE optimization hoists computation (s!=1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs in both X86 and ARM, and I believe it is also a problem for other targets. int f(char *t) { int s=0; while (*t && s != 1) { switch (s) { case 0: s = 2; break; case 2: s = 1; break; default: if (*t == '-') s = 1; break; } t++; } return s; } Taking X86 as an example, with option "-O2" you may find 52 instructions generated like below, : 0: 55 push %ebp 1: 31 c0 xor%eax,%eax 3: 89 e5 mov%esp,%ebp 5: 57 push %edi 6: 56 push %esi 7: 53 push %ebx 8: 8b 55 08mov0x8(%ebp),%edx b: 0f b6 0amovzbl (%edx),%ecx e: 84 c9 test %cl,%cl 10: 74 50 je 62 12: 83 c2 01add$0x1,%edx 15: 85 c0 test %eax,%eax 17: 75 23 jne3c 19: 8d b4 26 00 00 00 00lea0x0(%esi,%eiz,1),%esi 20: 0f b6 0amovzbl (%edx),%ecx 23: 84 c9 test %cl,%cl 25: 0f 95 c0setne %al 28: 89 c7 mov%eax,%edi 2a: b8 02 00 00 00 mov$0x2,%eax 2f: 89 fb mov%edi,%ebx 31: 83 c2 01add$0x1,%edx 34: 84 db test %bl,%bl 36: 74 2a je 62 38: 85 c0 test %eax,%eax 3a: 74 e4 je 20 3c: 83 f8 02cmp$0x2,%eax 3f: 74 1f je 60 41: 80 f9 2dcmp$0x2d,%cl 44: 74 22 je 68 46: 0f b6 0amovzbl (%edx),%ecx 49: 83 f8 01cmp$0x1,%eax 4c: 0f 95 c3setne %bl 4f: 89 df mov%ebx,%edi 51: 84 c9 test %cl,%cl 53: 0f 95 c3setne %bl 56: 89 de mov%ebx,%esi 58: 21 f7 and%esi,%edi 5a: eb d3 jmp2f 5c: 8d 74 26 00 lea0x0(%esi,%eiz,1),%esi 60: b0 01 mov$0x1,%al 62: 5b pop%ebx 63: 5e pop%esi 64: 5f pop%edi 65: 5d pop%ebp 66: c3 ret 67: 90 nop 68: b8 01 00 00 00 mov$0x1,%eax 6d: 5b pop%ebx 6e: 5e pop%esi 6f: 5f pop%edi 70: 5d pop%ebp 71: c3 ret But with command line option "-O2 -fno-tree-pre", there are only 12 instructions generated, and the code would be very clean like below, : 0: 55 push %ebp 1: 31 c0 xor%eax,%eax 3: 89 e5 mov%esp,%ebp 5: 8b 55 08mov0x8(%ebp),%edx 8: 80 3a 00cmpb $0x0,(%edx) b: 74 0e je 1b d: 80 7a 01 00 cmpb $0x0,0x1(%edx) 11: b0 02 mov$0x2,%al 13: ba 01 00 00 00 mov$0x1,%edx 18: 0f 45 c2cmovne %edx,%eax 1b: 5d pop%ebp 1c: c3 ret Do you have any idea about this? Thanks, -Jiangning
RE: [RFC] Add middle end hook for stack red zone size
The answer is ARM can. However, if you look into the bugs PR30282 and PR38644, PR44199, you may find in history, there are several different cases in different ports reporting the similar failures, covering x86, PowerPC and ARM. You are right, they were all fixed in back-ends in the past, but we should fix the bug in a general way to make GCC infrastructure stronger, rather than fixing the problem target-by-target and case-by-case! If you further look into the back-end fixes in x86 and PowerPC, you may find they looks quite similar in back-ends. Thanks, -Jiangning > -Original Message- > From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] > On Behalf Of Jakub Jelinek > Sent: Monday, August 01, 2011 5:12 PM > To: Jiangning Liu > Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; > vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana > Radhakrishnan; 'Ramana Radhakrishnan' > Subject: Re: [RFC] Add middle end hook for stack red zone size > > On Mon, Aug 01, 2011 at 11:44:04AM +0800, Jiangning Liu wrote: > > It's quite necessary to solve the general problem in middle-end rather than in > back-end. > > That's what we disagree on. All back-ends but ARM are able to handle it > right, why can't ARM too? The ABI rules for stack handling in the epilogues > are simply too diverse and complex to be handled easily in the scheduler. > > Jakub
RE: [RFC] Add middle end hook for stack red zone size
Joern, Thanks for your valuable feedback! This is only a RFC, and I will send out formal patch along with ChangLog later on. Basically, my patch is only to add new dependence in scheduler, and it only blocks some instruction movements, so it is NO RISK to compiler correctness. For whatever stack pointer changes you gave in different scenarios, the current code base should already work. My patch intends neither to replace old dependences, nor maximize the scheduler capability due to the existence of red zone in stack. It is only to block the memory access moving over stack pointer adjustment if distance is beyond red zone size, which is an OS requirement due to interruption existence. Stack adjustment in epilogue is a very general usage in stack frame. It's quite necessary to solve the general problem in middle-end rather than in back-end. Also, that old patch you attached is to solve the data dependence between two memory accesses, but stack pointer doesn't really have data dependence with memory access without using stack pointer, so they have different stories. Alternative solution of without adding blunt scheduling barrier is we insert an independent pass before scheduler to create RTL barrier by using the same interface stack_red_zone_size, but it would really be an over-design, if we add a new pass only for this *small* functionality. In my patch, *abs* of offset is being used, so you are right that it's possible to get false positive to be too conservative, but there won't exist false negative, because my code would only add new dependences. Since the compilation is based on function, it would be OK if red zone size varies due to different ABI. Could you please tell me exactly on what system being supported by GCC red zone size can be different for incoming and outgoing? And also how scheduler guarantee the correctness in current code base? Anyway, I don't think my patch will break the original solution. Thanks, -Jiangning > -Original Message- > From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] > On Behalf Of Joern Rennecke > Sent: Tuesday, July 26, 2011 10:33 AM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; > dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana > Radhakrishnan' > Subject: RE: [RFC] Add middle end hook for stack red zone size > > Quoting Jiangning Liu : > > > Hi, > > > > One month ago, I sent out this RFC to *gcc-patches* mail list, but I > > didn't receive any response yet. So I'm forwarding this mail to > > *gcc* mail list. Can anybody here really give feedback to me? > > Well, I couldn't approve any patch, but I can point out some issues with your > patch. > > First, it's missing a ChangeLog, and you don't state how you have tested it. > And regarding the code in sched_analyze_1, I think you'll get false positives > with > alloca, and false negatives when registers are involved to compute offsets or > to > restore the stack pointer from. > > FWIW, I think generally blunt scheduling barriers should be avoided, and > instead > the dependencies made visible to the scheduler. > E.g., I've been working with another architecture with a redzone, where at > -fno- > omit-frame-pointer, the prologue can put pretend_args into the redzone, then > after > stack adjustment and frame allocation, these arguments are accessed via the > frame > pointer. > > With the attached patchlet, alias analysis works for this situation too, so > no blunt > scheduling block is required. > > Likewise, with stack adjustments, they should not affect scheduling in > general, but > be considered to clobber the part of the frame that is being exposed to > interrupt > writes either before or after the adjustment. > At the moment, each port that wants to have such selective scheduling > blockages > has to define a stack_adjust pattern with a memory clobber in a parallel, > with a > memref that shows the high water mark of possible interrupt stack writes. > Prima facia it would seem convenient if you only had to tell the middle-end > about > the redzone size, and it could figure out the implicit clobbers when the > stack is > changed. However, when a big stack adjustment is being made, or the stack is > realigned, or restored from the frame pointer / another register where it was > saved due to realignment, the adjustment is not so obvious. I'm not sure if > you can > actually create an robust interface that's simpler to use than putting the > right > memory clobber in the stack adjust pattern. Note also that the redzone size > can > vary from function to function depending on ABI-altering attributes, in > particular > for interrupt functions, which can also have different incoming and outgoing > redzone sizes. Plus, you can have an NMI / reset handler which can use the > stack > like an ordinary address register.
RE: [RFC] Add middle end hook for stack red zone size
Hi, One month ago, I sent out this RFC to *gcc-patches* mail list, but I didn't receive any response yet. So I'm forwarding this mail to *gcc* mail list. Can anybody here really give feedback to me? Appreciate your help in advance! -Jiangning -Original Message- From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org] Sent: Tuesday, July 19, 2011 6:18 PM To: Jiangning Liu Cc: gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana Radhakrishnan Subject: Re: [RFC] Add middle end hook for stack red zone size 2011/7/19 Jiangning Liu : > > I see a lot of feedbacks on other posts, but mine is still with ZERO > response in the past 3 weeks, so I'm wondering if I made any mistake in my > process? Who can help me? It would be worth CC'ing the other relevant target maintainers as well to get some feedback since the patch touches ARM, x86 and Powerpc. I've added the maintainers for i386 and PPC to the CC list using the email addresses from the MAINTAINERS file. Thanks, Ramana > > Thanks, > -Jiangning > > -Original Message- > From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] > On Behalf Of Jiangning Liu > Sent: Tuesday, July 05, 2011 8:32 AM > To: gcc-patc...@gcc.gnu.org; rgue...@gcc.gnu.org > Subject: RE: [RFC] Add middle end hook for stack red zone size > > PING... > > I just merged with the latest code base and generated new patch as attached. > > Thanks, > -Jiangning > >> -Original Message- >> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches- >> ow...@gcc.gnu.org] On Behalf Of Jiangning Liu >> Sent: 2011年6月28日 4:38 PM >> To: gcc-patc...@gcc.gnu.org >> Subject: [RFC] Add middle end hook for stack red zone size >> >> This patch is to fix PR38644, which is a bug with long history about >> stack red zone access, and PR30282 is correlated. >> >> Originally red zone concept is not exposed to middle-end, and back-end >> uses special logic to add extra memory barrier RTL and help the >> correct dependence in middle-end. This way different back-ends must >> handle red zone problem by themselves. For example, X86 target >> introduced function >> ix86_using_red_zone() to judge red zone access, while POWER introduced >> offset_below_red_zone_p() to judge it. Note that they have different >> semantics, but the logic in caller sites of back-end uses them to >> decide whether adding memory barrier RTL or not. If back-end >> incorrectly handles this, bug would be introduced. >> >> Therefore, the correct method should be middle-end handles red zone >> related things to avoid the burden in different back-ends. To be >> specific for PR38644, this middle-end problem causes incorrect >> behavior for ARM target. >> This patch exposes red zone concept to middle-end by introducing a >> middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE defined in >> target.def, and by default its value is 0. Back-end may redefine this >> function to provide concrete red zone size according to specific ABI >> requirements. >> >> In middle end, scheduling dependence is modified by using this hook >> plus checking stack frame pointer adjustment instruction to decide >> whether memory references need to be all flushed out or not. In >> theory, if TARGET_STACK_RED_ZONE_SIZE is defined correctly, back-end >> would not be required to specially handle this scheduling dependence >> issue by introducing extra memory barrier RTL. >> >> In back-end, the following changes are made to define the hook, >> 1) For X86, TARGET_STACK_RED_ZONE_SIZE is redefined to be >> ix86_stack_red_zone_size() in i386.c, which is an newly introduced >> function. >> 2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to be >> rs6000_stack_red_zone_size() in rs6000.c, which is also a newly >> defined function. >> 3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to be >> default_stack_red_zone_size in targhooks.c, and this function returns >> 0, which means ARM eabi and others don't support red zone access at all. >> >> In summary, the relationship between ABI and red zone access is like >> below, >> >> - >> | ARCH | ARM | X86 |POWER | others | >> |--|---|---|---|| >> |ABI | EABI | MS_64 | other | AIX | V4 || >> |--|---|---|---||--|| >> | RED ZONE | No | YES | No | YES | No | No | >> |--|---|---|---||--|| >> | RED ZONE SIZE| 0 | 128 | 0 |220/288 | 0 |0 | >> - >> >> Thanks, >> -Jiangning > > > >