An issue on loop optimization/vectorization
For the case below, the code generated by "gcc -O3" is very ugly.

    char g_d[1024], g_s1[1024], g_s2[1024];

    void test_loop(void)
    {
        char *d = g_d, *s1 = g_s1, *s2 = g_s2;
        for ( int y = 0; y < 128; y++ ) {
            for ( int x = 0; x < 16; x++ )
                d[x] = s1[x] + s2[x];
            d += 16;
        }
    }

If we change "for( int x = 0; x < 16; x++ )" to "for( int x = 0; x < 32; x++ )", very beautiful vectorized code is generated:

    test_loop:
    .LFB0:
            .cfi_startproc
            adrp    x2, g_s1
            adrp    x3, g_s2
            add     x2, x2, :lo12:g_s1
            add     x3, x3, :lo12:g_s2
            adrp    x0, g_d
            adrp    x1, g_d+2048
            add     x0, x0, :lo12:g_d
            add     x1, x1, :lo12:g_d+2048
            ldp     q1, q2, [x2]
            ldp     q3, q0, [x3]
            add     v1.16b, v1.16b, v3.16b
            add     v0.16b, v0.16b, v2.16b
            .p2align 3,,7
    .L2:
            str     q1, [x0]
            str     q0, [x0, 16]!
            cmp     x0, x1
            bne     .L2
            ret

The code generated for "for( int x = 0; x < 8; x++ )" is also very ugly. It looks like gcc has potential bugs in loop vectorization. Any idea?

Thanks,
-Jiangning
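Since s1 and s2 never advance in the outer loop, the per-row sums are loop-invariant, which is exactly what the good assembly above exploits: it performs the vector adds once and only stores inside the loop. As a sketch of that shape (my addition, not compiler output), the same thing can be written by hand with GCC's generic vector extensions; `test_loop_vec` and `test_loop_ref` are names I introduce here, and the arrays are sized 2048 so that the 128 stores of 16 bytes stay in bounds (note the original posting declares g_d[1024], which the loop actually overruns):

```c
#include <string.h>

char g_d[2048], g_s1[1024], g_s2[1024];

/* Hand-vectorized sketch of the x < 16 variant: one loop-invariant
   16-byte vector add, then one 16-byte store per row. */
typedef unsigned char v16qi __attribute__((vector_size(16)));

void test_loop_vec(void)
{
    v16qi s1, s2, sum;
    memcpy(&s1, g_s1, 16);
    memcpy(&s2, g_s2, 16);
    sum = s1 + s2;                    /* computed once, outside the loop */
    for (int y = 0; y < 128; y++)
        memcpy(g_d + 16 * y, &sum, 16);
}

/* Scalar reference, written as in the original report. */
void test_loop_ref(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;
    for (int y = 0; y < 128; y++) {
        for (int x = 0; x < 16; x++)
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}
```

Byte addition wraps identically for signed and unsigned char at the bit level, so the two functions produce the same 2048 bytes.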
Re: [RFC, ivopts] fix bugs in ivopts address cost computation
Hi,

For the following code change,

@@ -4212,11 +4064,6 @@ get_computation_cost_at (struct ivopts_d
       cost.cost += adjust_setup_cost (data,
                                       add_cost (TYPE_MODE (ctype), speed));

-  /* Having offset does not affect runtime cost in case it is added to
-     symbol, but it increases complexity.  */
-  if (offset)
-    cost.complexity++;
-
   cost.cost += add_cost (TYPE_MODE (ctype), speed);

   aratio = ratio > 0 ? ratio : -ratio;

I think this shouldn't be removed. The offset may be affected by the position where the reduction variable accumulation statement is inserted. The cases before and after the reduction variable accumulation differ, so the cost of replacing a use point with the reduction variable should differ accordingly.

BTW, I personally think the current ivopts cost modelling basically works fine, although some tuning may be needed. The most difficult part is that the choice of reduction variable candidates interacts with register pressure cost, while the register cost estimate is not accurate enough at this stage because we don't have a back-end live range interference graph at all. We can always find holes in particular cases or benchmarks, but we shouldn't tune only for an optimal result on those; the tuning needs to be backed by more comprehensive results.

Thanks,
-Jiangning

2012/6/6 Sandra Loosemore <san...@codesourcery.com>:

My colleagues and I have been working on the GCC port for the Qualcomm Hexagon. Along the way I noticed that we were getting poor results from the ivopts pass no matter how we adjusted the target-specific RTX costs. In many cases ivopts was coming up with candidate use costs that seemed completely inconsistent with the target cost model.
On further inspection, I found what appears to be a whole bunch of bugs in the way ivopts is computing address costs:

(1) While the address cost computation is assuming in some situations that pre/post increment/decrement addressing will be used if supported by the target, it isn't actually using the target's address cost for such forms -- instead, just the cost of the form that would be used if autoinc weren't available/applicable.

(2) The computation to determine which multiplier values are supported by target addressing modes is constructing an address rtx of the form (reg * ratio) to do the tests. This isn't a valid address RTX on Hexagon, although both (reg + reg * ratio) and (sym + reg * ratio) are. Because it's choosing the wrong address form to probe with, it thinks that the target doesn't support multipliers at all and is incorrectly tacking on an extra cost for them. I also note that it's assuming that the same set of ratios is supported by all three address forms that can potentially include them, and that all valid ratios have the same cost.

(3) The computation to determine the range of valid constant offsets for address forms that can include them is probing the upper end of the range using constants of the form ((1<<n) - 1). On Hexagon, the offsets have to be aligned appropriately for the mode, so it's incorrectly rejecting all positive offsets for non-char modes. And again, it's assuming that the same range of offsets is supported by all address forms that can legitimately include them, and that all valid offsets have the same cost. The latter isn't true on Hexagon.

(4) The cost adjustment for converting the symbol_present address to a var_present address seems overly optimistic in assuming that the symbol load will be hoisted outside the loop. I looked at a lot of code where this was not happening no matter how expensive I made the absolute addressing forms in the target-specific costs.
Also, if subsequent passes actually do hoist the symbol load, this requires an additional register to be available over the entire loop, which ivopts isn't accounting for in any way. It seems to me that this adjustment shouldn't be attempted when the symbol_present form is a legitimate address (because subsequent passes are not doing the anticipated optimization in that case). There's also a bug present in the cost accounting: it's adding the full cost of the symbol + var addition, whereas it should be pro-rated across iterations (see the way this is handled for the non-address cases in get_computation_cost_at).

(5) The documentation of TARGET_ADDRESS_COST says that when two (legitimate) address expressions have the same lowest cost, the one with the higher complexity is used. But the ivopts code is doing the opposite of this and choosing the lower complexity as the tie-breaker. (Actually, it's also computing the complexity without regard to whether the address rtx is even legitimate on the target.)

(6) The way get_address_costs is precomputing and caching a complete set of cost data for each addressing mode seems incorrect to me, in the general case. For example, consider MIPS where MIPS16
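Point (1) concerns loops of the following shape, where a target with auto-increment addressing can fold the pointer bump into the memory access itself, so ivopts should be costing that form with the target's own address cost. This is a generic illustration I am adding, not Hexagon-specific code from the report:

```c
/* On targets with post-increment addressing, each access-plus-bump
   below can become a single auto-increment load or store; ivopts'
   address cost model is supposed to account for exactly that form. */
void copy_words(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        *dst++ = *src++;   /* candidates for post-increment addressing */
}
```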
RE: A question about loop ivopt
-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, May 15, 2012 10:17 PM
To: Zdenek Dvorak
Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu
Subject: Re: A question about loop ivopt

On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak <rakd...@iuuk.mff.cuni.cz> wrote:

Hi,

Why can't we replace function force_expr_to_var_cost directly with function computation_cost in tree-ssa-loop-ivopt.c?

Actually I think it is inaccurate for the current recursive algorithm in force_expr_to_var_cost to estimate expr cost. Instead, computation_cost can take some back-end factors into account and make the estimate more accurate. For example, using computation_cost, we may check whether back-ends can tie mode transformations expressed by SUBREG or not. If we use force_expr_to_var_cost, the extra computations around type promotion/demotion would increase the estimated cost.

Looking at the algorithm in force_expr_to_var_cost, it just analyzes the operators in the expression and gives an estimate. Should it be the same as expanding EXPR to RTX and estimating as in computation_cost? Any thoughts? I suppose Zdenek may remember.

computation_cost actually expands the expression to RTL. Since cost estimates are computed a lot in ivopts, using it directly would require a huge amount of memory,

Zdenek, do you know how huge it is? Any data on this? For GCC, is this huge memory consumption critical enough, and isn't there anything else consuming even more memory?

no, not really (I haven't worked on this for a few years now), although of course I did some measurements when ivopts were created. Feel free to experiment with that; if it turned out that the memory consumption and extra time spent by it are negligible, it would be a nice cleanup.

Well, I don't think we should feed arbitrary expressions to expand at IVOPTs time. What probably matters most is address costs, no? At least that is where expand probably makes the most difference.
So why not improve force_expr_to_var_cost instead?

OK, yes, the thing that matters most is just address cost, so I can improve force_expr_to_var_cost. Would it sound OK if I expose MODES_TIEABLE_P to the middle-end by defining a new target hook? I need this function to strip some operations and make the cost estimate more accurate. If I don't expand to RTL, I would need a way to check mode conversions in the middle end anyway. Any idea?

Thanks,
-Jiangning

Richard.

Zdenek
RE: A question about loop ivopt
-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, May 22, 2012 6:36 PM
To: Jiangning Liu
Cc: Zdenek Dvorak; Jiangning Liu; gcc@gcc.gnu.org
Subject: Re: A question about loop ivopt

On Tue, May 22, 2012 at 11:19 AM, Jiangning Liu <jiangning@arm.com> wrote:

-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, May 15, 2012 10:17 PM
To: Zdenek Dvorak
Cc: Jiangning Liu; gcc@gcc.gnu.org; Jiangning Liu
Subject: Re: A question about loop ivopt

On Tue, May 15, 2012 at 4:13 PM, Zdenek Dvorak <rakd...@iuuk.mff.cuni.cz> wrote:

Hi,

Why can't we replace function force_expr_to_var_cost directly with function computation_cost in tree-ssa-loop-ivopt.c?

Actually I think it is inaccurate for the current recursive algorithm in force_expr_to_var_cost to estimate expr cost. Instead, computation_cost can take some back-end factors into account and make the estimate more accurate. For example, using computation_cost, we may check whether back-ends can tie mode transformations expressed by SUBREG or not. If we use force_expr_to_var_cost, the extra computations around type promotion/demotion would increase the estimated cost.

Looking at the algorithm in force_expr_to_var_cost, it just analyzes the operators in the expression and gives an estimate. Should it be the same as expanding EXPR to RTX and estimating as in computation_cost? Any thoughts? I suppose Zdenek may remember.

computation_cost actually expands the expression to RTL. Since cost estimates are computed a lot in ivopts, using it directly would require a huge amount of memory,

Zdenek, do you know how huge it is? Any data on this? For GCC, is this huge memory consumption critical enough, and isn't there anything else consuming even more memory?

no, not really (I haven't worked on this for a few years now), although of course I did some measurements when ivopts were created.
Feel free to experiment with that; if it turned out that the memory consumption and extra time spent by it are negligible, it would be a nice cleanup.

Well, I don't think we should feed arbitrary expressions to expand at IVOPTs time. What probably matters most is address costs, no? At least that is where expand probably makes the most difference. So why not improve force_expr_to_var_cost instead?

OK, yes, the thing that matters most is just address cost, so I can improve force_expr_to_var_cost. Would it sound OK if I expose MODES_TIEABLE_P to the middle-end by defining a new target hook? I need this function to strip some operations and make the cost estimate more accurate. If I don't expand to RTL, I would need a way to check mode conversions in the middle end anyway. Any idea?

You are already in the middle-end and thus can use MODES_TIEABLE_P directly. Modes are also available on gimple variables via DECL/TYPE_MODE.

Richard,

But MODES_TIEABLE_P is a macro hook and isn't exposed at the TREE level, so I would have to modify xxx-protos.h for all back-ends. An alternative is to define a new function hook; that way I wouldn't need to change all back-ends, but several back-ends would need to support it first. Which solution is usually preferred?

Thanks,
-Jiangning

Richard.

Thanks,
-Jiangning

Richard.

Zdenek
A question about loop ivopt
Hi,

Why can't we replace function force_expr_to_var_cost directly with function computation_cost in tree-ssa-loop-ivopt.c?

Actually I think it is inaccurate for the current recursive algorithm in force_expr_to_var_cost to estimate expr cost. Instead, computation_cost can take some back-end factors into account and make the estimate more accurate. For example, using computation_cost, we may check whether back-ends can tie mode transformations expressed by SUBREG or not. If we use force_expr_to_var_cost, the extra computations around type promotion/demotion would increase the estimated cost.

Looking at the algorithm in force_expr_to_var_cost, it just analyzes the operators in the expression and gives an estimate. Should it be the same as expanding EXPR to RTX and estimating as in computation_cost? Any thoughts?

Thanks,
-Jiangning
Why can't copy renaming capture this assignment?
Hi,

For this small case,

    char garr[100];

    void f(void)
    {
        unsigned short h, s;
        s = 20;
        for (h = 1; h < (s-1); h++) {
            garr[h] = 0;
        }
    }

After copyrename3, we have the following dump,

    f ()
    {
      short unsigned int h;
      int D.4066;

    <bb 2>:
      D.4066_14 = 1;
      if (D.4066_14 <= 18)
        goto <bb 3>;
      else
        goto <bb 4>;

    <bb 3>:
      # h_15 = PHI <h_8(3), 1(2)>
      # D.4066_16 = PHI <D.4066_4(3), D.4066_14(2)>
      garr[D.4066_16] = 0;
      h_8 = h_15 + 1;
      D.4066_4 = (int) h_8;
      if (D.4066_4 <= 18)
        goto <bb 3>;
      else
        goto <bb 4>;

    <bb 4>:
      return;
    }

copy renaming fails to capture the assignment statement

    D.4066_4 = (int) h_8;

to trigger renaming partition coalescing. I find that gimple_assign_single_p, invoked by gimple_assign_ssa_name_copy_p, always returns false, because for this statement gs->gsbase.subcode is NOP_EXPR rather than GIMPLE_SINGLE_RHS. Should subcode be correctly initialized anywhere to fix this problem?

BTW, my expectation after copy renaming is like below,

    f ()
    {
      int D.4679;

    <bb 2>:
      D.4679_7 = 1;
      if (D.4679_7 != 19)
        goto <bb 3>;
      else
        goto <bb 4>;

    <bb 3>:
      # D.4679_15 = PHI <D.4679_4(3), D.4679_7(2)>
      # D.4679_17 = PHI <D.4679_14(3), 1(2)>
      garr[D.4679_15] = 0;
      D.4679_14 = D.4679_17 + 1;
      D.4679_4 = D.4679_14;
      if (D.4679_4 != 19)
        goto <bb 3>;
      else
        goto <bb 4>;

    <bb 4>:
      return;
    }

and then PRE can finally remove that redundancy on symbol D.4679.

Thanks,
-Jiangning
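For reference, here is the test case in standalone runnable form (my addition, with a harness only to confirm its semantics). The key detail is that h and s are unsigned short, so the comparison h < (s-1) promotes h to int; that promotion is what the gimplifier materializes as the D.4066_4 = (int) h_8 statement that copy renaming fails to coalesce:

```c
char garr[100];

/* h runs over 1..18: s - 1 is 19 after promotion to int, and the
   loop condition compares the promoted (int) value of h against it. */
void f(void)
{
    unsigned short h, s;
    s = 20;
    for (h = 1; h < (s - 1); h++)
        garr[h] = 0;
}
```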
[PATCH][Testsuite] XFAIL scev-3/4.c and add scev-5.c
Hi,

This patch XFAILs scev-3.c and scev-4.c, and adds scev-5.c. The bug is going to be fixed after Richard Guenther fixes a series of problems related to POINTER_PLUS_EXPR and sizetype precision.

Thanks,
-Jiangning

ChangeLog for testsuite:

2012-03-21  Jiangning Liu  <jiangning@arm.com>

	PR tree-optimization/52563
	* gcc.dg/tree-ssa/scev-3.c: XFAIL on lp64.
	* gcc.dg/tree-ssa/scev-4.c: XFAIL on lp64.
	* gcc.dg/tree-ssa/scev-5.c: New.

Thanks,
-Jiangning

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
index 28d5c93..ed63a18 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -14,5 +14,5 @@ f(int k)
   }
 }
 
-/* { dg-final { scan-tree-dump-times "a" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "a" 1 "optimized" { xfail lp64 } } } */
 /* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
index 6c1e530..a538c32 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -19,5 +19,5 @@ f(int k)
   }
 }
 
-/* { dg-final { scan-tree-dump-times "a" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "a" 1 "optimized" { xfail lp64 } } } */
 /* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-5.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-5.c
new file mode 100644
index 000..b9de36a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-5.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+  long long i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i];
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
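Stripped of the dejagnu directives, the scev-5.c kernel can be exercised directly. This standalone harness is my addition and only confirms the loop's semantics: every k-th element starting at a[k] is set to 100, and a_p ends up pointing at the last element written. The point of the test itself is that the induction variable is long long while the step and bound are int, which exercises the sizetype / POINTER_PLUS_EXPR precision issue the XFAILs refer to:

```c
int *a_p;
int a[1000];

/* Kernel of scev-5.c: a long long IV with an int step and bound. */
void f(int k)
{
    long long i;

    for (i = k; i < 1000; i += k) {
        a_p = &a[i];     /* address of the i-th element, a chrec in i */
        *a_p = 100;
    }
}
```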
RE: [PATCH] Improve SCEV for array element
-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, March 06, 2012 9:12 PM
To: Jiangning Liu
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Improve SCEV for array element

On Fri, Jan 20, 2012 at 10:06 AM, Jiangning Liu <jiangning@arm.com> wrote:

It's definitely not ok at this stage but at most for next stage1.

OK. I may wait until next stage1.

This is a very narrow pattern-match. It doesn't allow for a[i].x for example, even if a[i] is a one-element structure. I think the canonical way of handling ADDR_EXPR is to use sth like

    base = get_inner_reference (TREE_OPERAND (rhs1, 0), ..., &offset, ...);
    base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
    chrec1 = analyze_scalar_evolution (loop, base);
    chrec2 = analyze_scalar_evolution (loop, offset);
    chrec1 = chrec_convert (type, chrec1, at_stmt);
    chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
    res = chrec_fold_plus (type, chrec1, chrec2);

where you probably need to handle scev_not_known when analyzing offset (which might be NULL). You also need to add bitpos to the base address (in bytes, of course). Note that the MEM_REF case would naturally work with this as well.

OK. New patch is like below, and bootstrapped on x86-32.

You want, instead of

    +  if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF
    +      || TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
    +      || TREE_CODE (TREE_OPERAND (rhs1, 0)) == COMPONENT_REF)
    +    {

    if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
        || handled_component_p (TREE_OPERAND (rhs1, 0)))
      {

For

    +      base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
    +      chrec1 = analyze_scalar_evolution (loop, base);

can you please add a wrapper

    tree
    analyze_scalar_evolution_for_address_of (struct loop *loop, tree var)
    {
      return analyze_scalar_evolution (loop, build_fold_addr_expr (var));
    }

and call that instead of building the ADDR_EXPR there? We want to avoid building that tree node, but even such a simple wrapper would be preferred.
For

    +  if (bitpos)

if (bitpos != 0)

    +    chrec3 = build_int_cst (integer_type_node,
    +                            bitpos / BITS_PER_UNIT);

please use size_int (bitpos / BITS_PER_UNIT) instead. Using integer_type_node is definitely wrong.

Ok with those changes.

Richard,

Modified as you suggested, and the new code is like below. Bootstrapped on x86-32. OK for trunk now?

Thanks,
-Jiangning

ChangeLog:

2012-03-08  Jiangning Liu  <jiangning@arm.com>

	* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
	array reference and component reference.
	(analyze_scalar_evolution_for_address_of): New.

ChangeLog for testsuite:

2012-03-08  Jiangning Liu  <jiangning@arm.com>

	* gcc.dg/tree-ssa/scev-3.c: New.
	* gcc.dg/tree-ssa/scev-4.c: New.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i];
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
new file mode 100644
index 000..6c1e530
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef struct {
+  int x;
+  int y;
+} S;
+
+int *a_p;
+S a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i].y;
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..c719984
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -266,6 +266,8 @@ along with GCC; see the file COPYING3.
 If not see
 #include "params.h"
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree, tree);
+static tree analyze_scalar_evolution_for_address_of (struct loop *loop,
+						     tree var);
 
 /* The cached information about an SSA name VAR, claiming that below
    basic block INSTANTIATED_BELOW, the value of VAR can be expressed
@@ -1712,16 +1714,59 @@ interpret_rhs_expr (struct loop *loop, gimple at_stmt,
   switch (code)
     {
     case ADDR_EXPR:
-      /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR
RE: A question about redundant PHI expression stmt
-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Tuesday, February 28, 2012 11:19 AM
To: 'William J. Schmidt'
Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
Subject: RE: A question about redundant PHI expression stmt

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of William J. Schmidt
Sent: Monday, February 27, 2012 11:32 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
Subject: Re: A question about redundant PHI expression stmt

On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:

Hi,

For the small case below, some redundant PHI expression stmts are generated, which finally cause the back-end to generate redundant copy instructions due to some reasons around IRA.

    int *l, *r, *g;

    void test_func(int n)
    {
        int i;
        static int j;
        static int pos, direction, direction_pre;

        pos = 0;
        direction = 1;
        for ( i = 0; i < n; i++ ) {
            direction_pre = direction;
            for ( j = 0; j <= 400; j++ ) {
                if ( g[pos] == 200 )
                    break;
                if ( direction == 0 )
                    pos = l[pos];
                else
                    pos = r[pos];
                if ( pos == -1 ) {
                    if ( direction_pre != direction )
                        break;
                    pos = 0;
                    direction = !direction;
                }
            }
            f(g[pos]);
        }
    }

I know PR39976 has something to do with this case, and check-in r182140 caused a big degradation on the real benchmark, but I'm still confused about this issue. The procedure is like this. Loop store motion generates the code below,

    <bb 6>:
      D.4084_17 = l.4_13 + D.4077_70;
      pos.5_18 = *D.4084_17;
      pos_lsm.34_103 = pos.5_18;
      goto <bb 8>;

    <bb 7>:
      D.4088_23 = r.6_19 + D.4077_70;
      pos.7_24 = *D.4088_23;
      pos_lsm.34_104 = pos.7_24;

    <bb 8>:
      # prephitmp.29_89 = PHI <pos.5_18(6), pos.7_24(7)>
      # pos_lsm.34_53 = PHI <pos_lsm.34_103(6), pos_lsm.34_104(7)>
      pos.2_25 = prephitmp.29_89;
      if (pos.2_25 == -1)
        goto <bb 9>;
      else
        goto <bb 20>;

And then, copy propagation transforms block 8 into

    <bb 8>:
      # prephitmp.29_89 = PHI <pos.5_18(11), pos.7_24(12)>
      # pos_lsm.34_53 = PHI <pos.5_18(11), pos.7_24(12)>
      ...
And then, these two duplicated PHI stmts are kept until the expand pass. In dom2, a stmt like

    # pos_lsm.34_54 = PHI <pos_lsm.34_53(13), 0(16)>

is transformed into

    # pos_lsm.34_54 = PHI <prephitmp.29_89(13), 0(16)>

just because they are equal. Unfortunately, this transformation changed back-end behavior to generate redundant copy instructions and hurt performance. This case is from a real benchmark and hurts performance a lot. Does the fix in r182140 intend to totally clean up this kind of redundancy? Where should we get it fixed?

Hi, sorry not to have responded sooner -- I just now got some time to look at this. I compiled your code with -O3 for revisions 182139 and 182140 of GCC trunk, and found no difference between the code produced by the middle end for the two versions. So something else has apparently come along since then that helped produce the problematic code generation for you. Either that or I need to use different compile flags; you didn't specify what you used.

The fix in r182140 does just what you saw in dom2: it identifies duplicate PHIs in the same block and ensures only one of them is used. This actually avoids inserting extra blocks during expand in certain loop cases. I am not sure why you are getting redundant copies as a result, but it sounds from your comments like IRA didn't coalesce a register copy or something like that. You may want to bisect revisions on the trunk to see where your bad code generation started to occur to get a better handle on what happened.

As Richard said, the dom pass is likely to be removed someday, whenever someone can get around to it. My redundant-phi band-aid in dom would go away then as well, but presumably an extra pass of PRE would replace it, and redundant PHIs would still be removed.

Bill,

Thanks a lot for your explanation! The bug can be exposed with -O2 if you apply the check-in r184435 made by Richard. BTW, at present is there anybody working on the NEW pass replacing dom?
If no, in the short term it would be good to still get it fixed, just because it is hurting the benchmark. If yes, may I know what that NEW pass will focus on? With dump enabled, I do see a PHI copy
RE: A question about redundant PHI expression stmt
-----Original Message-----
From: Jiangning Liu
Sent: Tuesday, February 28, 2012 4:07 PM
To: Jiangning Liu; 'William J. Schmidt'
Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
Subject: RE: A question about redundant PHI expression stmt

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Tuesday, February 28, 2012 11:19 AM
To: 'William J. Schmidt'
Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
Subject: RE: A question about redundant PHI expression stmt

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of William J. Schmidt
Sent: Monday, February 27, 2012 11:32 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
Subject: Re: A question about redundant PHI expression stmt

On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:

Hi,

For the small case below, some redundant PHI expression stmts are generated, which finally cause the back-end to generate redundant copy instructions due to some reasons around IRA.

    int *l, *r, *g;

    void test_func(int n)
    {
        int i;
        static int j;
        static int pos, direction, direction_pre;

        pos = 0;
        direction = 1;
        for ( i = 0; i < n; i++ ) {
            direction_pre = direction;
            for ( j = 0; j <= 400; j++ ) {
                if ( g[pos] == 200 )
                    break;
                if ( direction == 0 )
                    pos = l[pos];
                else
                    pos = r[pos];
                if ( pos == -1 ) {
                    if ( direction_pre != direction )
                        break;
                    pos = 0;
                    direction = !direction;
                }
            }
            f(g[pos]);
        }
    }

I know PR39976 has something to do with this case, and check-in r182140 caused a big degradation on the real benchmark, but I'm still confused about this issue. The procedure is like this. Loop store motion generates the code below,

    <bb 6>:
      D.4084_17 = l.4_13 + D.4077_70;
      pos.5_18 = *D.4084_17;
      pos_lsm.34_103 = pos.5_18;
      goto <bb 8>;

    <bb 7>:
      D.4088_23 = r.6_19 + D.4077_70;
      pos.7_24 = *D.4088_23;
      pos_lsm.34_104 = pos.7_24;

    <bb 8>:
      # prephitmp.29_89 = PHI <pos.5_18(6), pos.7_24(7)>
      # pos_lsm.34_53 = PHI <pos_lsm.34_103(6), pos_lsm.34_104(7)>
      pos.2_25 = prephitmp.29_89;
      if (pos.2_25 == -1)
        goto <bb 9>;
      else
        goto <bb 20>;

And then, copy propagation transforms block 8 into

    <bb 8>:
      # prephitmp.29_89 = PHI <pos.5_18(11), pos.7_24(12)>
      # pos_lsm.34_53 = PHI <pos.5_18(11), pos.7_24(12)>
      ...

And then, these two duplicated PHI stmts are kept until the expand pass. In dom2, a stmt like

    # pos_lsm.34_54 = PHI <pos_lsm.34_53(13), 0(16)>

is transformed into

    # pos_lsm.34_54 = PHI <prephitmp.29_89(13), 0(16)>

just because they are equal. Unfortunately, this transformation changed back-end behavior to generate redundant copy instructions and hurt performance. This case is from a real benchmark and hurts performance a lot. Does the fix in r182140 intend to totally clean up this kind of redundancy? Where should we get it fixed?

Hi, sorry not to have responded sooner -- I just now got some time to look at this. I compiled your code with -O3 for revisions 182139 and 182140 of GCC trunk, and found no difference between the code produced by the middle end for the two versions. So something else has apparently come along since then that helped produce the problematic code generation for you. Either that or I need to use different compile flags; you didn't specify what you used.

The fix in r182140 does just what you saw in dom2: it identifies duplicate PHIs in the same block and ensures only one of them is used. This actually avoids inserting extra blocks during expand in certain loop cases. I am not sure why you are getting redundant copies as a result, but it sounds from your comments like IRA didn't coalesce a register copy or something like that.
You may want to bisect revisions on the trunk to see where your bad code generation started to occur to get a better handle on what happened. As Richard said, the dom pass is likely to be removed someday, whenever someone can get around to it. My redundant-phi band-aid
RE: A question about redundant PHI expression stmt
-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of William J. Schmidt
Sent: Monday, February 27, 2012 11:32 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org; wschm...@gcc.gnu.org
Subject: Re: A question about redundant PHI expression stmt

On Fri, 2012-02-24 at 16:07 +0800, Jiangning Liu wrote:

Hi,

For the small case below, some redundant PHI expression stmts are generated, which finally cause the back-end to generate redundant copy instructions due to some reasons around IRA.

    int *l, *r, *g;

    void test_func(int n)
    {
        int i;
        static int j;
        static int pos, direction, direction_pre;

        pos = 0;
        direction = 1;
        for ( i = 0; i < n; i++ ) {
            direction_pre = direction;
            for ( j = 0; j <= 400; j++ ) {
                if ( g[pos] == 200 )
                    break;
                if ( direction == 0 )
                    pos = l[pos];
                else
                    pos = r[pos];
                if ( pos == -1 ) {
                    if ( direction_pre != direction )
                        break;
                    pos = 0;
                    direction = !direction;
                }
            }
            f(g[pos]);
        }
    }

I know PR39976 has something to do with this case, and check-in r182140 caused a big degradation on the real benchmark, but I'm still confused about this issue. The procedure is like this. Loop store motion generates the code below,

    <bb 6>:
      D.4084_17 = l.4_13 + D.4077_70;
      pos.5_18 = *D.4084_17;
      pos_lsm.34_103 = pos.5_18;
      goto <bb 8>;

    <bb 7>:
      D.4088_23 = r.6_19 + D.4077_70;
      pos.7_24 = *D.4088_23;
      pos_lsm.34_104 = pos.7_24;

    <bb 8>:
      # prephitmp.29_89 = PHI <pos.5_18(6), pos.7_24(7)>
      # pos_lsm.34_53 = PHI <pos_lsm.34_103(6), pos_lsm.34_104(7)>
      pos.2_25 = prephitmp.29_89;
      if (pos.2_25 == -1)
        goto <bb 9>;
      else
        goto <bb 20>;

And then, copy propagation transforms block 8 into

    <bb 8>:
      # prephitmp.29_89 = PHI <pos.5_18(11), pos.7_24(12)>
      # pos_lsm.34_53 = PHI <pos.5_18(11), pos.7_24(12)>
      ...

And then, these two duplicated PHI stmts are kept until the expand pass. In dom2, a stmt like

    # pos_lsm.34_54 = PHI <pos_lsm.34_53(13), 0(16)>

is transformed into

    # pos_lsm.34_54 = PHI <prephitmp.29_89(13), 0(16)>

just because they are equal. Unfortunately, this transformation changed back-end behavior to generate redundant copy instructions and hurt performance. This case is from a real benchmark and hurts performance a lot. Does the fix in r182140 intend to totally clean up this kind of redundancy? Where should we get it fixed?

Hi, sorry not to have responded sooner -- I just now got some time to look at this. I compiled your code with -O3 for revisions 182139 and 182140 of GCC trunk, and found no difference between the code produced by the middle end for the two versions. So something else has apparently come along since then that helped produce the problematic code generation for you. Either that or I need to use different compile flags; you didn't specify what you used.

The fix in r182140 does just what you saw in dom2: it identifies duplicate PHIs in the same block and ensures only one of them is used. This actually avoids inserting extra blocks during expand in certain loop cases. I am not sure why you are getting redundant copies as a result, but it sounds from your comments like IRA didn't coalesce a register copy or something like that. You may want to bisect revisions on the trunk to see where your bad code generation started to occur to get a better handle on what happened.

As Richard said, the dom pass is likely to be removed someday, whenever someone can get around to it. My redundant-phi band-aid in dom would go away then as well, but presumably an extra pass of PRE would replace it, and redundant PHIs would still be removed.

Bill,

Thanks a lot for your explanation! The bug can be exposed with -O2 if you apply the check-in r184435 made by Richard. BTW, at present is there anybody working on the NEW pass replacing dom? If no, in the short term it would be good to still get it fixed, just because it is hurting the benchmark. If yes, may I know what that NEW pass will focus on?

Thanks,
-Jiangning

Thanks,
Bill

Thanks,
-Jiangning
A question about redundant PHI expression stmt
Hi,

For the small case below, some redundant PHI expression stmts are generated, and they finally cause the back-end to generate redundant copy instructions due to some reasons around IRA.

int *l, *r, *g;
void test_func(int n)
{
    int i;
    static int j;
    static int pos, direction, direction_pre;
    pos = 0;
    direction = 1;
    for ( i = 0; i < n; i++ )
    {
        direction_pre = direction;
        for ( j = 0; j <= 400; j++ )
        {
            if ( g[pos] == 200 )
                break;
            if ( direction == 0 )
                pos = l[pos];
            else
                pos = r[pos];
            if ( pos == -1 )
            {
                if ( direction_pre != direction )
                    break;
                pos = 0;
                direction = !direction;
            }
        }
        f(g[pos]);
    }
}

I know PR39976 has something to do with this case, and check-in r182140 caused a big degradation on the real benchmark, but I'm still confused about this issue. The procedure is like this. Loop store motion generates the code below,

bb 6:
  D.4084_17 = l.4_13 + D.4077_70;
  pos.5_18 = *D.4084_17;
  pos_lsm.34_103 = pos.5_18;
  goto bb 8;

bb 7:
  D.4088_23 = r.6_19 + D.4077_70;
  pos.7_24 = *D.4088_23;
  pos_lsm.34_104 = pos.7_24;

bb 8:
  # prephitmp.29_89 = PHI <pos.5_18(6), pos.7_24(7)>
  # pos_lsm.34_53 = PHI <pos_lsm.34_103(6), pos_lsm.34_104(7)>
  pos.2_25 = prephitmp.29_89;
  if (pos.2_25 == -1)
    goto bb 9;
  else
    goto bb 20;

And then copy propagation transforms block 8 into

bb 8:
  # prephitmp.29_89 = PHI <pos.5_18(11), pos.7_24(12)>
  # pos_lsm.34_53 = PHI <pos.5_18(11), pos.7_24(12)>
  ...

These two duplicated PHI stmts are then kept until the expand pass. In dom2, a stmt like

  # pos_lsm.34_54 = PHI <pos_lsm.34_53(13), 0(16)>

is transformed into

  # pos_lsm.34_54 = PHI <prephitmp.29_89(13), 0(16)>

just because they are equal.

Unfortunately, this transformation changed back-end behavior to generate redundant copy instructions and hurt performance. This case is from a real benchmark and hurts performance a lot. Does the fix in r182140 intend to totally clean up this kind of redundancy? Where should we get it fixed?

Thanks,
-Jiangning
RE: A problem about loop store motion
The MEM form is more canonical, so the loop SM machinery to detect equality should be adjusted accordingly. Alternatively you can teach PRE insertion to strip off the MEM if possible (though fold_stmt_inplace should already do this if possible).

Richard,

Thank you! You are right. I noticed that on the latest trunk the problem in PRE was already fixed by invoking fold_stmt_inplace. Unfortunately, for this small case the latest trunk still can't do SM for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true, that is, pos has an alias with l[pos]. I think alias analysis should be able to know they don't alias each other, unless there is an assignment statement like l = &pos;. Can alias analysis fix the problem?

Thanks,
-Jiangning

Richard.
RE: A problem about loop store motion
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, February 21, 2012 5:40 PM To: Jiangning Liu Cc: gcc@gcc.gnu.org Subject: Re: A problem about loop store motion

On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu jiangning@arm.com wrote:

The MEM form is more canonical, so the loop SM machinery to detect equality should be adjusted accordingly. Alternatively you can teach PRE insertion to strip off the MEM if possible (though fold_stmt_inplace should already do this if possible).

Richard,

Thank you! You are right. I noticed that on the latest trunk the problem in PRE was already fixed by invoking fold_stmt_inplace. Unfortunately, for this small case the latest trunk still can't do SM for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true, that is, pos has an alias with l[pos]. I think alias analysis should be able to know they don't alias each other, unless there is an assignment statement like l = &pos;. Can alias analysis fix the problem?

The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think its address got taken. 'l' points to any global possibly address-taken variable. It gets the TREE_ADDRESSABLE flag via PRE insertion, which re-gimplifies the tree it creates, which marks the variable as addressable. I'll have a look.

Terrific! :-) Could you please let me know when you have a patch to fix this, because I want to double check the big case, and there might be other hidden problems?

Thanks,
-Jiangning

Richard.

Thanks, -Jiangning

Richard.
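The conservatism Richard describes is not a bug in itself: once a global's address may have escaped, a store through an arbitrary pointer really can change it. A minimal runnable sketch of why the alias oracle must answer "may alias" for (*l, pos) once &pos escapes (the names here mirror the test case but the escape function is illustrative, not from the mail):

```c
#include <assert.h>

static int pos;
int *l;

/* Illustration only, not GCC internals: once &pos escapes, a store
   through l can legally modify pos, so an oracle like refs_may_alias_p
   must say "true" for (*l, pos) unless points-to analysis proves
   l never points to pos.  */
static void escape (void) { l = &pos; }

int demo (void)
{
  pos = 1;
  escape ();
  *l = 42;       /* this store aliases pos */
  return pos;    /* a correct compiler must reload: 42, not 1 */
}
```

The mail's complaint is precisely that `pos` was only *spuriously* marked addressable by PRE's re-gimplification, so this conservatism kicked in without any real escape.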
RE: A problem about loop store motion
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Guenther Sent: Tuesday, February 21, 2012 6:19 PM To: Jiangning Liu Cc: gcc@gcc.gnu.org Subject: Re: A problem about loop store motion

On Tue, Feb 21, 2012 at 10:57 AM, Jiangning Liu jiangning@arm.com wrote:

-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, February 21, 2012 5:40 PM To: Jiangning Liu Cc: gcc@gcc.gnu.org Subject: Re: A problem about loop store motion

On Tue, Feb 21, 2012 at 9:54 AM, Jiangning Liu jiangning@arm.com wrote:

The MEM form is more canonical, so the loop SM machinery to detect equality should be adjusted accordingly. Alternatively you can teach PRE insertion to strip off the MEM if possible (though fold_stmt_inplace should already do this if possible).

Richard,

Thank you! You are right. I noticed that on the latest trunk the problem in PRE was already fixed by invoking fold_stmt_inplace. Unfortunately, for this small case the latest trunk still can't do SM for variable pos, because refs_may_alias_p(*D.4074_10, pos) is true, that is, pos has an alias with l[pos]. I think alias analysis should be able to know they don't alias each other, unless there is an assignment statement like l = &pos;. Can alias analysis fix the problem?

The problem is that 'pos' is marked TREE_ADDRESSABLE, so we think its address got taken. 'l' points to any global possibly address-taken variable. It gets the TREE_ADDRESSABLE flag via PRE insertion, which re-gimplifies the tree it creates, which marks the variable as addressable. I'll have a look.

Terrific! :-) Could you please let me know when you have a patch to fix this, because I want to double check the big case, and there might be other hidden problems?

PR52324, I am testing the following patch (in reality the gimplifier should not mark things addressable unless it itself makes them so, but front-ends are broken, so we do that ... ugh).
Richard,

Now trunk works for the big case as well. Thanks a lot for your quick fix. BTW, why can't we fix the front-end directly?

Thanks,
-Jiangning

Index: gcc/gimplify.c
===
--- gcc/gimplify.c (revision 184428)
+++ gcc/gimplify.c (working copy)
@@ -7061,15 +7061,20 @@ gimplify_expr (tree *expr_p, gimple_seq
 	  ret = GS_OK;
 	  break;
 	}
-      ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p, post_p,
-			   is_gimple_mem_ref_addr, fb_rvalue);
-      if (ret == GS_ERROR)
-	break;
+      /* Avoid re-gimplifying the address operand if it is already
+	 in suitable form.  */
+      if (!is_gimple_mem_ref_addr (TREE_OPERAND (*expr_p, 0)))
+	{
+	  ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p, post_p,
+			       is_gimple_mem_ref_addr, fb_rvalue);
+	  if (ret == GS_ERROR)
+	    break;
+	}
       recalculate_side_effects (*expr_p);
       ret = GS_ALL_DONE;
       break;

-      /* Constants need not be gimplified.  */
+      /* Constants need not be gimplified.  */
     case INTEGER_CST:
     case REAL_CST:
     case FIXED_CST:
RE: A problem about loop store motion
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu Sent: Friday, February 17, 2012 5:53 PM To: gcc@gcc.gnu.org Subject: A problem about loop store motion

Hi,

For this small test case,

int *l, *r;
int test_func(void)
{
    int i;
    int direction;
    static int pos;
    pos = 0;
    direction = 1;
    for ( i = 0; i <= 400; i++ )
    {
        if ( direction == 0 )
            pos = l[pos];
        else
            pos = r[pos];
        if ( pos == -1 )
        {
            pos = 0;
            direction = !direction;
        }
    }
    return i;
}

in the middle end I don't see pos being sunk out of the loop by loop store motion. Any idea? The dump after lim is like below, and I expect an SSA symbol xxx_lsm to be created by this pass.

;; Function test_func (test_func, funcdef_no=0, decl_uid=4057, cgraph_uid=0)

Symbols to be put in SSA form
{ .MEM }
Incremental SSA update started at block: 0
Number of blocks in CFG: 12
Number of blocks to update: 11 ( 92%)

test_func ()
{
  int pretmp.14;
  unsigned int pretmp.13;
  int prephitmp.12;
  int pretmp.11;
  unsigned int pretmp.10;
  int pretmp.9;
  int D.4088;
  static int pos;
  int direction;
  int i;
  _Bool D.4082;
  int pos.5;
  int * D.4078;
  int * r.4;
  int pos.3;
  int * D.4074;
  unsigned int D.4073;
  unsigned int pos.2;
  int pos.1;
  int * l.0;

bb 2:
  pos = 0;
  l.0_6 = l;
  r.4_12 = r;

bb 3:
  # i_32 = PHI <i_21(11), 0(2)>
  # direction_37 = PHI <direction_2(11), 1(2)>
  # prephitmp.12_35 = PHI <pretmp.11_1(11), 0(2)>
  if (direction_37 == 0)
    goto bb 4;
  else
    goto bb 5;

bb 4:
  pos.1_7 = prephitmp.12_35;
  pos.2_8 = (unsigned int) pos.1_7;
  D.4073_9 = pos.2_8 * 4;
  D.4074_10 = l.0_6 + D.4073_9;
  pos.3_11 = *D.4074_10;
  pos = pos.3_11;
  goto bb 6;

bb 5:
  pos.1_13 = prephitmp.12_35;
  pos.2_14 = (unsigned int) pos.1_13;
  D.4073_15 = pos.2_14 * 4;
  D.4078_16 = r.4_12 + D.4073_15;
  pos.5_17 = *D.4078_16;
  pos = pos.5_17;

bb 6:
  # prephitmp.12_31 = PHI <pos.3_11(4), pos.5_17(5)>
  pos.1_18 = prephitmp.12_31;
  if (pos.1_18 == -1)
    goto bb 7;
  else
    goto bb 10;

bb 10:
  goto bb 8;

bb 7:
  pos = 0;
  D.4088_36 = direction_37 ^ 1;
  direction_20 = D.4088_36 & 1;

bb 8:
  # direction_2 = PHI <direction_37(10), direction_20(7)>
  i_21 = i_32 + 1;
  if (i_21 != 401)
    goto bb 11;
  else
    goto bb 9;

bb 11:
  pretmp.11_1 = pos;

Hi,

pos in the RHS seems to be transformed to MEM[&pos] by the code in tree-ssa-sccvn.c like below,

    case VAR_DECL:
      if (DECL_HARD_REGISTER (ref))
	{
	  temp.op0 = ref;
	  break;
	}
      /* Fallthru.  */
    case PARM_DECL:
    case CONST_DECL:
    case RESULT_DECL:
      /* Canonicalize decls to MEM[&decl] which is what we end up with
	 when valueizing MEM[ptr] with ptr = &decl.  */
      temp.opcode = MEM_REF;
      temp.op0 = build_int_cst (build_pointer_type (TREE_TYPE (ref)), 0);
      temp.off = 0;
      VEC_safe_push (vn_reference_op_s, heap, *result, &temp);
      temp.opcode = ADDR_EXPR;
      temp.op0 = build_fold_addr_expr (ref);
      temp.type = TREE_TYPE (temp.op0);
      temp.off = -1;
      break;

but the LHS doesn't have this kind of transformation on gimple, so the code below in gather_mem_refs_stmt of tree-ssa-loop-im.c can't find that MEM[&pos] is just pos,

  hash = iterative_hash_expr (*mem, 0);
  slot = htab_find_slot_with_hash (memory_accesses.refs, *mem, hash, INSERT);

Finally, pos is set to be dependent on pretmp.11_1 by the code below in ref_indep_loop_p_1, and then loop store motion fails to handle pos.

  EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
    {
      aref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
      if (!refs_independent_p (ref, aref))
	{
	  ret = false;
	  record_indep_loop (loop, aref, false);
	  break;
	}
    }

Should we fix this problem by enhancing iterative_hash_expr, or should we not use MEM[&pos] to describe pos at all? Any hints? I need to understand why we have to use MEM[&pos] to describe pos. This looks strange to me. BTW, this bug caused a significant regression on an important benchmark.

Thanks,
-Jiangning

  goto bb 3;

bb 9:
  return 401;
}

Thanks,
-Jiangning
A problem about loop store motion
Hi,

For this small test case,

int *l, *r;
int test_func(void)
{
    int i;
    int direction;
    static int pos;
    pos = 0;
    direction = 1;
    for ( i = 0; i <= 400; i++ )
    {
        if ( direction == 0 )
            pos = l[pos];
        else
            pos = r[pos];
        if ( pos == -1 )
        {
            pos = 0;
            direction = !direction;
        }
    }
    return i;
}

in the middle end I don't see pos being sunk out of the loop by loop store motion. Any idea? The dump after lim is like below, and I expect an SSA symbol xxx_lsm to be created by this pass.

;; Function test_func (test_func, funcdef_no=0, decl_uid=4057, cgraph_uid=0)

Symbols to be put in SSA form
{ .MEM }
Incremental SSA update started at block: 0
Number of blocks in CFG: 12
Number of blocks to update: 11 ( 92%)

test_func ()
{
  int pretmp.14;
  unsigned int pretmp.13;
  int prephitmp.12;
  int pretmp.11;
  unsigned int pretmp.10;
  int pretmp.9;
  int D.4088;
  static int pos;
  int direction;
  int i;
  _Bool D.4082;
  int pos.5;
  int * D.4078;
  int * r.4;
  int pos.3;
  int * D.4074;
  unsigned int D.4073;
  unsigned int pos.2;
  int pos.1;
  int * l.0;

bb 2:
  pos = 0;
  l.0_6 = l;
  r.4_12 = r;

bb 3:
  # i_32 = PHI <i_21(11), 0(2)>
  # direction_37 = PHI <direction_2(11), 1(2)>
  # prephitmp.12_35 = PHI <pretmp.11_1(11), 0(2)>
  if (direction_37 == 0)
    goto bb 4;
  else
    goto bb 5;

bb 4:
  pos.1_7 = prephitmp.12_35;
  pos.2_8 = (unsigned int) pos.1_7;
  D.4073_9 = pos.2_8 * 4;
  D.4074_10 = l.0_6 + D.4073_9;
  pos.3_11 = *D.4074_10;
  pos = pos.3_11;
  goto bb 6;

bb 5:
  pos.1_13 = prephitmp.12_35;
  pos.2_14 = (unsigned int) pos.1_13;
  D.4073_15 = pos.2_14 * 4;
  D.4078_16 = r.4_12 + D.4073_15;
  pos.5_17 = *D.4078_16;
  pos = pos.5_17;

bb 6:
  # prephitmp.12_31 = PHI <pos.3_11(4), pos.5_17(5)>
  pos.1_18 = prephitmp.12_31;
  if (pos.1_18 == -1)
    goto bb 7;
  else
    goto bb 10;

bb 10:
  goto bb 8;

bb 7:
  pos = 0;
  D.4088_36 = direction_37 ^ 1;
  direction_20 = D.4088_36 & 1;

bb 8:
  # direction_2 = PHI <direction_37(10), direction_20(7)>
  i_21 = i_32 + 1;
  if (i_21 != 401)
    goto bb 11;
  else
    goto bb 9;

bb 11:
  pretmp.11_1 = pos;
  goto bb 3;

bb 9:
  return 401;
}

Thanks,
-Jiangning
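For readers unfamiliar with store motion, the transformation being asked for can be written out by hand at the C level: keep `pos` in a local for the whole loop and store it back once at the exit. This is a sketch of the expected result, not GCC output; the name `pos_lsm` mimics the `*_lsm` temporaries in the dumps, and the tiny `run_sm_demo` driver is added only so the sketch is runnable.

```c
int *l, *r;
static int pos;

/* Hand-written sketch of the store-motion result the mail asks for:
   pos lives in a local (one register) inside the loop, and the single
   store to the global is sunk below the loop.  */
int test_func_sm (void)
{
  int i, direction = 1;
  int pos_lsm = 0;              /* pos promoted out of memory */

  for (i = 0; i <= 400; i++)
    {
      if (direction == 0)
        pos_lsm = l[pos_lsm];
      else
        pos_lsm = r[pos_lsm];
      if (pos_lsm == -1)
        {
          pos_lsm = 0;
          direction = !direction;
        }
    }
  pos = pos_lsm;                /* the one store sunk out of the loop */
  return i;
}

static int buf[4];              /* tiny table so the sketch can run */

int run_sm_demo (void)
{
  l = buf;
  r = buf;                      /* r[0] == 0, so pos_lsm stays 0 */
  return test_func_sm ();
}
```

With `r[0] == 0` the loop index runs its full 401 iterations and the function returns 401 with `pos == 0`, matching what the original code computes while performing only one store to `pos`.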
RE: [PATCH] Improve SCEV for array element
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu Sent: Friday, January 20, 2012 5:07 PM To: 'Richard Guenther' Cc: gcc-patches@gcc.gnu.org Subject: RE: [PATCH] Improve SCEV for array element

It's definitely not ok at this stage but at most for next stage1.

OK. I may wait until next stage1.

This is a very narrow pattern-match. It doesn't allow for &a[i].x for example, even if a[i] is a one-element structure. I think the canonical way of handling ADDR_EXPR is to use sth like

  base = get_inner_reference (TREE_OPERAND (rhs1, 0), ..., &offset, ...);
  base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
  chrec1 = analyze_scalar_evolution (loop, base);
  chrec2 = analyze_scalar_evolution (loop, offset);
  chrec1 = chrec_convert (type, chrec1, at_stmt);
  chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
  res = chrec_fold_plus (type, chrec1, chrec2);

where you probably need to handle scev_not_known when analyzing offset (which might be NULL). You also need to add bitpos to the base address (in bytes, of course). Note that the MEM_REF case would naturally work with this as well.

OK. The new patch is like below, and it bootstrapped on x86-32.

ChangeLog:

2012-01-20  Jiangning Liu  jiangning@arm.com

	* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
	array reference and component reference.

ChangeLog for testsuite:

2012-01-20  Jiangning Liu  jiangning@arm.com

	* gcc.dg/tree-ssa/scev-3.c: New.
	* gcc.dg/tree-ssa/scev-4.c: New.

Richard,

PING... Is this patch OK after the 4.7 branch is created and trunk is open again?
Thanks,
-Jiangning

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i];
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
new file mode 100644
index 000..6c1e530
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef struct {
+  int x;
+  int y;
+} S;
+
+int *a_p;
+S a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i].y;
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..4e06b75
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,16 +1712,61 @@ interpret_rhs_expr (struct loop *loop, gimple at_stmt,
   switch (code)
     {
     case ADDR_EXPR:
-      /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.  */
-      if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
-	{
-	  res = chrec_dont_know;
-	  break;
-	}
+      if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF
+	  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
+	  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == COMPONENT_REF)
+	{
+	  enum machine_mode mode;
+	  HOST_WIDE_INT bitsize, bitpos;
+	  int unsignedp;
+	  int volatilep = 0;
+	  tree base, offset;
+	  tree chrec3;
+
+	  base = get_inner_reference (TREE_OPERAND (rhs1, 0),
+				      &bitsize, &bitpos, &offset,
+				      &mode, &unsignedp, &volatilep, false);
+
+	  if (TREE_CODE (base) == MEM_REF)
+	    {
+	      rhs2 = TREE_OPERAND (base, 1);
+	      rhs1 = TREE_OPERAND (base, 0);
+
+	      chrec1 = analyze_scalar_evolution (loop, rhs1);
+	      chrec2 = analyze_scalar_evolution (loop, rhs2);
+	      chrec1 = chrec_convert (type, chrec1, at_stmt);
+	      chrec2 = chrec_convert (TREE_TYPE (rhs2), chrec2, at_stmt);
+	      res = chrec_fold_plus (type, chrec1, chrec2);
+	    }
+	  else
+	    {
+	      base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
+	      chrec1 = analyze_scalar_evolution (loop, base);
+	      chrec1 = chrec_convert (type, chrec1, at_stmt);
+	      res = chrec1;
+	    }
-
-      rhs2 = TREE_OPERAND (TREE_OPERAND (rhs1, 0), 1);
-      rhs1 = TREE_OPERAND (TREE_OPERAND (rhs1, 0), 0);
-      /* Fall through.  */
+	  if (offset != NULL_TREE
RE: [PATCH] Improve SCEV for array element
It's definitely not ok at this stage but at most for next stage1.

OK. I may wait until next stage1.

This is a very narrow pattern-match. It doesn't allow for &a[i].x for example, even if a[i] is a one-element structure. I think the canonical way of handling ADDR_EXPR is to use sth like

  base = get_inner_reference (TREE_OPERAND (rhs1, 0), ..., &offset, ...);
  base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
  chrec1 = analyze_scalar_evolution (loop, base);
  chrec2 = analyze_scalar_evolution (loop, offset);
  chrec1 = chrec_convert (type, chrec1, at_stmt);
  chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
  res = chrec_fold_plus (type, chrec1, chrec2);

where you probably need to handle scev_not_known when analyzing offset (which might be NULL). You also need to add bitpos to the base address (in bytes, of course). Note that the MEM_REF case would naturally work with this as well.

OK. The new patch is like below, and it bootstrapped on x86-32.

ChangeLog:

2012-01-20  Jiangning Liu  jiangning@arm.com

	* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
	array reference and component reference.

ChangeLog for testsuite:

2012-01-20  Jiangning Liu  jiangning@arm.com

	* gcc.dg/tree-ssa/scev-3.c: New.
	* gcc.dg/tree-ssa/scev-4.c: New.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i];
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
new file mode 100644
index 000..6c1e530
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef struct {
+  int x;
+  int y;
+} S;
+
+int *a_p;
+S a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i].y;
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..4e06b75
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,16 +1712,61 @@ interpret_rhs_expr (struct loop *loop, gimple at_stmt,
   switch (code)
     {
     case ADDR_EXPR:
-      /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.  */
-      if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
-	{
-	  res = chrec_dont_know;
-	  break;
-	}
+      if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF
+	  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == MEM_REF
+	  || TREE_CODE (TREE_OPERAND (rhs1, 0)) == COMPONENT_REF)
+	{
+	  enum machine_mode mode;
+	  HOST_WIDE_INT bitsize, bitpos;
+	  int unsignedp;
+	  int volatilep = 0;
+	  tree base, offset;
+	  tree chrec3;
+
+	  base = get_inner_reference (TREE_OPERAND (rhs1, 0),
+				      &bitsize, &bitpos, &offset,
+				      &mode, &unsignedp, &volatilep, false);
+
+	  if (TREE_CODE (base) == MEM_REF)
+	    {
+	      rhs2 = TREE_OPERAND (base, 1);
+	      rhs1 = TREE_OPERAND (base, 0);
+
+	      chrec1 = analyze_scalar_evolution (loop, rhs1);
+	      chrec2 = analyze_scalar_evolution (loop, rhs2);
+	      chrec1 = chrec_convert (type, chrec1, at_stmt);
+	      chrec2 = chrec_convert (TREE_TYPE (rhs2), chrec2, at_stmt);
+	      res = chrec_fold_plus (type, chrec1, chrec2);
+	    }
+	  else
+	    {
+	      base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), base);
+	      chrec1 = analyze_scalar_evolution (loop, base);
+	      chrec1 = chrec_convert (type, chrec1, at_stmt);
+	      res = chrec1;
+	    }
-
-      rhs2 = TREE_OPERAND (TREE_OPERAND (rhs1, 0), 1);
-      rhs1 = TREE_OPERAND (TREE_OPERAND (rhs1, 0), 0);
-      /* Fall through.  */
+	  if (offset != NULL_TREE)
+	    {
+	      chrec2 = analyze_scalar_evolution (loop, offset);
+	      chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
+	      res = chrec_fold_plus (type, res, chrec2);
+	    }
+
+	  if (bitpos)
+	    {
+	      gcc_assert ((bitpos % BITS_PER_UNIT) == 0);
+
+	      chrec3 = build_int_cst (integer_type_node,
+				      bitpos / BITS_PER_UNIT
[PATCH] Improve SCEV for array element
This code change intends to improve scev for array elements and reduce the common sub-expressions in loops, which may be introduced by multiple references to an expression like &a[i]. With this optimization the register pressure can be reduced in loops. The problem is originally from a real benchmark, and the test case only tries to detect the GIMPLE-level changes.

Bootstrapped on x86-32. OK for trunk?

ChangeLog:

2012-01-05  Jiangning Liu  jiangning@arm.com

	* tree-scalar-evolution.c (interpret_rhs_expr): Generate chrec for
	array reference.

ChangeLog for testsuite:

2012-01-05  Jiangning Liu  jiangning@arm.com

	* gcc.dg/scev-1.c: New.

diff --git a/gcc/testsuite/gcc.dg/scev-1.c b/gcc/testsuite/gcc.dg/scev-1.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/scev-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i];
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..de89fc4
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,6 +1712,42 @@ interpret_rhs_expr (struct loop *loop, gimple at_stmt,
   switch (code)
     {
     case ADDR_EXPR:
+      if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF)
+	{
+	  tree array_ref;
+	  tree var_decl, base, offset;
+	  tree low_bound;
+	  tree unit_size;
+	  tree index;
+
+	  array_ref = TREE_OPERAND (rhs1, 0);
+	  var_decl = TREE_OPERAND (array_ref, 0);
+	  index = TREE_OPERAND (array_ref, 1);
+
+	  low_bound = array_ref_low_bound (array_ref);
+	  unit_size = array_ref_element_size (array_ref);
+
+	  /* We assume all arrays have sizes that are a multiple of a byte.
+	     First subtract the lower bound, if any, in the type of the
+	     index, then convert to sizetype and multiply by the size of
+	     the array element.  */
+	  if (! integer_zerop (low_bound))
+	    index = fold_build2 (MINUS_EXPR, TREE_TYPE (index),
+				 index, low_bound);
+
+	  offset = size_binop (MULT_EXPR,
+			       fold_convert (sizetype, index),
+			       unit_size);
+
+	  base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), var_decl);
+	  chrec1 = analyze_scalar_evolution (loop, base);
+	  chrec2 = analyze_scalar_evolution (loop, offset);
+	  chrec1 = chrec_convert (type, chrec1, at_stmt);
+	  chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
+	  res = chrec_fold_plus (type, chrec1, chrec2);
+	  break;
+	}
+
       /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.  */
       if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
 	{

Thanks,
-Jiangning

scev.patch Description: Binary data
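The GIMPLE-level lowering the patch performs (index minus low bound, converted to sizetype, multiplied by the element size, added to the array's address) corresponds to a plain C address identity, sketched below; the helper names are illustrative only:

```c
#include <stddef.h>

int a[1000];

/* &a[i] computed the normal way.  */
int *addr_by_index (int i)
{
  return &a[i];
}

/* The same address in the base-plus-offset form the patch builds as a
   chrec: base address of the array plus i * sizeof (element) bytes.  */
int *addr_by_offset (int i)
{
  return (int *) ((char *) a + (size_t) i * sizeof a[0]);
}
```

Because both forms denote the same address, SCEV can describe `&a[i]` as the chrec of the base plus the chrec of the byte offset, which is what lets later passes (final value replacement, ivopts) reason about it.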
RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu Sent: Wednesday, December 21, 2011 2:48 PM To: 'Richard Henderson' Cc: gcc-patches@gcc.gnu.org; 'Richard Guenther' Subject: RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)

-Original Message- From: Richard Henderson [mailto:r...@redhat.com] Sent: Tuesday, November 22, 2011 9:55 AM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org; 'Richard Guenther' Subject: Re: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)

On 11/21/2011 05:31 PM, Jiangning Liu wrote:

My question essentially is: may I really use REG_EXPR in back-end code, like the patch I gave below?

I suppose. Another alternative is to use BImode for booleans. Dunno how much of that you'd be able to glean from mere rtl expansion or if you'd need boolean decls to be expanded with BImode.

Hi Richard,

The output of the expand pass requires that the operands of binary operations satisfy general_operand, i.e. all RTX operations at the top level must match a hardware instruction. Generally, for hardware that does not directly support a single-bit logic operation, we can't generate a BImode rtx dest. So if I simply generate BImode for NE in the expand pass, like

  (subreg:SI (ne:BI (reg:SI xxx) (const_int 0)) 0)

there would be an assertion failure due to failing to find an appropriate instruction code to operate on a single bit. This looks like a GCC design-level issue. How would you suggest generating a legitimate rtx expression containing BImode? Can anybody give me useful suggestions on this issue? My question essentially is: may I really use BImode to solve my original problem?

Thanks,
-Jiangning

The latter would probably need a target hook. I've often wondered how much ia64 would benefit from that too, being able to store bool variables directly in predicate registers.
All of this almost certainly must wait until stage1 opens up again though... r~
RE: A case exposing code sink issue
In final value replacement, the expression &a + D. can be figured out, while &a[i_xxx] fails to be CHRECed, so I'm wondering if we should lower &a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC intends to keep a[i_xxx] until the cfgexpand pass.

Richard,

Thanks a lot for your explanation! Any comments on this question as well? Does GCC intend to keep a[i] until the expand pass? Any special reason? If I change the CHREC algorithm, I see ivopts would have done this lowering, so we wouldn't see a[i] in the expand pass. Any hurt?

Thanks,
-Jiangning
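For readers unfamiliar with the pass being discussed: final value replacement rewrites the value a scalar holds on loop exit into a closed-form expression, so the in-loop assignment to a global can become one assignment after the loop. A hand-written C sketch of what that means for the i++ flavor of the mail's example (the one the mail says already works); the function names are illustrative:

```c
int a[1000];
int *a_p;

/* The loop as written: a_p is reassigned every iteration.  */
void loop_form (int k)
{
  for (int i = k; i <= 999; i++)
    a_p = &a[i];
}

/* What final value replacement amounts to here: the last value a_p
   receives is &a[999], so one assignment at the exit suffices.  Valid
   only when k <= 999, i.e. when the loop body runs at least once.  */
void replaced_form (int k)
{
  (void) k;
  a_p = &a[999];
}
```

The harder i+=k case the mail struggles with is exactly the one where the final value is no longer a loop-invariant expression like `&a[999]`, which is why the closed form cannot be computed by `compute_overall_effect_of_inner_loop`.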
RE: A case exposing code sink issue
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Thursday, January 05, 2012 5:32 PM To: Jiangning Liu Cc: Michael Matz; gcc@gcc.gnu.org Subject: Re: A case exposing code sink issue

On Thu, Jan 5, 2012 at 9:34 AM, Jiangning Liu jiangning@arm.com wrote:

In final value replacement, the expression &a + D. can be figured out, while &a[i_xxx] fails to be CHRECed, so I'm wondering if we should lower &a[i_xxx] to &a + unitsize(a) * i_xxx first? It seems GCC intends to keep a[i_xxx] until the cfgexpand pass.

Richard,

Thanks a lot for your explanation! Any comments on this question as well? Does GCC intend to keep a[i] until the expand pass?

It's kept as it suits - it's shorter and easier to analyze (we know the implicit multiplication won't overflow).

Any special reason? If I change the CHREC algorithm, I see ivopts would have done this lowering, so we wouldn't see a[i] in the expand pass. Any hurt?

No, I think changing the CHREC algorithm is ok (I didn't look closely at your patch). This is stage1 material though.

Actually I didn't send it out at all. :-) And I just sent out an RFC here, http://gcc.gnu.org/ml/gcc-patches/2012-01/msg00236.html, can you have a look?

Thanks,
Richard.

Thanks,
-Jiangning
[RFC][patch] improve scev for array element
This code change intends to improve scev for array elements and reduce the common sub-expressions in loops, which may be introduced by multiple references to the expression &a[i]. In the end the register pressure may be reduced in the loop. The test case is simplified from a real benchmark, and only wants to show the GIMPLE-level changes.

Any suggestions?

diff --git a/gcc/testsuite/gcc.dg/scev-1.c b/gcc/testsuite/gcc.dg/scev-1.c
new file mode 100644
index 000..28d5c93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/scev-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int *a_p;
+int a[1000];
+
+f(int k)
+{
+  int i;
+
+  for (i=k; i<1000; i+=k) {
+    a_p = &a[i];
+    *a_p = 100;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "&a" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 2077c8d..de89fc4
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1712,6 +1712,42 @@ interpret_rhs_expr (struct loop *loop, gimple at_stmt,
   switch (code)
     {
     case ADDR_EXPR:
+      if (TREE_CODE (TREE_OPERAND (rhs1, 0)) == ARRAY_REF)
+	{
+	  tree array_ref;
+	  tree var_decl, base, offset;
+	  tree low_bound;
+	  tree unit_size;
+	  tree index;
+
+	  array_ref = TREE_OPERAND (rhs1, 0);
+	  var_decl = TREE_OPERAND (array_ref, 0);
+	  index = TREE_OPERAND (array_ref, 1);
+
+	  low_bound = array_ref_low_bound (array_ref);
+	  unit_size = array_ref_element_size (array_ref);
+
+	  /* We assume all arrays have sizes that are a multiple of a byte.
+	     First subtract the lower bound, if any, in the type of the
+	     index, then convert to sizetype and multiply by the size of
+	     the array element.  */
+	  if (! integer_zerop (low_bound))
+	    index = fold_build2 (MINUS_EXPR, TREE_TYPE (index),
+				 index, low_bound);
+
+	  offset = size_binop (MULT_EXPR,
+			       fold_convert (sizetype, index),
+			       unit_size);
+
+	  base = build1 (ADDR_EXPR, TREE_TYPE (rhs1), var_decl);
+	  chrec1 = analyze_scalar_evolution (loop, base);
+	  chrec2 = analyze_scalar_evolution (loop, offset);
+	  chrec1 = chrec_convert (type, chrec1, at_stmt);
+	  chrec2 = chrec_convert (TREE_TYPE (offset), chrec2, at_stmt);
+	  res = chrec_fold_plus (type, chrec1, chrec2);
+	  break;
+	}
+
       /* Handle &MEM[ptr + CST] which is equivalent to POINTER_PLUS_EXPR.  */
       if (TREE_CODE (TREE_OPERAND (rhs1, 0)) != MEM_REF)
 	{

Another bigger case may be like below, and lower register pressure could be observed.

int a[512];
int b[512];
int *ax;
int *bx;
int *ay;
int *by;
int i;
int j;

int f(int k)
{
    for( j = 0 ; j < k ; j++)
    {
        for( i = j ; i < 512 ; i += k)
        {
            ax = &a[i+k];
            bx = &b[i+k];
            ay = &a[i];
            by = &b[i];
            *ax = 7;
            *bx = 7;
            a[i] = 8;
            b[i] = 8;
        }
    }
}

Thanks,
-Jiangning
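To see what the SCEV lowering buys in the larger example: once `&a[i+k]`, `&b[i+k]`, `&a[i]`, and `&b[i]` are all expressed as base plus byte offset, the four addresses share one induction offset instead of four independent address computations. A hand-lowered sketch of the inner loop under those lines, with two assumptions not in the original mail: a bounds guard is added on the `i+k` stores so the sketch can actually run (the original stores unconditionally and can step past the array), and the outer `j` loop is dropped.

```c
#include <stddef.h>

int a[512], b[512];
int *ax, *bx, *ay, *by;

/* Hand-lowered inner loop: a single byte offset `off` strides through
   both arrays, so all four element addresses become base-plus-offset
   forms sharing one induction variable.  */
void inner (int j, int k)
{
  char *abase = (char *) a, *bbase = (char *) b;
  size_t step  = (size_t) k * sizeof (int);
  size_t ahead = step;                    /* byte offset of the i+k element */
  size_t limit = 512 * sizeof (int);

  for (size_t off = (size_t) j * sizeof (int); off < limit; off += step)
    {
      ax = (int *) (abase + off + ahead);
      bx = (int *) (bbase + off + ahead);
      ay = (int *) (abase + off);
      by = (int *) (bbase + off);
      if (off + ahead < limit)            /* guard added for the sketch */
        {
          *ax = 7;
          *bx = 7;
        }
      *ay = 8;
      *by = 8;
    }
}
```

With j = 1, k = 3 this visits i = 1, 4, ..., 511, leaving 8 in every visited element; the register-pressure point in the mail is that only `off` (plus two bases) needs a register, rather than separate induction variables per address.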
RE: A case exposing code sink issue
-Original Message- From: Jiangning Liu Sent: Wednesday, December 28, 2011 5:38 PM To: Jiangning Liu; 'Richard Guenther' Cc: Michael Matz; gcc@gcc.gnu.org Subject: RE: A case exposing code sink issue -Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu Sent: Tuesday, December 27, 2011 5:10 PM To: 'Richard Guenther' Cc: Michael Matz; gcc@gcc.gnu.org Subject: RE: A case exposing code sink issue The job to do this is final value replacement, not sinking (we do not sink non-invariant expressions - you'd have to translate them through the loop-closed SSA exit PHI node, certainly doable, patches welcome ;)). Richard, In final value replacement, expression a + D. can be figured out, while a[i_xxx] failed to be CHRECed, so I'm wondering if we should lower a[i_xxx] to a + unitsize(a) * i_xxx first? It seems GCC intends to keep a[i_xxx] until cfgexpand pass. Or we have to directly modify CHREC algorithm to get it calculated? Appreciate your kindly help in advance! Richard, Now I have a patch working for the case of step i++, by directly modifying scalar evolution algorithm. the following code would be generated after SCCP, l # i_13 = PHI i_6(7), k_2(D)(4) a_p.0_4 = a[i_13]; MEM[(int *)a][i_13] = 100; i_6 = i_13 + 1; if (i_6 = 999) goto bb 7; else goto bb 6; bb 6: a_p_lsm.5_11 = MEM[(void *)a + 3996B]; a_p = a_p_lsm.5_11; goto bb 3; It looks good, but I still have problem when the case has step i+=k. For this case the value of variable i exiting loop isn't invariant, the algorithm below in scalar evolution doesn't work on it, compute_overall_effect_of_inner_loop() { ... tree nb_iter = number_of_latch_executions (inner_loop); if (nb_iter == chrec_dont_know) return chrec_dont_know; else { tree res; /* evolution_fn is the evolution function in LOOP. Get its value in the nb_iter-th iteration. 
      */
      res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);

      if (chrec_contains_symbols_defined_in_loop (res, loop->num))
        res = instantiate_parameters (loop, res);

      /* Continue the computation until ending on a parent of LOOP.  */
      return compute_overall_effect_of_inner_loop (loop, res);
    }
}

In theory, we can still have the transformation like below even if the step is i+=k,

  # i_13 = PHI <i_6(7), k_2(D)(4)>
  i_14 = i_13;
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + k_2(D);   // i += k
  if (i_6 <= 999)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 6>:
  a_p_lsm.5_11 = &a[i_14];
  a_p = a_p_lsm.5_11;
  goto <bb 3>;

But I realize this is not a loop-closed SSA form at all, because i_14 is being used out of the loop. Where could we extend the live range of variable i in the GCC infrastructure and finally solve this problem?

It seems many people are still in the happiness of the upcoming 2012 New Year. :-)

Following my story, I find the following code in tree-ssa-copy.c,

  /* Avoid copy propagation from an inner into an outer loop.
     Otherwise, this may move loop variant variables outside of
     their loops and prevent coalescing opportunities.  If the
     value was loop invariant, it will be hoisted by LICM and
     exposed for copy propagation.
     ??? The value will be always loop invariant.
     In loop-closed SSA form do not copy-propagate through
     PHI nodes in blocks with a loop exit edge predecessor.  */
  if (current_loops
      && TREE_CODE (arg_value) == SSA_NAME
      && (loop_depth_of_name (arg_value) > loop_depth_of_name (lhs)
          || (loops_state_satisfies_p (LOOP_CLOSED_SSA)
              && loop_exit_edge_p (e->src->loop_father, e))))
    {
      phi_val.value = lhs;
      break;
    }

Here http://gcc.gnu.org/ml/gcc-patches/2006-12/msg00066.html, Dan said

The original check was not because of coalescing, but because we would copy prop in-loop variables outside the loop, causing *more* invariantness in nested loops.

Can anybody give me a concrete example to explain this statement? Anyway, for my simple case, I don't see a bad thing happening when propagating &a[i] out of the loop.
Also, after this propagation, a_p.0_4 = &a[i_13]; within the loop would be dead code and removed in the passes afterwards. In my opinion, the computation within the loop would be reduced that way. Did I make any mistake?

Thanks,
-Jiangning
RE: A case exposing code sink issue
-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Tuesday, December 27, 2011 5:10 PM
To: 'Richard Guenther'
Cc: Michael Matz; gcc@gcc.gnu.org
Subject: RE: A case exposing code sink issue

The job to do this is final value replacement, not sinking (we do not sink non-invariant expressions - you'd have to translate them through the loop-closed SSA exit PHI node, certainly doable, patches welcome ;)).

Richard,

In final value replacement, the expression a + D. can be figured out, while &a[i_xxx] fails to be CHRECed, so I'm wondering if we should lower &a[i_xxx] to a + unitsize(a) * i_xxx first? It seems GCC intends to keep &a[i_xxx] until the cfgexpand pass. Or do we have to directly modify the CHREC algorithm to get it calculated? Appreciate your kind help in advance!

Richard,

Now I have a patch working for the case of step i++, by directly modifying the scalar evolution algorithm. The following code would be generated after SCCP,

  # i_13 = PHI <i_6(7), k_2(D)(4)>
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + 1;
  if (i_6 <= 999)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 6>:
  a_p_lsm.5_11 = MEM[(void *)&a + 3996B];
  a_p = a_p_lsm.5_11;
  goto <bb 3>;

It looks good, but I still have a problem when the case has step i+=k. For this case the value of variable i exiting the loop isn't invariant, and the algorithm below in scalar evolution doesn't work on it,

compute_overall_effect_of_inner_loop ()
{
  ...
  tree nb_iter = number_of_latch_executions (inner_loop);

  if (nb_iter == chrec_dont_know)
    return chrec_dont_know;
  else
    {
      tree res;

      /* evolution_fn is the evolution function in LOOP.  Get its value
         in the nb_iter-th iteration.  */
      res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);

      if (chrec_contains_symbols_defined_in_loop (res, loop->num))
        res = instantiate_parameters (loop, res);

      /* Continue the computation until ending on a parent of LOOP.
      */
      return compute_overall_effect_of_inner_loop (loop, res);
    }
}

In theory, we can still have the transformation like below even if the step is i+=k,

  # i_13 = PHI <i_6(7), k_2(D)(4)>
  i_14 = i_13;
  a_p.0_4 = &a[i_13];
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + k_2(D);   // i += k
  if (i_6 <= 999)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 6>:
  a_p_lsm.5_11 = &a[i_14];
  a_p = a_p_lsm.5_11;
  goto <bb 3>;

But I realize this is not a loop-closed SSA form at all, because i_14 is being used out of the loop. Where could we extend the live range of variable i in the GCC infrastructure and finally solve this problem?

Thanks,
-Jiangning
RE: A case exposing code sink issue
The job to do this is final value replacement, not sinking (we do not sink non-invariant expressions - you'd have to translate them through the loop-closed SSA exit PHI node, certainly doable, patches welcome ;)).

Richard,

In final value replacement, the expression a + D. can be figured out, while &a[i_xxx] fails to be CHRECed, so I'm wondering if we should lower &a[i_xxx] to a + unitsize(a) * i_xxx first? It seems GCC intends to keep &a[i_xxx] until the cfgexpand pass. Or do we have to directly modify the CHREC algorithm to get it calculated? Appreciate your kind help in advance!

Thanks,
-Jiangning
RE: A case exposing code sink issue
Yes, the number of iterations of the i loop simply is too difficult for our loop iteration calculator to comprehend: for (i=k; i<500; i+=k) iterates roundup((500-k)/k) times. In particular, if the step is non-constant our nr-of-iteration calculator gives up.

I'm trying to give an even smaller case,

int a[512];
int *a_p;

int f(int k)
{
    int i;
    for (i = 0; i < k; i++) {
        a_p = &a[i];
        *a_p = 7;
    }
}

For this case, we have a very simple loop step i++, and then we have the GIMPLE before expand like below,

<bb 5>:
  # i_13 = PHI <i_6(5), 0(4)>
  # ivtmp.10_9 = PHI <ivtmp.10_1(5), ivtmp.10_15(4)>
  a_p_lsm.6_4 = &a[i_13];
  ivtmp.10_1 = ivtmp.10_9 + 4;
  D.4085_16 = (void *) ivtmp.10_1;
  MEM[base: D.4085_16, offset: 0B] = 7;
  i_6 = i_13 + 1;
  if (i_6 != k_3(D))
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 6>:
  # a_p_lsm.6_11 = PHI <a_p_lsm.6_4(5)>
  a_p = a_p_lsm.6_11;
  goto <bb 3>;

Why can't we still sink &a[i_13] out of the loop? For example, I expect to generate code like below,

<bb 5>:
  # i_13 = PHI <i_6(5), 0(4)>
  # ivtmp.10_9 = PHI <ivtmp.10_1(5), ivtmp.10_15(4)>
  i_14 = i_13;
  ivtmp.10_1 = ivtmp.10_9 + 4;
  D.4085_16 = (void *) ivtmp.10_1;
  MEM[base: D.4085_16, offset: 0B] = 7;
  i_6 = i_13 + 1;
  if (i_6 != k_3(D))
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 6>:
  # a_p_lsm.6_11 = PHI <a_p_lsm.6_4(5)>
  a_p_lsm.6_4 = &a[i_14];
  a_p = a_p_lsm.6_11;
  goto <bb 3>;

This way the computation of &a[i] would be saved within the loop. Any idea?

Thanks,
-Jiangning
RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)
-----Original Message-----
From: Richard Henderson [mailto:r...@redhat.com]
Sent: Tuesday, November 22, 2011 9:55 AM
To: Jiangning Liu
Cc: gcc-patches@gcc.gnu.org; 'Richard Guenther'
Subject: Re: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)

On 11/21/2011 05:31 PM, Jiangning Liu wrote:

My question essentially is: may I really use REG_EXPR in back-end code, like the patch I gave below?

I suppose. Another alternative is to use BImode for booleans. Dunno how much of that you'd be able to glean from mere rtl expansion or if you'd need boolean decls to be expanded with BImode.

Hi Richard,

The output of the expand pass requires that the operands meet the requirement of general_operand for binary operations, i.e. all RTX operations on the top level must meet the hardware instruction requirement. Generally, for hardware that does not directly support a single-bit logic operation, we can't generate a BImode rtx dest. So if I simply generate BImode for NE in the expand pass, like (subreg:SI (ne:BI (reg:SI xxx) (const_int 0))), there would be an assertion failure due to failing to find an appropriate instruction code to operate on a single bit.

This looks like a GCC design level issue. How would you suggest generating a legitimate rtx expression containing BImode?

Thanks,
-Jiangning

The latter would probably need a target hook. I've often wondered how much ia64 would benefit from that too, being able to store bool variables directly in predicate registers. All of this almost certainly must wait until stage1 opens up again though...

r~
RE: A case exposing code sink issue
-----Original Message-----
From: Michael Matz [mailto:m...@suse.de]
Sent: Monday, November 28, 2011 9:07 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org
Subject: RE: A case exposing code sink issue

Hi,

On Mon, 28 Nov 2011, Jiangning Liu wrote:

One more question... Can i = i.6_18; be sunk out of the loop, because it doesn't have memory dependence with others?

With current trunk the stores to i, a_p, b_p and k are sunken after the loop. (There are no aliasing problems because the decls can't conflict). What isn't sunken is the calculation of the &a[D.2248_7] expression. First, the number of iterations of the inner loop can't be determined by current code (replacing i+=k with e.g. i++ could be handled for instance).

Hi Michael,

Do you know what the essential problem is in the case of loop iteration uncertainty?

Yes, the number of iterations of the i loop simply is too difficult for our loop iteration calculator to comprehend: for (i=k; i<500; i+=k) iterates roundup((500-k)/k) times. In particular, if the step is non-constant our nr-of-iteration calculator gives up.

So do you think this can be improved somewhere? For this case, looking at the result in the middle end, a_p.2_8 = &a[D.2248_7]; should be able to be sunk out of the loop. That way the computation of &a[D.2248_7] would be saved in the loop, although the consequence is that the live range of D.2248_7 becomes longer and it needs to live out of the loop. But anyway, the register pressure would be decreased within the loop, and we would be less likely to have spill/fill code. This is what I want. I think we can simply use loop induction variable analysis to solve this problem. Do you think so?

Thanks,
-Jiangning

I thought it was still an aliasing problem.

No. All accesses are resolved to final objects (i.e. no pointers), and hence can be trivially disambiguated.

Ciao,
Michael.
RE: A case exposing code sink issue
-----Original Message-----
From: Michael Matz [mailto:m...@suse.de]
Sent: Friday, November 25, 2011 11:23 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org
Subject: RE: A case exposing code sink issue

Hi,

On Thu, 24 Nov 2011, Jiangning Liu wrote:

One more question... Can i = i.6_18; be sunk out of the loop, because it doesn't have memory dependence with others?

With current trunk the stores to i, a_p, b_p and k are sunken after the loop. (There are no aliasing problems because the decls can't conflict). What isn't sunken is the calculation of the &a[D.2248_7] expression. First, the number of iterations of the inner loop can't be determined by current code (replacing i+=k with e.g. i++ could be handled for instance).

Hi Michael,

Do you know what the essential problem is in the case of loop iteration uncertainty? I thought it was still an aliasing problem.

Thanks,
-Jiangning

Then this code could be handled by final value replacement, but isn't, because interpret_rhs_expr doesn't deal with ADDR_EXPR of ARRAY_REFs.

Ciao,
Michael.
A case exposing code sink issue
Hi,

For this small test case,

int a[512];
int b[512];
int *a_p;
int *b_p;
int i;
int k;

int f(void)
{
    for (k = 1; k <= 9; k++) {
        for (i = k; i < 512; i += k) {
            a_p = &a[i + (1<<k)];
            b_p = &b[i + (1<<k)];
            *a_p = 7;
            *b_p = 7;
        }
    }
}

Before the sink pass we have,

f ()
{
  int pretmp.11;
  int k.7;
  int i.6;
  int * b_p.3;
  int * a_p.2;
  int D.2248;
  int i.1;
  int D.2246;
  int k.0;

<bb 2>:
  k = 1;

<bb 3>:
  # k.0_9 = PHI <k.7_20(11), 1(2)>
  i = k.0_9;
  if (k.0_9 <= 511)
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 8>: Invalid sum of incoming frequencies 900, should be 81
  goto <bb 5>;

<bb 7>:
  pretmp.11_19 = 1 << k.0_9;

<bb 4>:
  # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
  D.2246_5 = pretmp.11_19;
  D.2248_7 = i.1_34 + D.2246_5;
  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;
  MEM[(int *)&a][D.2248_7] = 7;
  MEM[(int *)&b][D.2248_7] = 7;
  i.6_18 = k.0_9 + i.1_34;
  i = i.6_18;
  if (i.6_18 <= 511)
    goto <bb 9>;
  else
    goto <bb 8>;

<bb 9>:
  goto <bb 4>;

<bb 5>: Invalid sum of incoming frequencies 81, should be 900
  k.7_20 = k.0_9 + 1;
  k = k.7_20;
  if (k.7_20 <= 9)
    goto <bb 11>;
  else
    goto <bb 6>;

<bb 11>:
  goto <bb 3>;

<bb 6>:
  return;
}

Can the following statements be sunk out of the loop? I don't see this optimization happen in trunk. The consequence is increased register pressure, and a spill/fill occurs in RA.

  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;

I know the sink would happen in the sink pass if a_p and b_p were local variables. If this is the root cause, which optimization pass in GCC takes the role of sinking them out of the loop? How should we get it fixed?

Thanks,
-Jiangning
RE: A case exposing code sink issue
-----Original Message-----
From: Andrew Pinski [mailto:pins...@gmail.com]
Sent: Thursday, November 24, 2011 12:15 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org
Subject: Re: A case exposing code sink issue

On Wed, Nov 23, 2011 at 8:05 PM, Jiangning Liu jiangning@arm.com wrote:

If this is the root cause, which optimization pass in GCC takes the role of sinking them out of the loop? How should we get it fixed?

lim1 handles the case just fine for me. lim1 is the first loop pass. After lim1 I get:

<bb 4>:
  # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
  D.2934_5 = pretmp.11_33;
  D.2936_7 = i.1_34 + D.2934_5;
  a_p.2_8 = &a[D.2936_7];
  a_p_lsm.13_37 = a_p.2_8;
  b_p.3_13 = &b[D.2936_7];
  b_p_lsm.14_38 = b_p.3_13;
  MEM[(int *)&a][D.2936_7] = 7;
  MEM[(int *)&b][D.2936_7] = 7;
  i.6_18 = k.0_9 + i.1_34;
  i_lsm.12_39 = i.6_18;
  if (i.6_18 <= 511)
    goto <bb 9>;
  else
    goto <bb 8>;

<bb 9>:
  goto <bb 4>;

&a[D.2936_7] and &b[D.2936_7] are not loop invariants, so it seems lim1 shouldn't be able to sink them, right? Do I misunderstand this optimization?

Thanks,
-Jiangning

Thanks,
Andrew Pinski
RE: A case exposing code sink issue
Sorry, I realize we can't do that optimization, because a_p may have dependence upon other memory accesses like MEM[(int *)&a][D.2248_7]. For example, if it happens that &a_p equals a_p, that optimization would be wrong. But can alias analysis solve the problem if we can guarantee (i+(1<<k)) is less than the upper bound of array a's definition? Or is there any GCC command line switch assuming no array bound overflow? That way we could do more aggressive optimizations, right?

Thanks,
-Jiangning

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Thursday, November 24, 2011 12:05 PM
To: gcc@gcc.gnu.org
Subject: A case exposing code sink issue

Hi,

For this small test case,

int a[512];
int b[512];
int *a_p;
int *b_p;
int i;
int k;

int f(void)
{
    for (k = 1; k <= 9; k++) {
        for (i = k; i < 512; i += k) {
            a_p = &a[i + (1<<k)];
            b_p = &b[i + (1<<k)];
            *a_p = 7;
            *b_p = 7;
        }
    }
}

Before the sink pass we have,

f ()
{
  int pretmp.11;
  int k.7;
  int i.6;
  int * b_p.3;
  int * a_p.2;
  int D.2248;
  int i.1;
  int D.2246;
  int k.0;

<bb 2>:
  k = 1;

<bb 3>:
  # k.0_9 = PHI <k.7_20(11), 1(2)>
  i = k.0_9;
  if (k.0_9 <= 511)
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 8>: Invalid sum of incoming frequencies 900, should be 81
  goto <bb 5>;

<bb 7>:
  pretmp.11_19 = 1 << k.0_9;

<bb 4>:
  # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
  D.2246_5 = pretmp.11_19;
  D.2248_7 = i.1_34 + D.2246_5;
  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;
  MEM[(int *)&a][D.2248_7] = 7;
  MEM[(int *)&b][D.2248_7] = 7;
  i.6_18 = k.0_9 + i.1_34;
  i = i.6_18;
  if (i.6_18 <= 511)
    goto <bb 9>;
  else
    goto <bb 8>;

<bb 9>:
  goto <bb 4>;

<bb 5>: Invalid sum of incoming frequencies 81, should be 900
  k.7_20 = k.0_9 + 1;
  k = k.7_20;
  if (k.7_20 <= 9)
    goto <bb 11>;
  else
    goto <bb 6>;

<bb 11>:
  goto <bb 3>;

<bb 6>:
  return;
}

Can the following statements be sunk out of the loop? I don't see this optimization happen in trunk. The consequence is increased register pressure, and a spill/fill occurs in RA.
  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;

I know the sink would happen in the sink pass if a_p and b_p were local variables. If this is the root cause, which optimization pass in GCC takes the role of sinking them out of the loop? How should we get it fixed?

Thanks,
-Jiangning
RE: A case exposing code sink issue
One more question... Can i = i.6_18; be sunk out of the loop, because it doesn't have memory dependence with others?

Thanks,
-Jiangning

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Thursday, November 24, 2011 2:57 PM
To: gcc@gcc.gnu.org
Subject: RE: A case exposing code sink issue

Sorry, I realize we can't do that optimization, because a_p may have dependence upon other memory accesses like MEM[(int *)&a][D.2248_7]. For example, if it happens that &a_p equals a_p, that optimization would be wrong. But can alias analysis solve the problem if we can guarantee (i+(1<<k)) is less than the upper bound of array a's definition? Or is there any GCC command line switch assuming no array bound overflow? That way we could do more aggressive optimizations, right?

Thanks,
-Jiangning

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Thursday, November 24, 2011 12:05 PM
To: gcc@gcc.gnu.org
Subject: A case exposing code sink issue

Hi,

For this small test case,

int a[512];
int b[512];
int *a_p;
int *b_p;
int i;
int k;

int f(void)
{
    for (k = 1; k <= 9; k++) {
        for (i = k; i < 512; i += k) {
            a_p = &a[i + (1<<k)];
            b_p = &b[i + (1<<k)];
            *a_p = 7;
            *b_p = 7;
        }
    }
}

Before the sink pass we have,

f ()
{
  int pretmp.11;
  int k.7;
  int i.6;
  int * b_p.3;
  int * a_p.2;
  int D.2248;
  int i.1;
  int D.2246;
  int k.0;

<bb 2>:
  k = 1;

<bb 3>:
  # k.0_9 = PHI <k.7_20(11), 1(2)>
  i = k.0_9;
  if (k.0_9 <= 511)
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 8>: Invalid sum of incoming frequencies 900, should be 81
  goto <bb 5>;

<bb 7>:
  pretmp.11_19 = 1 << k.0_9;

<bb 4>:
  # i.1_34 = PHI <i.6_18(9), k.0_9(7)>
  D.2246_5 = pretmp.11_19;
  D.2248_7 = i.1_34 + D.2246_5;
  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;
  MEM[(int *)&a][D.2248_7] = 7;
  MEM[(int *)&b][D.2248_7] = 7;
  i.6_18 = k.0_9 + i.1_34;
  i = i.6_18;
  if (i.6_18 <= 511)
    goto <bb 9>;
  else
    goto <bb 8>;

<bb 9>:
  goto <bb 4>;

<bb 5>: Invalid sum of incoming frequencies 81, should be
900
  k.7_20 = k.0_9 + 1;
  k = k.7_20;
  if (k.7_20 <= 9)
    goto <bb 11>;
  else
    goto <bb 6>;

<bb 11>:
  goto <bb 3>;

<bb 6>:
  return;
}

Can the following statements be sunk out of the loop? I don't see this optimization happen in trunk. The consequence is increased register pressure, and a spill/fill occurs in RA.

  a_p.2_8 = &a[D.2248_7];
  a_p = a_p.2_8;
  b_p.3_13 = &b[D.2248_7];
  b_p = b_p.3_13;

I know the sink would happen in the sink pass if a_p and b_p were local variables. If this is the root cause, which optimization pass in GCC takes the role of sinking them out of the loop? How should we get it fixed?

Thanks,
-Jiangning
RE: [RFC] Optimization to conditional and/or in ARM back-end
-----Original Message-----
From: Andrew Pinski [mailto:pins...@gmail.com]
Sent: Tuesday, November 22, 2011 1:14 PM
To: Jiangning Liu
Cc: gcc-patches@gcc.gnu.org; Richard Guenther; Richard Henderson
Subject: Re: [RFC] Optimization to conditional and/or in ARM back-end

On Sun, Nov 20, 2011 at 6:17 PM, Jiangning Liu jiangning@arm.com wrote:

Hi,

This patch is to implement a peephole like optimization in the ARM back-end. If we have an if condition expression like ((r3 != 0) & r1) != 0,

So this is the same as:

int f1(int r1, int r3)
{
  if (((r3 != 0) & r1) != 0)
    return g();
  return 1;
}

--- CUT ---

Can't you do this instead: combine the following two instructions:

(insn 17 15 18 2 (parallel [
            (set (reg:SI 150)
                (and:SI (ne:SI (reg:SI 0 r0 [ r1 ]) (const_int 0 [0]))
                    (reg:SI 1 r1 [ r3 ])))
            (clobber (reg:CC 24 cc))
        ]) t24.c:11 274 {*cond_arith}
     (expr_list:REG_UNUSED (reg:CC 24 cc)
        (expr_list:REG_DEAD (reg:SI 1 r1 [ r3 ])
            (expr_list:REG_DEAD (reg:SI 0 r0 [ r1 ])
                (nil)))))

(insn 18 17 19 2 (set (reg:CC 24 cc)
        (compare:CC (reg:SI 150) (const_int 0 [0]))) t24.c:11 211 {*arm_cmpsi_insn}
     (expr_list:REG_DEAD (reg:SI 150)
        (nil)))

And then have a pattern which expands it to (note the r3 here is the r3 in the function, likewise for r1):

  andi r1, r1, #1
  cmp r3, #0
  it ne
  cmpne r1, #0

Yes it is one extra instruction, but it is still better than before, right?

Hi Andrew,

Yes, and the code generated can even be like

  cmp r3, #0
  itt ne
  andne r1, r1, #1
  cmpne r1, #0

But this is not what I want. For performance purposes, I want to remove that and instruction as well. Also, if we don't modify r1 at all, there might be other more optimization opportunities; for example, register pressure can be reduced for some cases, I guess.

Thanks,
-Jiangning
RE: [RFC] Use REG_EXPR in back-end (introduced by optimization to conditional and/or in ARM back-end)
The original subject doesn't catch the key point, so I changed the subject to get more people noticed.

My question essentially is: may I really use REG_EXPR in back-end code, like the patch I gave below?

Thanks,
-Jiangning

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Monday, November 21, 2011 10:18 AM
To: gcc-patches@gcc.gnu.org
Cc: 'Richard Guenther'; Richard Henderson
Subject: [RFC] Optimization to conditional and/or in ARM back-end

Hi,

This patch is to implement a peephole like optimization in the ARM back-end. If we have an if condition expression like ((r3 != 0) & r1) != 0, originally the binary code to be generated is like,

  cmp r3, #0
  ite eq
  moveq r1, #0
  andne r1, r1, #1
  cmp r1, #0

But actually this expression could be transformed into ((r3 != x) && (r1 != 0)) != 0, if we know r2 is with bool type. This way we could generate new binary code like,

  cmp r3, #0
  it ne
  cmpne r1, #0

The question is how to judge that r2 is a bool variable in the back-end. I'm using REG_EXPR in function arm_check_logic_with_bool_reg to check if r2 is a bool; this function is being invoked in pattern *cond_code. May I really use REG_EXPR this way in GCC back-end code? I posted a related topic at http://gcc.gnu.org/ml/gcc/2011-11/msg00417.html.

If the answer is No, are there any suggestions about either how to judge this bool type or how to implement this optimization?

Appreciate your comments in advance!
Thanks,
-Jiangning

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..8b12d48
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -114,6 +114,7 @@ extern rtx arm_gen_load_multiple (int *, int, rtx, int, rtx, HOST_WIDE_INT *);
 extern rtx arm_gen_store_multiple (int *, int, rtx, int, rtx, HOST_WIDE_INT *);
 extern int arm_gen_movmemqi (rtx *);
 extern enum machine_mode arm_select_cc_mode (RTX_CODE, rtx, rtx);
+extern bool arm_check_logic_with_bool_reg (RTX_CODE, rtx, rtx);
 extern enum machine_mode arm_select_dominance_cc_mode (rtx, rtx, HOST_WIDE_INT);
 extern rtx arm_gen_compare_reg (RTX_CODE, rtx, rtx);

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a429c19..e96f24a
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10910,6 +10910,41 @@ arm_gen_movmemqi (rtx *operands)
   return 1;
 }

+/* Check whether the expression "rc cmp1 bool_reg" can be transformed
+   to use conditional comparison or not.  */
+bool
+arm_check_logic_with_bool_reg (RTX_CODE rc, rtx bool_reg, rtx cmp1)
+{
+  rtx cmp2;
+  HOST_WIDE_INT dom_cc;
+
+  if (!REG_P (bool_reg))
+    return false;
+
+  if (REG_EXPR (bool_reg) == NULL)
+    return false;
+
+  if (TREE_CODE (REG_EXPR (bool_reg)) != VAR_DECL)
+    return false;
+
+  if (TREE_CODE (TREE_TYPE (REG_EXPR (bool_reg))) != BOOLEAN_TYPE)
+    return false;
+
+  cmp2 = gen_rtx_NE (GET_MODE (bool_reg), bool_reg, const0_rtx);
+
+  if (rc == AND)
+    dom_cc = DOM_CC_X_AND_Y;
+  else if (rc == IOR)
+    dom_cc = DOM_CC_X_OR_Y;
+  else
+    gcc_unreachable ();
+
+  if (arm_select_dominance_cc_mode (cmp1, cmp2, dom_cc) == CCmode)
+    return false;
+
+  return true;
+}
+
+
 /* Select a dominance comparison mode if possible for a test of the
    general form (OP (COND_OR (X) (Y)) (const_int 0)).  We support three forms.
    COND_OR == DOM_CC_X_AND_Y => (X && Y)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a78ba88..e90a78e
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -9172,6 +9172,64 @@
    (set_attr "length" "4,4,8")]
 )

+; This pattern matches expression like example ((r1 != x) & r2) != 0.
+; Here r2 is a register with data type boolean.
+; This pattern can transform to code matching patterns "cmp_and".
+; Likewise, code matching pattern "cmp_ior" could be generated, if it is |
+; rather than & in the example.
+(define_insn_and_split "*cond_<code>"
+  [(set (match_operand 0 "cc_register" "")
+	(compare
+	 (ior_and:SI
+	  (match_operator:SI 4 "arm_comparison_operator"
+	   [(match_operand:SI 2 "s_register_operand" "r,r")
+	    (match_operand:SI 3 "arm_add_operand" "rI,L")])
+	  (match_operand:SI 1 "s_register_operand" "r,r"))
+	 (const_int 0)))]
+  "TARGET_32BIT
+   && arm_check_logic_with_bool_reg (<CODE>, operands[1], operands[4])"
+  "#"
+  "&& 1"
+  [(set (match_dup 7)
+	(compare
+	 (match_op_dup 5
+	  [(match_op_dup 4 [(match_dup 2) (match_dup 3)])
+	   (match_op_dup 6 [(match_dup 1) (const_int 0)])])
+	 (const_int 0)))]
+  {
+    HOST_WIDE_INT dom_cc;
+
+    operands[6] = gen_rtx_NE (SImode, operands[1], const0_rtx);
+
+    if (<CODE> == AND)
+      {
+        dom_cc = DOM_CC_X_AND_Y
RE: [RFC] Use which_alternative in preparation-statements of define_insn_and_split
-----Original Message-----
From: Richard Henderson [mailto:r...@redhat.com]
Sent: Tuesday, November 22, 2011 7:55 AM
To: Jiangning Liu
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [RFC] Use which_alternative in preparation-statements of define_insn_and_split

On 11/20/2011 07:34 PM, Jiangning Liu wrote:

Hi,

I find which_alternative can't really be used in the preparation-statements of define_insn_and_split, so can this be fixed like below? For example, I want to use which_alternative in the pattern below,

(define_insn_and_split "*thumb2_movsicc_insn"
  [(set (match_operand:SI 0 "s_register_operand" "=r,r,r,r,r,r,r,r")
	(if_then_else:SI
	 (match_operator 3 "arm_comparison_operator"
	  [(match_operand 4 "cc_register" "") (const_int 0)])
	 (match_operand:SI 1 "arm_not_operand" "0,0,rI,K,rI,rI,K,K")
	 (match_operand:SI 2 "arm_not_operand" "rI,K,0,0,rI,K,rI,K")))]
  "TARGET_THUMB2"
  "@
   it\\t%D3\;mov%D3\\t%0, %2
   it\\t%D3\;mvn%D3\\t%0, #%B2
   it\\t%d3\;mov%d3\\t%0, %1
   it\\t%d3\;mvn%d3\\t%0, #%B1
   ite\\t%d3\;mov%d3\\t%0, %1\;mov%D3\\t%0, %2
   ite\\t%d3\;mov%d3\\t%0, %1\;mvn%D3\\t%0, #%B2
   ite\\t%d3\;mvn%d3\\t%0, #%B1\;mov%D3\\t%0, %2
   ite\\t%d3\;mvn%d3\\t%0, #%B1\;mvn%D3\\t%0, #%B2"
  "&& reload_completed"
  [(cond_exec (match_dup 5)
	      (set (match_dup 0) (match_dup 6)))]
  {
    if (which_alternative >= 2 && which_alternative < 4)
      {
        ...
      }
    else if (which_alternative >= 4)

Hmm, I guess. It seems a bit weird. It seems like you'd be better off *not* using define_insn_and_split, actually, and instead using more specific tests on the separate define_split than you would on the define_insn.

Yes, that would be an alternative solution I can accept. But I still want to know why we don't want to support this. I don't see any GCC documentation saying this usage is not allowed. Or are you thinking this change is not safe enough?

Thanks,
-Jiangning

I don't feel strongly about it though. I won't object if some other rtl maintainer wants to approve this.

r~
RE: Is VRP is too conservative to identify boolean value 0 and 1?
-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Friday, September 02, 2011 5:07 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org
Subject: Re: Is VRP is too conservative to identify boolean value 0 and 1?

On Fri, Sep 2, 2011 at 7:58 AM, Jiangning Liu jiangning@arm.com wrote:

Hi,

For the following small case,

int f(int i, int j)
{
  if (i == 1 && j == 2)
    return i;
  else
    return j;
}

with -O2 option, GCC has a vrp2 dump like below,

==

Value ranges after VRP:

i_1: VARYING
i_2(D): VARYING
D.1249_3: [0, +INF]
j_4(D): VARYING
D.1250_5: [0, +INF]
D.1251_6: [0, +INF]
j_10: [2, 2]  EQUIVALENCES: { j_4(D) } (1 elements)

Removing basic block 3

f (int i, int j)
{
  _Bool D.1251;
  _Bool D.1250;
  _Bool D.1249;

<bb 2>:
  D.1249_3 = i_2(D) == 1;
  D.1250_5 = j_4(D) == 2;
  D.1251_6 = D.1250_5 & D.1249_3;
  if (D.1251_6 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:

<bb 4>:
  # i_1 = PHI <1(3), j_4(D)(2)>
  return i_1;
}

The variables D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so their value ranges should be

D.1249_3: [0, 1]
D.1250_5: [0, 1]
D.1251_6: [0, 1]

So why can't the current VRP find out this value range?

It does - it just prints it as [0, +INF]; they are bools with TYPE_MAX_VALUE == 1 after all.

Richard,

May I use REG_EXPR(rtx of D.1249_3) in an xxx.md file to detect whether D.1249_3 is a bool or not? Some comments in GCC say REG_EXPR may be lost in the back-end. True? If we do have REG_EXPR info for some cases in the back-end, is it guaranteed to be correct? May I implement a back-end peephole optimization depending on REG_EXPR?

Thanks,
-Jiangning

Richard.

I'm asking this question because the optimizations in the back-end need this info to do advanced optimization.

Thanks,
-Jiangning
[RFC] Optimization to conditional and/or in ARM back-end
Hi,

This patch is to implement a peephole like optimization in the ARM back-end. If we have an if condition expression like ((r3 != 0) & r1) != 0, originally the binary code to be generated is like,

  cmp r3, #0
  ite eq
  moveq r1, #0
  andne r1, r1, #1
  cmp r1, #0

But actually this expression could be transformed into ((r3 != x) && (r1 != 0)) != 0, if we know r2 is with bool type. This way we could generate new binary code like,

  cmp r3, #0
  it ne
  cmpne r1, #0

The question is how to judge that r2 is a bool variable in the back-end. I'm using REG_EXPR in function arm_check_logic_with_bool_reg to check if r2 is a bool; this function is being invoked in pattern *cond_code. May I really use REG_EXPR this way in GCC back-end code? I posted a related topic at http://gcc.gnu.org/ml/gcc/2011-11/msg00417.html.

If the answer is No, are there any suggestions about either how to judge this bool type or how to implement this optimization?

Appreciate your comments in advance!

Thanks,
-Jiangning

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..8b12d48
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -114,6 +114,7 @@ extern rtx arm_gen_load_multiple (int *, int, rtx, int, rtx, HOST_WIDE_INT *);
 extern rtx arm_gen_store_multiple (int *, int, rtx, int, rtx, HOST_WIDE_INT *);
 extern int arm_gen_movmemqi (rtx *);
 extern enum machine_mode arm_select_cc_mode (RTX_CODE, rtx, rtx);
+extern bool arm_check_logic_with_bool_reg (RTX_CODE, rtx, rtx);
 extern enum machine_mode arm_select_dominance_cc_mode (rtx, rtx, HOST_WIDE_INT);
 extern rtx arm_gen_compare_reg (RTX_CODE, rtx, rtx);

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a429c19..e96f24a
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10910,6 +10910,41 @@ arm_gen_movmemqi (rtx *operands)
   return 1;
 }

+/* Check whether the expression "rc cmp1 bool_reg" can be transformed
+   to use conditional comparison or not.
+   */
+bool
+arm_check_logic_with_bool_reg (RTX_CODE rc, rtx bool_reg, rtx cmp1)
+{
+  rtx cmp2;
+  HOST_WIDE_INT dom_cc;
+
+  if (!REG_P (bool_reg))
+    return false;
+
+  if (REG_EXPR (bool_reg) == NULL)
+    return false;
+
+  if (TREE_CODE (REG_EXPR (bool_reg)) != VAR_DECL)
+    return false;
+
+  if (TREE_CODE (TREE_TYPE (REG_EXPR (bool_reg))) != BOOLEAN_TYPE)
+    return false;
+
+  cmp2 = gen_rtx_NE (GET_MODE (bool_reg), bool_reg, const0_rtx);
+
+  if (rc == AND)
+    dom_cc = DOM_CC_X_AND_Y;
+  else if (rc == IOR)
+    dom_cc = DOM_CC_X_OR_Y;
+  else
+    gcc_unreachable ();
+
+  if (arm_select_dominance_cc_mode (cmp1, cmp2, dom_cc) == CCmode)
+    return false;
+
+  return true;
+}
+
+
 /* Select a dominance comparison mode if possible for a test of the
    general form (OP (COND_OR (X) (Y)) (const_int 0)).  We support three forms.
    COND_OR == DOM_CC_X_AND_Y => (X && Y)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a78ba88..e90a78e
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -9172,6 +9172,64 @@
    (set_attr "length" "4,4,8")]
 )

+; This pattern matches expression like example ((r1 != x) & r2) != 0.
+; Here r2 is a register with data type boolean.
+; This pattern can transform to code matching patterns "cmp_and".
+; Likewise, code matching pattern "cmp_ior" could be generated, if it is |
+; rather than & in the example.
+(define_insn_and_split "*cond_<code>"
+  [(set (match_operand 0 "cc_register" "")
+	(compare
+	 (ior_and:SI
+	  (match_operator:SI 4 "arm_comparison_operator"
+	   [(match_operand:SI 2 "s_register_operand" "r,r")
+	    (match_operand:SI 3 "arm_add_operand" "rI,L")])
+	  (match_operand:SI 1 "s_register_operand" "r,r"))
+	 (const_int 0)))]
+  "TARGET_32BIT
+   && arm_check_logic_with_bool_reg (<CODE>, operands[1], operands[4])"
+  "#"
+  "&& 1"
+  [(set (match_dup 7)
+	(compare
+	 (match_op_dup 5
+	  [(match_op_dup 4 [(match_dup 2) (match_dup 3)])
+	   (match_op_dup 6 [(match_dup 1) (const_int 0)])])
+	 (const_int 0)))]
+  {
+    HOST_WIDE_INT dom_cc;
+
+    operands[6] = gen_rtx_NE (SImode, operands[1], const0_rtx);
+
+    if (<CODE> == AND)
+      {
+	dom_cc = DOM_CC_X_AND_Y;
+	operands[5] = gen_rtx_fmt_ee (<CODE>, SImode,
+				      operands[4], operands[6]);
+      }
+    else if (<CODE> == IOR)
+      {
+	dom_cc = DOM_CC_X_OR_Y;
+	operands[5] = gen_rtx_fmt_ee (<CODE>, SImode,
+				      operands[4], operands[6]);
+      }
+    else
+      gcc_unreachable ();
+
+    operands[7]
+      = gen_rtx_REG (arm_select_dominance_cc_mode (operands[4],
+						   operands[6],
+						   dom_cc),
+		     CC_REGNUM);
+  }
+  [(set_attr "conds" "clob")
+   (set (attr "length")
+	(if_then_else
[RFC] Use which_alternative in preparation-statements of define_insn_and_split
Hi,

I find which_alternative can't really be used in the preparation-statements of a define_insn_and_split, so can this be fixed like below?

For example, I want to use which_alternative in the pattern below,

(define_insn_and_split "*thumb2_movsicc_insn"
  [(set (match_operand:SI 0 "s_register_operand" "=r,r,r,r,r,r,r,r")
	(if_then_else:SI
	 (match_operator 3 "arm_comparison_operator"
	  [(match_operand 4 "cc_register" "") (const_int 0)])
	 (match_operand:SI 1 "arm_not_operand" "0,0,rI,K,rI,rI,K,K")
	 (match_operand:SI 2 "arm_not_operand" "rI,K,0,0,rI,K,rI,K")))]
  "TARGET_THUMB2"
  "@
   it\\t%D3\;mov%D3\\t%0, %2
   it\\t%D3\;mvn%D3\\t%0, #%B2
   it\\t%d3\;mov%d3\\t%0, %1
   it\\t%d3\;mvn%d3\\t%0, #%B1
   ite\\t%d3\;mov%d3\\t%0, %1\;mov%D3\\t%0, %2
   ite\\t%d3\;mov%d3\\t%0, %1\;mvn%D3\\t%0, #%B2
   ite\\t%d3\;mvn%d3\\t%0, #%B1\;mov%D3\\t%0, %2
   ite\\t%d3\;mvn%d3\\t%0, #%B1\;mvn%D3\\t%0, #%B2"
  "reload_completed"
  [(cond_exec (match_dup 5)
	      (set (match_dup 0) (match_dup 6)))]
  {
    if (which_alternative >= 2 && which_alternative < 4)
      {
	...
      }
    else if (which_alternative >= 4)
      {
	...
      }
  }
  [(set_attr "length" "6,6,6,6,10,10,10,10")
   (set_attr "conds" "use")]
)

Thanks,
-Jiangning

Patch:

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index c94e743..df6a3df
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -3502,6 +3502,10 @@ try_split (rtx pat, rtx trial, int last)
     split_branch_probability = INTVAL (XEXP (note, 0));
   probability = split_branch_probability;

+  extract_insn (trial);
+  if (!constrain_operands (reload_completed))
+    return trial;
+
   seq = split_insns (pat, trial);

   split_branch_probability = -1;
MAINTAINERS: add myself
Just committed the following:

	* MAINTAINERS (Write After Approval): Add myself.

Index: MAINTAINERS
===================================================================
--- MAINTAINERS (revision 181466)
+++ MAINTAINERS (working copy)
@@ -419,6 +419,7 @@
 Marc Lehmann				p...@goof.com
 James Lemke				jwle...@juniper.net
 Kriang Lerdsuwanakij			lerds...@users.sourceforge.net
+Jiangning Liu				jiangning@arm.com
 Sa Liu					sa...@de.ibm.com
 Ralph Loader				r...@ihug.co.nz
 Gabor Loki				l...@inf.u-szeged.hu
A question about redundant load elimination
Hi,

For this test case,

int x;
extern void f(void);
void g(int *a)
{
	a[x] = 1;
	if (x == 100)
		f();
	a[x] = 2;
}

for trunk, the x86 assembly code is like below,

	movl	x, %eax
	movl	16(%esp), %ebx
	movl	$1, (%ebx,%eax,4)
	movl	x, %eax		// Is this a redundant one?
	cmpl	$100, %eax
	je	.L4
	movl	$2, (%ebx,%eax,4)
	addl	$8, %esp
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	popl	%ebx
	.cfi_restore 3
	.cfi_def_cfa_offset 4
	ret
	.p2align 4,,7
	.p2align 3
.L4:
	.cfi_restore_state
	call	f
	movl	x, %eax
	movl	$2, (%ebx,%eax,4)
	addl	$8, %esp
	.cfi_def_cfa_offset 8
	popl	%ebx
	.cfi_restore 3
	.cfi_def_cfa_offset 4
	ret

Is the 2nd "movl x, %eax" a redundant one under the single-thread programming model? If yes, can it be optimized away?

Thanks,
-Jiangning
[PATCH, ARM] Fix stack red zone bug (PR38644)
Hi,

This patch is to fix PR38644 in the ARM back-end. OK for trunk?

For every detail, please refer to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644.

ChangeLog:

2011-11-2  Jiangning Liu  jiangning@arm.com

	PR rtl-optimization/38644
	* config/arm/arm.c (thumb1_expand_epilogue): Add memory barrier
	for epilogue having stack adjustment.

ChangeLog of testsuite:

2011-11-2  Jiangning Liu  jiangning@arm.com

	PR rtl-optimization/38644
	* gcc.target/arm/stack-red-zone.c: New.

Thanks,
-Jiangning

Patch:

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1ada6f..1f6fc26
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -22215,6 +22215,8 @@ thumb1_expand_epilogue (void)
   gcc_assert (amount >= 0);
   if (amount)
     {
+      emit_insn (gen_blockage ());
+
       if (amount < 512)
	 emit_insn (gen_addsi3 (stack_pointer_rtx, stack_pointer_rtx,
				GEN_INT (amount)));
diff --git a/gcc/testsuite/gcc.target/arm/stack-red-zone.c b/gcc/testsuite/gcc.target/arm/stack-red-zone.c
new file mode 100644
index 000..b9f0f99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/stack-red-zone.c
@@ -0,0 +1,12 @@
+/* No stack red zone.  PR38644.  */
+/* { dg-options "-mthumb -O2" } */
+/* { dg-final { scan-assembler "ldrb\[^\n\]*\\n\[\t \]*add\[\t \]*sp" } } */
+
+extern int doStreamReadBlock (int *, char *, int size, int);
+
+char readStream (int *s)
+{
+  char c = 0;
+  doStreamReadBlock (s, &c, 1, *s);
+  return c;
+}
RE: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs
-----Original Message-----
From: Kai Tietz [mailto:ktiet...@googlemail.com]
Sent: Thursday, October 27, 2011 5:36 PM
To: Jiangning Liu
Cc: Michael Matz; Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org; Richard Henderson
Subject: Re: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

2011/10/27 Jiangning Liu jiangning@arm.com:

-----Original Message-----
From: Michael Matz [mailto:m...@suse.de]
Sent: Wednesday, October 26, 2011 11:47 PM
To: Kai Tietz
Cc: Jiangning Liu; Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org; Richard Henderson
Subject: Re: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

Hi,

On Wed, 26 Oct 2011, Kai Tietz wrote:

So you would mean that memory dereferencing shouldn't be considered as a side-effect at all?

No. I haven't said this at all. Of course it's a side-effect, but we're allowed to remove existing ones (under some circumstances). We're not allowed to introduce new ones, which means that this ...

So we would happily cause a crash by code 'if (i && *i != 0)', as memory dereference has for you no side-effect?

... is not allowed. But in the original example the memread was on the left side, hence occurred always, therefore we can move it to the right side, even though it might occur less often.

In your special case it might be valid that, if the first condition (and fold-const doesn't know if the side-effect condition is really the first, as it might be a sub-sequence of a condition) might trap or not, to combine it. But branching has to cover the general cases. If you find a way to determine that the left-hand operand in fold_const's branching code is really the left-most condition in the chain, then we can add such a special case, but I don't see here an easy way to determine it.

Hmm? I don't see why it's necessary to check if it's the left-most condition in a chain.
If the left hand of '&&' is a memread it can always be moved to the right side (or the operator transformed into '&', which can have the same effect), of course only if the original rhs is free of side effects, but then independent of whether the '&&' was part of a larger expression. The memread will possibly be done fewer times than originally, but as said, that's okay.

Agree. The point is that for the small case I gave, the RHS doesn't have any side effect at all, so the optimization of changing it to AND doesn't violate the C specification. We need to recover something for this case, although it did improve a lot for some particular benchmarks.

Thanks,
-Jiangning

Ciao,
Michael.

Hmm, so we can allow merging to AND, if the left-hand side might trap but has no side-effects and the rhs has neither trapping nor side-effects. As for the case where the left-hand side has side-effects but the right-hand not, we aren't allowed to do this AND/OR merge. For example, for 'if ((f = foo ()) != 0 && f < 24)' we aren't allowed to make this transformation. This shouldn't be that hard. We need to pass simple_operand_p_2 an additional argument for checking trapping or not. Would it be OK if I file a tracker in bugzilla against this?

Regards,
Kai
RE: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs
-----Original Message-----
From: Michael Matz [mailto:m...@suse.de]
Sent: Wednesday, October 26, 2011 11:47 PM
To: Kai Tietz
Cc: Jiangning Liu; Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org; Richard Henderson
Subject: Re: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

Hi,

On Wed, 26 Oct 2011, Kai Tietz wrote:

So you would mean that memory dereferencing shouldn't be considered as a side-effect at all?

No. I haven't said this at all. Of course it's a side-effect, but we're allowed to remove existing ones (under some circumstances). We're not allowed to introduce new ones, which means that this ...

So we would happily cause a crash by code 'if (i && *i != 0)', as memory dereference has for you no side-effect?

... is not allowed. But in the original example the memread was on the left side, hence occurred always, therefore we can move it to the right side, even though it might occur less often.

In your special case it might be valid that, if the first condition (and fold-const doesn't know if the side-effect condition is really the first, as it might be a sub-sequence of a condition) might trap or not, to combine it. But branching has to cover the general cases. If you find a way to determine that the left-hand operand in fold_const's branching code is really the left-most condition in the chain, then we can add such a special case, but I don't see here an easy way to determine it.

Hmm? I don't see why it's necessary to check if it's the left-most condition in a chain. If the left hand of '&&' is a memread it can always be moved to the right side (or the operator transformed into '&', which can have the same effect), of course only if the original rhs is free of side effects, but then independent of whether the '&&' was part of a larger expression. The memread will possibly be done fewer times than originally, but as said, that's okay.

Agree.
The point is that for the small case I gave, the RHS doesn't have any side effect at all, so the optimization of changing it to AND doesn't violate the C specification. We need to recover something for this case, although it did improve a lot for some particular benchmarks.

Thanks,
-Jiangning

Ciao,
Michael.
RE: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs
-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Michael Matz
Sent: Tuesday, October 11, 2011 10:45 PM
To: Kai Tietz
Cc: Richard Guenther; Kai Tietz; gcc-patches@gcc.gnu.org; Richard Henderson
Subject: Re: [patch tree-optimization]: Improve handling of conditional-branches on targets with high branch costs

Hi,

On Tue, 11 Oct 2011, Kai Tietz wrote:

Better make it a separate function that first tests your new conditions, and then calls simple_operand_p.

Well, either I make it a new function and call it instead of simple_operand_p,

That's what I meant, yes.

@@ -5149,13 +5176,6 @@ fold_truthop (location_t loc, enum tree_
		 build2 (BIT_IOR_EXPR, TREE_TYPE (ll_arg),
			 ll_arg, rl_arg),
		 build_int_cst (TREE_TYPE (ll_arg), 0));
-
-      if (LOGICAL_OP_NON_SHORT_CIRCUIT)
-	{
-	  if (code != orig_code || lhs != orig_lhs || rhs != orig_rhs)
-	    return build2_loc (loc, code, truth_type, lhs, rhs);
-	  return NULL_TREE;
-	}

Why do you remove this hunk? Shouldn't you instead move the hunk you added to fold_truth_andor() here? I realize this needs some TLC to fold_truth_andor_1, because right now it early-outs for non-comparisons, but it seems the better place. I.e. somehow move the below code into the above branch, with the associated diddling on fold_truth_andor_1 so that it gets called.

This hunk is removed, as it is vain to do here.

There is a fallthrough now that wasn't there before. I don't know if it's harmless; I just wanted to mention it.
Yes, this part introduced different behavior for this small case,

int f(char *i, int j)
{
	if (*i && j != 2)
		return *i;
	else
		return j;
}

Before the fix, we have

  D.4710 = *i;
  D.4711 = D.4710 != 0;
  D.4712 = j != 2;
  D.4713 = D.4711 & D.4712;
  if (D.4713 != 0) goto D.4714; else goto D.4715;
  D.4714:
  D.4710 = *i;
  D.4716 = (int) D.4710;
  return D.4716;
  D.4715:
  D.4716 = j;
  return D.4716;

After the fix, we have

  D.4711 = *i;
  if (D.4711 != 0) goto D.4712; else goto D.4710;
  D.4712:
  if (j != 2) goto D.4713; else goto D.4710;
  D.4713:
  D.4711 = *i;
  D.4714 = (int) D.4711;
  return D.4714;
  D.4710:
  D.4714 = j;
  return D.4714;

Does this meet the original expectation?

I find the code below in function fold_truth_andor makes the difference,

      /* Transform (A AND-IF B) into (A AND B), or (A OR-IF B)
	 into (A OR B).
	 For sequence point consistancy, we need to check for trapping,
	 and side-effects.  */
      else if (code == icode && simple_operand_p_2 (arg0)
	       && simple_operand_p_2 (arg1))
	return fold_build2_loc (loc, ncode, type, arg0, arg1);

For *i != 0, simple_operand_p (*i) returns false. Originally this was not checked by the code, and I don't see that the patch originally wanted to cover this. Can this be explained reasonably? I'm not arguing that this patch did a worse thing; I only want to understand the rationale behind it and justify that this patch doesn't hurt anything else. Actually, on the contrary, I measured that this change accidentally made some benchmarks significantly improve, due to some other reasons.

Thanks,
-Jiangning

Btw, richi asked for it, and I agree that the new TRUTH-AND/OR packing is better done at a single place in fold_truth_andor only. As fold_truthop is called twice by fold_truth_andor, the latter might indeed be the better place.

Ciao,
Michael.
RE: [PATCH] Fix stack red zone bug (PR38644)
-----Original Message-----
From: Richard Henderson [mailto:r...@redhat.com]
Sent: Saturday, October 01, 2011 3:05 AM
To: Jiangning Liu
Cc: 'Jakub Jelinek'; 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

On 09/29/2011 06:13 PM, Jiangning Liu wrote:

-----Original Message-----
From: Jakub Jelinek [mailto:ja...@redhat.com]
Sent: Thursday, September 29, 2011 6:14 PM
To: Jiangning Liu
Cc: 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

On Thu, Sep 29, 2011 at 06:08:50PM +0800, Jiangning Liu wrote:

As far as I know, different back-ends implement different prologue/epilogue code in GCC. If one day this part can be refined and abstracted as well, I would say solving this stack-red-zone problem in shared prologue/epilogue code would be a perfect solution, and the barrier could be inserted there. I'm not saying you are wrong about keeping the scheduler using a pure barrier interface. From an engineering point of view, I only feel my proposal is so far so good, because this patch at least solves the problem for all targets in a quite simple way. Maybe it can be improved in the future based on this.

But you don't want to listen to any other alternative; other backends are happy with being able to put the best kind of barrier at the best spot in the epilogue, and don't need a generic solution which won't model the target diversity very well anyway.

Jakub,

Appreciate your attention on this issue.

1) Can you clarify which the other back-ends are? Does that cover most of the back-ends supported by GCC right now?

Your red-stack barrier issue is *exactly* the same as the frame pointer barrier issue, which affects many backends. That is, if the frame pointer is initialized before the local stack frame is allocated, then one has to add a barrier such that memory references based on the frame pointer are not scheduled before the local stack frame allocation.
One example of this is in the i386 port, where the prologue looks like

	push	%ebp
	mov	%esp, %ebp
	sub	$frame, %esp

The rtl we emit for that subtraction looks like

(define_insn "pro_epilogue_adjust_stack_<mode>_add"
  [(set (match_operand:P 0 "register_operand" "=r,r")
	(plus:P (match_operand:P 1 "register_operand" "0,r")
		(match_operand:P 2 "nonmemory_operand" "ri,li")))
   (clobber (reg:CC FLAGS_REG))
   (clobber (mem:BLK (scratch)))]

Note the final clobber, which is a memory scheduling barrier.

Other targets use similar tricks. For instance, arm's "stack_tie".

Honestly, I've found nothing convincing throughout this thread that suggests to me that this problem should be handled generically.

Richard H.,

Thanks for your explanation with an x86 example. The key point is that, if possible, fixing it in the middle end can benefit all ports directly and avoid a bug-fixing burden in the back-ends, rather than fixing this problem port by port. Actually the debate now is whether memory barriers are properly modelled throughout the whole of GCC rather than in a single component, because my current understanding is that the scheduler is not the only component using memory barriers.

Thanks,
-Jiangning

r~
RE: [PATCH] Fix stack red zone bug (PR38644)
-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Friday, September 30, 2011 8:57 PM
To: Jiangning Liu; Jakub Jelinek; Richard Guenther; Andrew Pinski; gcc-patches@gcc.gnu.org; richard.sandif...@linaro.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

On Fri, Sep 30, 2011 at 2:46 PM, Richard Sandiford richard.sandif...@linaro.org wrote:

Jiangning Liu jiangning@arm.com writes:

You seem to feel strongly about this because it's a wrong-code bug that is very easy to introduce and often very hard to detect. And I definitely sympathise with that. If we were going to do it in a target-independent way, though, I think it would be better to scan patterns like epilogue and automatically introduce barriers before assignments to stack_pointer_rtx (subject to the kind of hook in your patch). But I still don't think that's better than leaving the onus on the backend. The backend is still responsible for much more complicated things like determining the correct deallocation and register-restore sequence, and for determining the correct CFI sequence.

I think the middle end in GCC is actually shared code rather than the part exactly in the middle. A pass working on RTL can be a middle end just because the code can be shared for all targets, and some passes can even work for both GIMPLE and RTL. Actually some optimizations need to work through the shared part (middle end) plus a target-specific part (back end). You are thinking the interface between this shared part and the target-specific part should be using a barrier as a proper model. To some extent I agree with this. However, it doesn't mean the fix should be in the back end rather than the middle end, because obviously this problem is a common ABI issue for all targets. If we can abstract this issue into a shared part, why shouldn't we do it in the middle end to reduce the onus on the back-ends? The back end should handle the target-specific things rather than only the complicated things.
And for the avoidance of doubt, the automatic barrier insertion that I described would be one way of doing it in target-independent code. But...

If a complicated problem can be implemented in a shared-code manner, we still want to put it into the middle end rather than the back end. I believe those optimizations based on SSA form are complicated enough, but they are all in the middle end. This is the logic I'm seeing in GCC.

The situation here is different. The target-independent rtl code is being given a blob of instructions that the backend has generated for the epilogue. There's no fine-tuning beyond that. E.g. we don't have separate patterns for "restore registers", "deallocate stack", "return": we just have one monolithic "epilogue" pattern. The target-independent code has very little control.

In contrast, after the tree optimisers have handed off the initial IL, the tree optimisers are more or less in full control. There are very few cases where we generate further trees outside the middle-end. The only case I know off-hand is the innards of va_start and va_arg, which can be generated by the backend.

So let's suppose we had a similar situation there, where we wanted va_arg to do something special in a certain situation. If we had the same three choices of:

1. use an on-the-side hook to represent the special something
2. scan the code generated by the backend and automatically inject the special something at an appropriate place
3. require each backend to do it properly from the start

(OK, slightly prejudiced wording :-)) I think we'd still choose 3.

For this particular issue, I don't think the hook interface I'm proposing is more complicated than the barrier. Instead, it is easier for a back-end implementer to be aware of the potential issue before really solving the stack red zone problem, because it is very clearly listed in the target hook list.

The point for the model-it-in-the-IL supporters like myself is that we have both many backends and many rtl passes.
Putting it in a hook keeps things simple for the backends, but it means that every rtl pass must be aware of this on-the-side dependency. Perhaps sched2 really is the only pass that needs to look at the hook at present. But perhaps not. E.g. dbr_schedule (not a problem on ARM, I realise) also reorders instructions, so maybe it would need to be audited to see whether any calls to this hook are needed. And perhaps we'd add more rtl passes later.

The point behind using a barrier is that the rtl passes do not then need to treat the stack-deallocation dependency as a special case. They can just use the normal analysis and get it right. In other words, we're both arguing for safety here.

Indeed. It's certainly not only scheduling that can move instructions, but RTL PRE, combine, ifcvt all can
RE: [PATCH] Fix stack red zone bug (PR38644)
-----Original Message-----
From: Richard Sandiford richard.sandif...@linaro.org
Date: Fri, Sep 30, 2011 at 8:46 PM
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)
To: Jiangning Liu jiangning@arm.com
Cc: Jakub Jelinek ja...@redhat.com, Richard Guenther richard.guent...@gmail.com, Andrew Pinski pins...@gmail.com, gcc-patches@gcc.gnu.org

Jiangning Liu jiangning@arm.com writes:

You seem to feel strongly about this because it's a wrong-code bug that is very easy to introduce and often very hard to detect. And I definitely sympathise with that. If we were going to do it in a target-independent way, though, I think it would be better to scan patterns like epilogue and automatically introduce barriers before assignments to stack_pointer_rtx (subject to the kind of hook in your patch). But I still don't think that's better than leaving the onus on the backend. The backend is still responsible for much more complicated things like determining the correct deallocation and register-restore sequence, and for determining the correct CFI sequence.

I think the middle end in GCC is actually shared code rather than the part exactly in the middle. A pass working on RTL can be a middle end just because the code can be shared for all targets, and some passes can even work for both GIMPLE and RTL. Actually some optimizations need to work through the shared part (middle end) plus a target-specific part (back end). You are thinking the interface between this shared part and the target-specific part should be using a barrier as a proper model. To some extent I agree with this. However, it doesn't mean the fix should be in the back end rather than the middle end, because obviously this problem is a common ABI issue for all targets. If we can abstract this issue into a shared part, why shouldn't we do it in the middle end to reduce the onus on the back-ends? The back end should handle the target-specific things rather than only the complicated things.
And for the avoidance of doubt, the automatic barrier insertion that I described would be one way of doing it in target-independent code. But...

If a complicated problem can be implemented in a shared-code manner, we still want to put it into the middle end rather than the back end. I believe those optimizations based on SSA form are complicated enough, but they are all in the middle end. This is the logic I'm seeing in GCC.

The situation here is different. The target-independent rtl code is being given a blob of instructions that the backend has generated for the epilogue. There's no fine-tuning beyond that. E.g. we don't have separate patterns for "restore registers", "deallocate stack", "return": we just have one monolithic "epilogue" pattern. The target-independent code has very little control.

In contrast, after the tree optimisers have handed off the initial IL, the tree optimisers are more or less in full control. There are very few cases where we generate further trees outside the middle-end. The only case I know off-hand is the innards of va_start and va_arg, which can be generated by the backend.

So let's suppose we had a similar situation there, where we wanted va_arg to do something special in a certain situation. If we had the same three choices of:

1. use an on-the-side hook to represent the special something
2. scan the code generated by the backend and automatically inject the special something at an appropriate place
3. require each backend to do it properly from the start

(OK, slightly prejudiced wording :-)) I think we'd still choose 3.

Richard S.,

Although I implemented va_arg for a commercial compiler a long time ago, I forgot all the details. :-) I'm not sure if using va_arg is a good example to compare with this stack red zone case.

For this particular issue, I don't think the hook interface I'm proposing is more complicated than the barrier.
Instead, it is easier for a back-end implementer to be aware of the potential issue before really solving the stack red zone problem, because it is very clearly listed in the target hook list.

The point for the model-it-in-the-IL supporters like myself is that we have both many backends and many rtl passes. Putting it in a hook keeps things simple for the backends, but it means that every rtl pass must be aware of this on-the-side dependency. Perhaps sched2 really is the only pass that needs to look at the hook at present. But perhaps not. E.g. dbr_schedule (not a problem on ARM, I realise) also reorders instructions, so maybe it would need to be audited to see whether any calls to this hook are needed. And perhaps we'd add more rtl passes later.

Let me rephrase your justification in my own words.
===
We can't compare adding a new pass and adding a new port, because they are totally different things. But it implies with my proposal the burden may still be added
RE: [PATCH] Fix stack red zone bug (PR38644)
-----Original Message-----
From: Richard Sandiford [mailto:rdsandif...@googlemail.com]
Sent: Friday, September 30, 2011 4:15 PM
To: Jiangning Liu
Cc: 'Jakub Jelinek'; 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

Jiangning Liu jiangning@arm.com writes:

-----Original Message-----
From: Jakub Jelinek [mailto:ja...@redhat.com]
Sent: Thursday, September 29, 2011 6:14 PM
To: Jiangning Liu
Cc: 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

On Thu, Sep 29, 2011 at 06:08:50PM +0800, Jiangning Liu wrote:

As far as I know, different back-ends implement different prologue/epilogue code in GCC. If one day this part can be refined and abstracted as well, I would say solving this stack-red-zone problem in shared prologue/epilogue code would be a perfect solution, and the barrier could be inserted there. I'm not saying you are wrong about keeping the scheduler using a pure barrier interface. From an engineering point of view, I only feel my proposal is so far so good, because this patch at least solves the problem for all targets in a quite simple way. Maybe it can be improved in the future based on this.

But you don't want to listen to any other alternative; other backends are happy with being able to put the best kind of barrier at the best spot in the epilogue, and don't need a generic solution which won't model the target diversity very well anyway.

Jakub,

Appreciate your attention on this issue.

1) Can you clarify which the other back-ends are? Does that cover most of the back-ends supported by GCC right now?

Not answering for Jakub of course, but as a maintainer of a backend, I know MIPS doesn't have the required barrier at the moment. But that's a bug. Like others in this thread, I'm strongly of the opinion that this should be modelled directly in the IL. And it's already supposed to be modelled in the IL.
Target-independent code emits the required barriers in cases where it, rather than the backend patterns, is responsible. E.g.:

  emit_clobber (gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode)));
  emit_clobber (gen_rtx_MEM (BLKmode, hard_frame_pointer_rtx));
  emit_move_insn (hard_frame_pointer_rtx, fp);
  emit_stack_restore (SAVE_NONLOCAL, stack);

from expand_builtin_longjmp() and:

  if (sa != 0)
    {
      sa = validize_mem (sa);
      /* These clobbers prevent the scheduler from moving
	 references to variable arrays below the code
	 that deletes (pops) the arrays.  */
      emit_clobber (gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode)));
      emit_clobber (gen_rtx_MEM (BLKmode, stack_pointer_rtx));
    }

from emit_stack_restore(). Backends that fail to follow suit are IMO just buggy. FWIW, I intend to fix MIPS this weekend.

Richard S.,

Appreciate your attention on this issue and your investigation on the MIPS target. Really glad to know you found a potential bug for MIPS through this discussion. To some extent this proves my earlier hypothesis.

You seem to feel strongly about this because it's a wrong-code bug that is very easy to introduce and often very hard to detect. And I definitely sympathise with that. If we were going to do it in a target-independent way, though, I think it would be better to scan patterns like epilogue and automatically introduce barriers before assignments to stack_pointer_rtx (subject to the kind of hook in your patch). But I still don't think that's better than leaving the onus on the backend. The backend is still responsible for much more complicated things like determining the correct deallocation and register-restore sequence, and for determining the correct CFI sequence.

I think the middle end in GCC is actually shared code rather than the part exactly in the middle. A pass working on RTL can be a middle end just because the code can be shared for all targets, and some passes can even work for both GIMPLE and RTL.
Actually some optimizations need to work through the shared part (middle end) plus a target-specific part (back end). You are thinking the interface between this shared part and the target-specific part should be using a barrier as a proper model. To some extent I agree with this. However, it doesn't mean the fix should be in the back end rather than the middle end, because obviously this problem is a common ABI issue for all targets. If we can abstract this issue into a shared part, why shouldn't we do it in the middle end to reduce the onus on the back-ends? The back end should handle the target-specific things rather than only the complicated things. If a complicated problem can be implemented in a shared-code manner, we still want to put it into the middle end rather than the back end. I believe those optimizations based on SSA form are complicated enough
RE: [PATCH] Fix stack red zone bug (PR38644)
-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Thursday, September 29, 2011 5:03 PM
To: Jiangning Liu
Cc: Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

On Thu, Sep 29, 2011 at 5:13 AM, Jiangning Liu jiangning@arm.com wrote:
On Wed, Sep 28, 2011 at 11:40 AM, Jiangning Liu jiangning@arm.com wrote:
On Wed, Sep 28, 2011 at 11:10 AM, Jiangning Liu jiangning@arm.com wrote:
On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu jiangning@arm.com wrote:
On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu jiangning@arm.com wrote:

Think of it this way. What the IR says is there is no barrier between those moves. You either have an implicit barrier (which is what you are proposing) or you have it explicitly. I think we all would rather have more things explicit than implicit in the IR. And that has been the overall feeling for a few years now.

Sorry, I'm afraid I can't agree with you.
Instead, I think using barrier to describe this kind of dependence is a kind of implicit method. Please note that this is not an usual data dependence issue. The stack pointer change doesn't have any dependence with memory access at all. It is similar to atomic instructions that require being an optimization/memory barrier. Sure it is not a usual data dependence (otherwise we would handle it already), so the targets have to explicitly express the dependence somehow, for which we only have barriers right now. Richard, Thanks for your explanation. It's explicit to back-end, while it's implicit to scheduler in middle end, because barrier can decide dependence in scheduler but barrier can be generated from several different scenarios. It's unsafe and prone to introduce bug if any one of the scenarios requiring generating barriers is being missed in back-end. Between middle-end and back-end, we should have interfaces that is easy to be implemented by back-end. After all, middle-end itself can't consist of a compiler, and vice versa. Back-end needs middle-end's help to make sure back-end is easy to be implemented and reduce the possibility of introducing bugs. Without an explicit hook as I'm proposing, back-end implementers have to clearly know all scenarios of generating barrier very clearly, because there isn't any code tips and comments in middle end telling back- end the list of all scenarios on generating barriers. Yes, barrier is a perfect interface for scheduler in theory. But from engineering point of view, I think it's better to explicitly define an interface to describe stack red zone and inform back-end, or vice versa. Not like computer, people is easy to make mistake if you don't tell them. On this bug, the fact is over the years different back-ends made similar bugs. GCC is really a perfect platform on building new ports, and I saw a lot of new back-ends. 
The current middle end is unsafe if a port doesn't support a stack red zone and its back-end doesn't generate a barrier for it. Why can't we explicitly clarify this in the compiler code between middle end and back end? What if another back-end (new or old) NOT supporting a stack red zone exposes a similar bug again?

There are gazillion things you have to explicitly get right in your backends, so I don't see why exposing proper scheduling barriers should be special
RE: [PATCH] Fix stack red zone bug (PR38644)
-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com]
Sent: Thursday, September 29, 2011 6:14 PM
To: Jiangning Liu
Cc: 'Richard Guenther'; Andrew Pinski; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Fix stack red zone bug (PR38644)

On Thu, Sep 29, 2011 at 06:08:50PM +0800, Jiangning Liu wrote:

As far as I know, different back-ends implement different prologue/epilogue in GCC. If one day this part can be refined and abstracted as well, I would say solving this stack-red-zone problem in shared prologue/epilogue code would be a perfect solution, and the barrier could be inserted there. I'm not saying you are wrong about keeping the scheduler on a pure barrier interface. From an engineering point of view, I only feel my proposal is so far so good, because this patch at least solves the problem for all targets in a quite simple way. Maybe it can be improved in the future on this basis.

But you don't want to listen about any other alternative, other backends are happy with being able to put the best kind of barrier at the best spot in the epilogue and don't need a generic solution which won't model very well the target diversity anyway.

Jakub,

I appreciate your attention to this issue.

1) Can you clarify which the other back-ends are? Do they cover most of the back-ends currently supported by GCC?
2) You need to give data to prove that the other back-ends are happy with the current solution. The fact is that over the years a bunch of bugs have been filed against this problem. WHY? At this point you are implying the other back-ends are happy with bugs or potential bugs? This is weird to me.

Also, this is not an issue of whether a back-end is able to implement a barrier or not. If you are really asking able or not, I would say every back-end is able, but "able" doesn't mean correct, and it doesn't mean most reasonable.
Compared with the one I am proposing, I don't see that the current solution has other significant advantages beyond the proper modeling for the scheduler you are arguing for. Instead, the solution I'm proposing is a safe one, and one that easily avoids bugs. If the GCC compiler infrastructure can't even guarantee a safe compilation, why should we insist on proper modeling for the scheduler only? Hopefully you can reconsider this.

-Jiangning

Jakub
RE: [PATCH] Fix stack red zone bug (PR38644)
-static inline bool
+extern bool

static bool

 ix86_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;

@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif

...

+/* Return true if the ABI allows red zone access.  */
+extern bool

static bool

+rs6000_stack_using_red_zone (void)
+{
+  return (DEFAULT_ABI != ABI_V4);
+}
+

Uros,

Thanks for your review. Accepted, and the new patch is below. No change to the ChangeLog.

Thanks,
-Jiangning

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ff8c49f..e200974 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2631,7 +2631,7 @@ static const char *const cpu_names[TARGET_CPU_DEFAULT_max] =
 /* Return true if a red-zone is in use.  */
-static inline bool
+static bool
 ix86_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif

+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE ix86_using_red_zone
+
 #undef TARGET_FUNCTION_VALUE
 #define TARGET_FUNCTION_VALUE ix86_function_value

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 1ab57e5..1e64d14 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1537,6 +1537,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail
 #endif

+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone
+
 /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors
    The PowerPC architecture requires only weak consistency among
    processors--that is, memory accesses between processors need not be
@@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple)
     }
 }

+/* Return true if the ABI allows red zone access.  */
+static bool
+rs6000_stack_using_red_zone (void)
+{
+  return (DEFAULT_ABI != ABI_V4);
+}
+
 /* Return true if OFFSET from stack pointer can be clobbered by signals.
    V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes
    below stack pointer not cloberred by signals.  */
@@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple)
 static inline bool
 offset_below_red_zone_p (HOST_WIDE_INT offset)
 {
-  return offset < (DEFAULT_ABI == ABI_V4
+  return offset < (!TARGET_STACK_USING_RED_ZONE
		   ? 0
		   : TARGET_32BIT ? -220 : -288);
 }

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..c848806 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11332,6 +11332,22 @@ to the stack.  Therefore, this hook should return
 true in general, but false for naked functions.  The default
 implementation always returns true.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void)
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default is to return false which is safe and appropriate for most targets.
+@end deftypefn
+
 @deftypevr {Target Hook} {unsigned HOST_WIDE_INT} TARGET_CONST_ANCHOR
 On some architectures it can take multiple instructions to synthesize
 a constant.  If there is another constant already in a register that

diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 6783826..0fa9e10 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -11223,6 +11223,22 @@ to the stack.  Therefore, this hook should return
 true in general, but false for naked functions.  The default
 implementation always returns true.
 @end deftypefn

+@hook TARGET_STACK_USING_RED_ZONE
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default is to return false which is safe and appropriate for most targets.
+@end deftypefn
+
 @hook TARGET_CONST_ANCHOR
 On some architectures it can take multiple
RE: [PATCH] Fix stack red zone bug (PR38644)
Just realized the ChangeLog needs to be changed as well.

ChangeLog:

	* config/i386/i386.c (ix86_using_red_zone): Remove inline.
	(TARGET_STACK_USING_RED_ZONE): New.
	* config/rs6000/rs6000.c (rs6000_stack_using_red_zone): New.
	(TARGET_STACK_USING_RED_ZONE): New.
	(offset_below_red_zone_p): Change to use new hook
	TARGET_STACK_USING_RED_ZONE.
	* doc/tm.texi (TARGET_STACK_USING_RED_ZONE): New.
	* doc/tm.texi.in (TARGET_STACK_USING_RED_ZONE): New.
	* sched-deps.c (sched_analyze_1): If the stack pointer is being
	modified and stack red zone is not supported for ports, flush out
	all memory references as they may become invalid if moved across
	the stack adjustment.
	* target.def (stack_using_red_zone): New.
	* testsuite/gcc.target/arm/stack-red-zone.c: New.

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Wednesday, September 28, 2011 2:24 PM
To: 'Uros Bizjak'
Cc: gcc-patches@gcc.gnu.org; j...@suse.cz; geo...@geoffk.org; dje@gmail.com; r...@redhat.com; Richard Earnshaw; Matthew Gretton-Dann
Subject: RE: [PATCH] Fix stack red zone bug (PR38644)

-static inline bool
+extern bool

static bool

 ix86_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;

@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif

...

+/* Return true if the ABI allows red zone access.  */
+extern bool

static bool

+rs6000_stack_using_red_zone (void)
+{
+  return (DEFAULT_ABI != ABI_V4);
+}
+

Uros,

Thanks for your review. Accepted, and the new patch is below. No change to the ChangeLog.

Thanks,
-Jiangning

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ff8c49f..e200974 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2631,7 +2631,7 @@ static const char *const cpu_names[TARGET_CPU_DEFAULT_max] =
 /* Return true if a red-zone is in use.  */
-static inline bool
+static bool
 ix86_using_red_zone (void)
 {
   return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI;
@@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail
 #endif

+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE ix86_using_red_zone
+
 #undef TARGET_FUNCTION_VALUE
 #define TARGET_FUNCTION_VALUE ix86_function_value

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 1ab57e5..1e64d14 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1537,6 +1537,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail
 #endif

+#undef TARGET_STACK_USING_RED_ZONE
+#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone
+
 /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors
    The PowerPC architecture requires only weak consistency among
    processors--that is, memory accesses between processors need not be
@@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple)
     }
 }

+/* Return true if the ABI allows red zone access.  */
+static bool
+rs6000_stack_using_red_zone (void)
+{
+  return (DEFAULT_ABI != ABI_V4);
+}
+
 /* Return true if OFFSET from stack pointer can be clobbered by signals.
    V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes
    below stack pointer not cloberred by signals.  */
@@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple)
 static inline bool
 offset_below_red_zone_p (HOST_WIDE_INT offset)
 {
-  return offset < (DEFAULT_ABI == ABI_V4
+  return offset < (!TARGET_STACK_USING_RED_ZONE
		   ? 0
		   : TARGET_32BIT ? -220 : -288);
 }

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 335c1d1..c848806 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11332,6 +11332,22 @@ to the stack.  Therefore, this hook should return
 true in general, but false for naked functions.  The default
 implementation always returns true.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void)
+This hook returns true if the target has a red zone (an area beyond the
+current extent of the stack that cannot be modified by asynchronous events
+on the processor).
+
+If this hook returns false then the compiler mid-end will not move an access
+to memory in the stack frame past a stack adjustment insn.
+
+If this hook returns true then the compiler mid-end will assume that it is
+safe to move an access to memory in the stack frame past a stack adjustment
+insn.  The target back-end must emit scheduling barrier insns when this is
+unsafe.
+
+The default
RE: [PATCH] Fix stack red zone bug (PR38644)
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, September 28, 2011 4:39 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu jiangning@arm.com wrote: -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, September 27, 2011 3:41 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu jiangning@arm.com wrote: Think of it this way. What the IR says is there is no barrier between those moves. You either have an implicit barrier (which is what you are proposing) or you have it explicitly. I think we all rather have more things explicit rather than implicit in the IR. And that has been the overall feeling for a few years now. Sorry, I'm afraid I can't agree with you. Instead, I think using barrier to describe this kind of dependence is a kind of implicit method. Please note that this is not an usual data dependence issue. The stack pointer change doesn't have any dependence with memory access at all. It is similar to atomic instructions that require being an optimization/memory barrier. Sure it is not a usual data dependence (otherwise we would handle it already), so the targets have to explicitly express the dependence somehow, for which we only have barriers right now. Richard, Thanks for your explanation. It's explicit to back-end, while it's implicit to scheduler in middle end, because barrier can decide dependence in scheduler but barrier can be generated from several different scenarios. It's unsafe and prone to introduce bug if any one of the scenarios requiring generating barriers is being missed in back-end. Between middle-end and back-end, we should have interfaces that is easy to be implemented by back-end. 
After all, middle-end itself can't consist of a compiler, and vice versa. Back-end needs middle-end's help to make sure back-end is easy to be implemented and reduce the possibility of introducing bugs. Without an explicit hook as I'm proposing, back-end implementers have to clearly know all scenarios of generating barrier very clearly, because there isn't any code tips and comments in middle end telling back-end the list of all scenarios on generating barriers. Yes, barrier is a perfect interface for scheduler in theory. But from engineering point of view, I think it's better to explicitly define an interface to describe stack red zone and inform back-end, or vice versa. Not like computer, people is easy to make mistake if you don't tell them. On this bug, the fact is over the years different back-ends made similar bugs. GCC is really a perfect platform on building new ports, and I saw a lot of new back-ends. The current middle end is unsafe, if port doesn't support stack red zone and back-ends doesn't generate barrier for it. Why can't we explicitly clarify this in compiler code between middle end and back end? What if any other back-end (new or old) NOT supporting stack red zone exposing the similar bug again? There are gazillion things you have to explicitly get right in your backends, so I don't see why exposing proper scheduling barriers should be special, and there, why red-zones should be special (as opposed to other occasions where you need to emit barriers from the backend for the scheduler). Richard, This is because, 1) Current scheduler is unsafe if back-end doesn't generate barrier for a port which doesn't support stack red zone at all. 2) Implementing barrier in back-end is a burden to new back-end implementation for ports not supporting stack red zone at all. 3) There are many other ports not reporting bugs around this. I doubt there isn't bug for them. 4) There are over 300 TARGET HOOKS being defined in target.def. 
I don't think adding this interface is hurting GCC. BTW, really appreciate your close attention to this specific issue. Thanks, -Jiangning Richard. Thanks, -Jiangning Richard. No matter what solution itself is, the problem itself is a quite a common one on ISA level, so we should solve it in middle-end, because middle-end is shared for all ports. My proposal avoids problems in future. Any new ports and new back- end implementations needn't explicitly define this hook in a very safe way. But if one port wants to unsafely introduce red zone, we can explicitly define this hook in back-end. This way, we would avoid the bug in the earliest time. Do you really want to hide this problem in back-end silently? And wait others to report bug and then waste time to get it fixed again? The facts I see is over the years
RE: [PATCH] Fix stack red zone bug (PR38644)
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, September 28, 2011 5:20 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Wed, Sep 28, 2011 at 11:10 AM, Jiangning Liu jiangning@arm.com wrote: -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, September 28, 2011 4:39 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu jiangning@arm.com wrote: -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, September 27, 2011 3:41 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu jiangning@arm.com wrote: Think of it this way. What the IR says is there is no barrier between those moves. You either have an implicit barrier (which is what you are proposing) or you have it explicitly. I think we all rather have more things explicit rather than implicit in the IR. And that has been the overall feeling for a few years now. Sorry, I'm afraid I can't agree with you. Instead, I think using barrier to describe this kind of dependence is a kind of implicit method. Please note that this is not an usual data dependence issue. The stack pointer change doesn't have any dependence with memory access at all. It is similar to atomic instructions that require being an optimization/memory barrier. Sure it is not a usual data dependence (otherwise we would handle it already), so the targets have to explicitly express the dependence somehow, for which we only have barriers right now. Richard, Thanks for your explanation. 
It's explicit to back-end, while it's implicit to scheduler in middle end, because barrier can decide dependence in scheduler but barrier can be generated from several different scenarios. It's unsafe and prone to introduce bug if any one of the scenarios requiring generating barriers is being missed in back-end. Between middle-end and back-end, we should have interfaces that is easy to be implemented by back-end. After all, middle-end itself can't consist of a compiler, and vice versa. Back-end needs middle-end's help to make sure back-end is easy to be implemented and reduce the possibility of introducing bugs. Without an explicit hook as I'm proposing, back-end implementers have to clearly know all scenarios of generating barrier very clearly, because there isn't any code tips and comments in middle end telling back-end the list of all scenarios on generating barriers. Yes, barrier is a perfect interface for scheduler in theory. But from engineering point of view, I think it's better to explicitly define an interface to describe stack red zone and inform back-end, or vice versa. Not like computer, people is easy to make mistake if you don't tell them. On this bug, the fact is over the years different back-ends made similar bugs. GCC is really a perfect platform on building new ports, and I saw a lot of new back-ends. The current middle end is unsafe, if port doesn't support stack red zone and back-ends doesn't generate barrier for it. Why can't we explicitly clarify this in compiler code between middle end and back end? What if any other back-end (new or old) NOT supporting stack red zone exposing the similar bug again? There are gazillion things you have to explicitly get right in your backends, so I don't see why exposing proper scheduling barriers should be special, and there, why red-zones should be special (as opposed to other occasions where you need to emit barriers from the backend for the scheduler). 
Richard, This is because, 1) Current scheduler is unsafe if back-end doesn't generate barrier for a port which doesn't support stack red zone at all. 2) Implementing barrier in back-end is a burden to new back-end implementation for ports not supporting stack red zone at all. 3) There are many other ports not reporting bugs around this. I doubt there isn't bug for them. 4) There are over 300 TARGET HOOKS being defined in target.def. I don't think adding this interface is hurting GCC. I don't argue that your solution is not acceptable, just your reasoning is bogus IMHO. 1) is a target bug, Why should back-ends handle a thing that doesn't exist at all for them? I don't see clear logic here. 2) huh, I doubt that this is the biggest issue one faces when implementing a new target, I never say
RE: [PATCH] Fix stack red zone bug (PR38644)
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, September 28, 2011 5:56 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Wed, Sep 28, 2011 at 11:40 AM, Jiangning Liu jiangning@arm.com wrote: -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, September 28, 2011 5:20 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Wed, Sep 28, 2011 at 11:10 AM, Jiangning Liu jiangning@arm.com wrote: -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, September 28, 2011 4:39 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Wed, Sep 28, 2011 at 3:49 AM, Jiangning Liu jiangning@arm.com wrote: -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, September 27, 2011 3:41 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu jiangning@arm.com wrote: Think of it this way. What the IR says is there is no barrier between those moves. You either have an implicit barrier (which is what you are proposing) or you have it explicitly. I think we all rather have more things explicit rather than implicit in the IR. And that has been the overall feeling for a few years now. Sorry, I'm afraid I can't agree with you. Instead, I think using barrier to describe this kind of dependence is a kind of implicit method. Please note that this is not an usual data dependence issue. The stack pointer change doesn't have any dependence with memory access at all. It is similar to atomic instructions that require being an optimization/memory barrier. 
Sure it is not a usual data dependence (otherwise we would handle it already), so the targets have to explicitly express the dependence somehow, for which we only have barriers right now. Richard, Thanks for your explanation. It's explicit to back-end, while it's implicit to scheduler in middle end, because barrier can decide dependence in scheduler but barrier can be generated from several different scenarios. It's unsafe and prone to introduce bug if any one of the scenarios requiring generating barriers is being missed in back-end. Between middle-end and back-end, we should have interfaces that is easy to be implemented by back-end. After all, middle-end itself can't consist of a compiler, and vice versa. Back-end needs middle-end's help to make sure back-end is easy to be implemented and reduce the possibility of introducing bugs. Without an explicit hook as I'm proposing, back-end implementers have to clearly know all scenarios of generating barrier very clearly, because there isn't any code tips and comments in middle end telling back-end the list of all scenarios on generating barriers. Yes, barrier is a perfect interface for scheduler in theory. But from engineering point of view, I think it's better to explicitly define an interface to describe stack red zone and inform back-end, or vice versa. Not like computer, people is easy to make mistake if you don't tell them. On this bug, the fact is over the years different back-ends made similar bugs. GCC is really a perfect platform on building new ports, and I saw a lot of new back-ends. The current middle end is unsafe, if port doesn't support stack red zone and back-ends doesn't generate barrier for it. Why can't we explicitly clarify this in compiler code between middle end and back end? What if any other back-end (new or old) NOT supporting stack red zone exposing the similar bug again? 
There are gazillion things you have to explicitly get right in your backends, so I don't see why exposing proper scheduling barriers should be special, and there, why red-zones should be special (as opposed to other occasions where you need to emit barriers from the backend for the scheduler). Richard, This is because, 1) Current scheduler is unsafe if back-end doesn't generate barrier for a port which doesn't support stack red zone at all. 2) Implementing barrier in back-end is a burden to new back-end implementation for ports not supporting stack red zone at all. 3) There are many other ports not reporting bugs around
RE: [PATCH] Fix stack red zone bug (PR38644)
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, September 27, 2011 3:41 PM To: Jiangning Liu Cc: Andrew Pinski; gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Tue, Sep 27, 2011 at 5:32 AM, Jiangning Liu jiangning@arm.com wrote: Think of it this way. What the IR says is there is no barrier between those moves. You either have an implicit barrier (which is what you are proposing) or you have it explicitly. I think we all rather have more things explicit rather than implicit in the IR. And that has been the overall feeling for a few years now. Sorry, I'm afraid I can't agree with you. Instead, I think using barrier to describe this kind of dependence is a kind of implicit method. Please note that this is not an usual data dependence issue. The stack pointer change doesn't have any dependence with memory access at all. It is similar to atomic instructions that require being an optimization/memory barrier. Sure it is not a usual data dependence (otherwise we would handle it already), so the targets have to explicitly express the dependence somehow, for which we only have barriers right now. Richard, Thanks for your explanation. It's explicit to back-end, while it's implicit to scheduler in middle end, because barrier can decide dependence in scheduler but barrier can be generated from several different scenarios. It's unsafe and prone to introduce bug if any one of the scenarios requiring generating barriers is being missed in back-end. Between middle-end and back-end, we should have interfaces that is easy to be implemented by back-end. After all, middle-end itself can't consist of a compiler, and vice versa. Back-end needs middle-end's help to make sure back-end is easy to be implemented and reduce the possibility of introducing bugs. 
Without an explicit hook as I'm proposing, back-end implementers have to clearly know all scenarios of generating barrier very clearly, because there isn't any code tips and comments in middle end telling back-end the list of all scenarios on generating barriers. Yes, barrier is a perfect interface for scheduler in theory. But from engineering point of view, I think it's better to explicitly define an interface to describe stack red zone and inform back-end, or vice versa. Not like computer, people is easy to make mistake if you don't tell them. On this bug, the fact is over the years different back-ends made similar bugs. GCC is really a perfect platform on building new ports, and I saw a lot of new back-ends. The current middle end is unsafe, if port doesn't support stack red zone and back-ends doesn't generate barrier for it. Why can't we explicitly clarify this in compiler code between middle end and back end? What if any other back-end (new or old) NOT supporting stack red zone exposing the similar bug again? Thanks, -Jiangning Richard. No matter what solution itself is, the problem itself is a quite a common one on ISA level, so we should solve it in middle-end, because middle-end is shared for all ports. My proposal avoids problems in future. Any new ports and new back-end implementations needn't explicitly define this hook in a very safe way. But if one port wants to unsafely introduce red zone, we can explicitly define this hook in back-end. This way, we would avoid the bug in the earliest time. Do you really want to hide this problem in back-end silently? And wait others to report bug and then waste time to get it fixed again? The facts I see is over the years, for different ports reported the similar issue around this, and somebody tried to fix them. Actually, this kind of problem should be fixed in design level. 
If the first people solve this bug can give solution in middle end, we would not need to waste time on filing those bugs in bug zilla and fixing them around this problem. Thanks, -Jiangning
RE: A question about detecting array bounds for case Warray-bounds-3.c
PING... -Original Message- From: Jiangning Liu [mailto:jiangning@arm.com] Sent: Thursday, September 22, 2011 10:19 AM To: gcc@gcc.gnu.org Cc: 'ja...@gcc.gnu.org'; 'muel...@gcc.gnu.org'; 'rgue...@gcc.gnu.org'; Matthew Gretton-Dann Subject: A question about detecting array bounds for case Warray-bounds-3.c Hi, For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, it is obviously an invalid C program, because the last iteration of each loop accesses its array beyond the size given in the corresponding array declaration. The condition checking the upper bound should be < rather than <=. Right now, the GCC compiler doesn't report any warning message for this case; shouldn't that be a bug in both the test case and the compiler? But looking at http://gcc.gnu.org/PR31227 , it seems this test case is designed this way on purpose. Can anybody explain this? The case is like below, /* { dg-do compile } */ /* { dg-options "-O2 -Warray-bounds" } */ /* based on PR 31227 */ struct S { const char *abday[7]; const char *day[7]; const char *abmon[12]; const char *mon[12]; const char *am_pm[2]; }; ... for (cnt = 0; cnt <= 7; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 14; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 26; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 38; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 40; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->am_pm[cnt - 38] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } Thanks, -Jiangning
RE: A case that PRE optimization hurts performance
* Without PRE, Path1: movl $2, %eax cmpl $1, %eax je .L3 Path2: movb $3, %al cmpl $1, %eax je .L3 Path3: cmpl $1, %eax jne .L14 * With PRE, Path1: movl $1, %ebx movl $2, %eax testb %bl, %bl je .L3 Path2: movl $1, %ebx movb $3, %al testb %bl, %bl je .L3 Path3: cmpl $1, %eax setne %bl testb %bl, %bl jne .L14 Do you have any more thoughts? It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile time, considering the preceding movl $1, %ebx. Am I missing something? Yes. Can this be done by PRE or any other optimization in the middle end? Thanks, -Jiangning Richard.
RE: A case that PRE optimization hurts performance
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Tuesday, September 27, 2011 12:43 AM To: Richard Guenther Cc: Jiangning Liu; gcc@gcc.gnu.org Subject: Re: A case that PRE optimization hurts performance On 09/26/11 05:00, Richard Guenther wrote: On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu jiangning@arm.com wrote: * Without PRE, Path1: movl $2, %eax cmpl $1, %eax je .L3 Path2: movb $3, %al cmpl $1, %eax je .L3 Path3: cmpl $1, %eax jne .L14 * With PRE, Path1: movl $1, %ebx movl $2, %eax testb %bl, %bl je .L3 Path2: movl $1, %ebx movb $3, %al testb %bl, %bl je .L3 Path3: cmpl $1, %eax setne %bl testb %bl, %bl jne .L14 Do you have any more thoughts? It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile time, considering the preceding movl $1, %ebx. Am I missing something? Yes. Can this be done by PRE or any other optimization in the middle end? Hm, the paths as you quote them are obfuscated by missed jump-threading. On the tree level we have # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)> # prephitmp.6_1 = PHI <1(5), 1(4), 1(6), prephitmp.6_3(7)> <L10>: t_14 = t_24 + 1; D.2729_6 = MEM[base: t_14, offset: 0B]; D.2732_7 = D.2729_6 != 0; D.2734_9 = prephitmp.6_1 & D.2732_7; if (D.2734_9 != 0) where we could thread the cases with prephitmp.6_1 == 1, ultimately removing the & and forwarding the D.2729_6 != 0 test. Which would of course cause some code duplication. Jeff, you recently looked at tree jump-threading, can you see if we can improve things on this particular testcase? There's nothing threading can do here because it doesn't know anything about the value MEM[t14]. Jeff, Could you please explain more about this? What information does jump threading want to know about MEM[t14]? Do you mean it's hard to duplicate that basic block for some reason?
Thanks, -Jiangning Jeff
RE: [PATCH, testsuite] Add loop unrolling command line options for some test cases
-Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Saturday, September 24, 2011 6:12 PM To: Jiangning Liu Cc: Mike Stump; gcc-patches@gcc.gnu.org; r...@cebitec.uni-bielefeld.de Subject: Re: [PATCH, testsuite] Add loop unrolling command line options for some test cases On Thu, Sep 22, 2011 at 6:22 AM, Jiangning Liu jiangning@arm.com wrote: Hi Mike, OK. I will wait 24 more hours. If no objections by then, I will get it checked into trunk. I don't think you need -funroll-loops though. The intent of those test cases is not hurt if -funroll-loops is added, right? Thanks, -Jiangning -Original Message- From: Mike Stump [mailto:mikest...@comcast.net] Sent: Thursday, September 22, 2011 3:10 AM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org; r...@cebitec.uni-bielefeld.de Subject: Re: [PATCH, testsuite] Add loop unrolling command line options for some test cases On Sep 21, 2011, at 1:22 AM, Jiangning Liu wrote: The fix is to explicitly turn on loop unrolling and set max-unroll-times to 8, which is larger than the unroll counts being detected in the cases. Sounds reasonable to me. Ok, though, do watch for any comments by people that know more than I.
[PATCH] Fix stack red zone bug (PR38644)
This patch fixes PR38644, a 3-year-old bug. From the discussions on the mailing list and in Bugzilla, I think a middle-end fix is the common view. Although there are still some gaps on how exactly to fix it in the middle end, this patch at least moves the problem from the back end to the middle end, which makes the GCC infrastructure stronger than before. Otherwise, every back-end needs to find and fix this problem by itself. Since this bug has been pinged many times by customers, I think we can at least move on with this patch. If necessary, we can then follow up to build an even better solution in the middle end later on. The patch is tested on x86 and ARM. ChangeLog: * config/i386/i386.c (ix86_stack_using_red_zone): Change inline to be extern. (TARGET_STACK_USING_RED_ZONE): New. * config/rs6000/rs6000.c (rs6000_stack_using_red_zone): New. (TARGET_STACK_USING_RED_ZONE): New. (offset_below_red_zone_p): Change to use new hook TARGET_STACK_USING_RED_ZONE. * doc/tm.texi (TARGET_STACK_USING_RED_ZONE): New. * doc/tm.texi.in (TARGET_STACK_USING_RED_ZONE): New. * sched-deps.c (sched_analyze_1): If the stack pointer is being modified and stack red zone is not supported for ports, flush out all memory references as they may become invalid if moved across the stack adjustment. * target.def (stack_using_red_zone): New. * testsuite/gcc.target/arm/stack-red-zone.c: New. Thanks, -Jiangning diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ff8c49f..ce486da 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2631,8 +2631,8 @@ static const char *const cpu_names[TARGET_CPU_DEFAULT_max] = /* Return true if a red-zone is in use. 
*/ -static inline bool -ix86_using_red_zone (void) +extern bool +ix86_stack_using_red_zone (void) { return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI; } @@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void) #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail #endif +#undef TARGET_STACK_USING_RED_ZONE +#define TARGET_STACK_USING_RED_ZONE ix86_stack_using_red_zone + #undef TARGET_FUNCTION_VALUE #define TARGET_FUNCTION_VALUE ix86_function_value diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 1ab57e5..411cb09 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -1537,6 +1537,9 @@ static const struct attribute_spec rs6000_attribute_table[] = #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail #endif +#undef TARGET_STACK_USING_RED_ZONE +#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone + /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors The PowerPC architecture requires only weak consistency among processors--that is, memory accesses between processors need not be @@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple) } } +/* Return true if the ABI allows red zone access. */ +extern bool +rs6000_stack_using_red_zone (void) +{ + return (DEFAULT_ABI != ABI_V4); + } + /* Return true if OFFSET from stack pointer can be clobbered by signals. V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes below stack pointer not clobbered by signals. */ @@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple) static inline bool offset_below_red_zone_p (HOST_WIDE_INT offset) { - return offset < (DEFAULT_ABI == ABI_V4 + return offset < (!TARGET_STACK_USING_RED_ZONE ? 0 : TARGET_32BIT ? -220 : -288); } diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 335c1d1..c848806 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -11332,6 +11332,22 @@ to the stack. 
Therefore, this hook should return true in general, but false for naked functions. The default implementation always returns true. @end deftypefn +@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void) +This hook returns true if the target has a red zone (an area beyond the +current extent of the stack that cannot be modified by asynchronous events +on the processor). + +If this hook returns false then the compiler mid-end will not move an access +to memory in the stack frame past a stack adjustment insn. + +If this hook returns true then the compiler mid-end will assume that it is +safe to move an access to memory in the stack frame past a stack adjustment +insn. The target back-end must emit scheduling barrier insns when this is +unsafe. + +The default is to return false which is safe and appropriate for most targets. +@end deftypefn + @deftypevr {Target Hook} {unsigned HOST_WIDE_INT} TARGET_CONST_ANCHOR On some architectures it can take multiple instructions to synthesize a constant. If there is another constant already in a register that diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 6783826..0fa9e10 100644 --- a/gcc/doc/tm.texi.in +++
RE: [PATCH] Fix stack red zone bug (PR38644)
-Original Message- From: Andrew Pinski [mailto:pins...@gmail.com] Sent: Tuesday, September 27, 2011 5:31 AM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Fix stack red zone bug (PR38644) On Mon, Sep 26, 2011 at 3:26 AM, Jiangning Liu jiangning@arm.com wrote: This patch fixes PR38644, a 3-year-old bug. From the discussions on the mailing list and in Bugzilla, I think a middle-end fix is the common view. Although there are still some gaps on how exactly to fix it in the middle end, I think this patch at least moves the problem from the back end to the middle end, which makes the GCC infrastructure stronger than before. Otherwise, every back-end needs to find and fix this problem by itself. I don't see why you think that it is the common view that the fix should be in the middle end. I and a few others think the back-end should be where the barrier is emitted from. The middle end should not have stack accesses as being special in any way when it comes down to scheduling. What happens when a new scheduler comes around? Then this has to be fixed again in that new scheduler, rather than having the barrier do the correct thing for all schedulers. Hi Andrew, Thanks for your kind response! Sorry, for this bug I didn't previously see your comments, in either Bugzilla or the [RFC] I sent out a long time ago on the gcc mailing list. What I saw was a bunch of people agreeing that this problem should be fixed in the middle end. For example, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644#c48 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644#c49 My comments, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644#c35 I'd like to repeat my points here. As I mentioned, it shouldn't be the back-end's responsibility to find this problem, i.e. instructions moving over a stack pointer change. ISAs can be split into two classes: one class doesn't support a stack red zone, and the other class does. 
If we agree this is a general issue, I think it should be solved in the middle end rather than in the back-end. Yes, a back-end can provide a barrier to explicitly represent the dependence, but I think that is a kind of workaround, because that way we are hoping each back-end will detect this kind of dependence, while the dependence is common to every back-end that doesn't support a stack red zone. If we carefully analyze the implementations in the x86 and powerpc back-ends that support a stack red zone, we may find they are doing almost the same things. In particular, I think retargetability is the most valuable treasure of GCC. Thinking of implementing a new port, do we want to write new code to find this problem, whether or not the new port supports a stack red zone? If the answer is YES, it means we need to write code similar to what the x86-64 and powerpc back-ends are currently doing. Obviously, that is non-trivial work, and I don't think it's good for GCC's retargetability. Why don't we abstract the common part in the middle end with a very clean interface? Finally, it's OK with me if you say the scheduler can only treat dependence as a clean interface. But that point doesn't imply the stack red zone issue must be handled in each back-end separately. If we want to keep the scheduler clean enough, a simpler solution is to add a pass in the middle end that generates barriers before scheduling, with that pass handling the general stack-red-zone issue using the interface I'm defining in the patch; but obviously that would be over-design so far. Thanks, -Jiangning Thanks, Andrew Pinski
RE: [PATCH] Fix stack red zone bug (PR38644)
Fix a typo and CC x86/rs6000/arm ports maintainers. ChangeLog: * config/i386/i386.c (ix86_stack_using_red_zone): Change inline to be extern. (TARGET_STACK_USING_RED_ZONE): New. * config/rs6000/rs6000.c (rs6000_stack_using_red_zone): New. (TARGET_STACK_USING_RED_ZONE): New. (offset_below_red_zone_p): Change to use new hook TARGET_STACK_USING_RED_ZONE. * doc/tm.texi (TARGET_STACK_USING_RED_ZONE): New. * doc/tm.texi.in (TARGET_STACK_USING_RED_ZONE): New. * sched-deps.c (sched_analyze_1): If the stack pointer is being modified and stack red zone is not supported for ports, flush out all memory references as they may become invalid if moved across the stack adjustment. * target.def (stack_using_red_zone): New. * testsuite/gcc.target/arm/stack-red-zone.c: New. Thanks, -Jiangning diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ff8c49f..1c16391 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2631,7 +2631,7 @@ static const char *const cpu_names[TARGET_CPU_DEFAULT_max] = /* Return true if a red-zone is in use. 
*/ -static inline bool +extern bool ix86_using_red_zone (void) { return TARGET_RED_ZONE && !TARGET_64BIT_MS_ABI; @@ -35958,6 +35958,9 @@ ix86_autovectorize_vector_sizes (void) #define TARGET_STACK_PROTECT_FAIL ix86_stack_protect_fail #endif +#undef TARGET_STACK_USING_RED_ZONE +#define TARGET_STACK_USING_RED_ZONE ix86_using_red_zone + #undef TARGET_FUNCTION_VALUE #define TARGET_FUNCTION_VALUE ix86_function_value diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 1ab57e5..411cb09 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -1537,6 +1537,9 @@ static const struct attribute_spec rs6000_attribute_table[] = #define TARGET_STACK_PROTECT_FAIL rs6000_stack_protect_fail #endif +#undef TARGET_STACK_USING_RED_ZONE +#define TARGET_STACK_USING_RED_ZONE rs6000_stack_using_red_zone + /* MPC604EUM 3.5.2 Weak Consistency between Multiple Processors The PowerPC architecture requires only weak consistency among processors--that is, memory accesses between processors need not be @@ -20660,6 +20663,13 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple) } } +/* Return true if the ABI allows red zone access. */ +extern bool +rs6000_stack_using_red_zone (void) +{ + return (DEFAULT_ABI != ABI_V4); +} + /* Return true if OFFSET from stack pointer can be clobbered by signals. V.4 doesn't have any stack cushion, AIX ABIs have 220 or 288 bytes below stack pointer not clobbered by signals. */ @@ -20667,7 +20677,7 @@ rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple) static inline bool offset_below_red_zone_p (HOST_WIDE_INT offset) { - return offset < (DEFAULT_ABI == ABI_V4 + return offset < (!TARGET_STACK_USING_RED_ZONE ? 0 : TARGET_32BIT ? -220 : -288); } diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 335c1d1..c848806 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -11332,6 +11332,22 @@ to the stack. Therefore, this hook should return true in general, but false for naked functions. 
The default implementation always returns true. @end deftypefn +@deftypefn {Target Hook} bool TARGET_STACK_USING_RED_ZONE (void) +This hook returns true if the target has a red zone (an area beyond the +current extent of the stack that cannot be modified by asynchronous events +on the processor). + +If this hook returns false then the compiler mid-end will not move an access +to memory in the stack frame past a stack adjustment insn. + +If this hook returns true then the compiler mid-end will assume that it is +safe to move an access to memory in the stack frame past a stack adjustment +insn. The target back-end must emit scheduling barrier insns when this is +unsafe. + +The default is to return false which is safe and appropriate for most targets. +@end deftypefn + @deftypevr {Target Hook} {unsigned HOST_WIDE_INT} TARGET_CONST_ANCHOR On some architectures it can take multiple instructions to synthesize a constant. If there is another constant already in a register that diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 6783826..0fa9e10 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -11223,6 +11223,22 @@ to the stack. Therefore, this hook should return true in general, but false for naked functions. The default implementation always returns true. @end deftypefn +@hook TARGET_STACK_USING_RED_ZONE +This hook returns true if the target has a red zone (an area beyond the +current extent of the stack that cannot be modified by asynchronous events +on the processor). + +If this hook returns false then the compiler mid-end will not move an access +to memory in the stack frame past a stack adjustment insn. + +If this hook returns true then the compiler mid-end will assume that it
RE: [PATCH] Fix stack red zone bug (PR38644)
Think of it this way. What the IR says is there is no barrier between those moves. You either have an implicit barrier (which is what you are proposing) or you have it explicitly. I think we would all rather have more things explicit rather than implicit in the IR. And that has been the overall feeling for a few years now. Sorry, I'm afraid I can't agree with you. Instead, I think using a barrier to describe this kind of dependence is itself an implicit method. Please note that this is not a usual data dependence issue: the stack pointer change doesn't have any data dependence on the memory accesses at all. No matter what the solution itself is, the problem is quite a common one at the ISA level, so we should solve it in the middle end, because the middle end is shared by all ports. My proposal avoids problems in the future. New ports and new back-end implementations needn't explicitly define this hook and get safe behavior by default. But if a port wants to introduce a red zone, which is unsafe for scheduling, it can explicitly define this hook in its back-end. This way, we would catch the bug at the earliest time. Do you really want to hide this problem silently in the back-end, and wait for others to report the bug and then waste time fixing it again? The fact is that over the years different ports reported similar issues around this, and people tried to fix them one by one. This kind of problem really should be fixed at the design level. If the first people to fix this bug had given a solution in the middle end, we would not have needed to waste time filing and fixing those Bugzilla reports around this problem. Thanks, -Jiangning
A question about detecting array bounds for case Warray-bounds-3.c
Hi, For case gcc/testsuite/gcc.dg/Warray-bounds-3.c, it is obviously an invalid C program, because the last iteration of each loop accesses its array beyond the size given in the corresponding array declaration. The condition checking the upper bound should be < rather than <=. Right now, the GCC compiler doesn't report any warning message for this case; shouldn't that be a bug in both the test case and the compiler? But looking at http://gcc.gnu.org/PR31227 , it seems this test case is designed this way on purpose. Can anybody explain this? The case is like below, /* { dg-do compile } */ /* { dg-options "-O2 -Warray-bounds" } */ /* based on PR 31227 */ struct S { const char *abday[7]; const char *day[7]; const char *abmon[12]; const char *mon[12]; const char *am_pm[2]; }; ... for (cnt = 0; cnt <= 7; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->abday[cnt] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 14; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->day[cnt - 7] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 26; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->abmon[cnt - 14] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 38; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->mon[cnt - 26] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } for (; cnt <= 40; ++cnt) { iov[2 + cnt].iov_base = (void *) (time->am_pm[cnt - 38] ?: ""); iov[2 + cnt].iov_len = strlen (iov[2 + cnt].iov_base) + 1; } Thanks, -Jiangning
[PATCH, testsuite] Add loop unrolling command line options for some test cases
Hi, The following test cases check the predictive commoning optimization by detecting how many times loops are unrolled. The default max-unroll-times is 8. If max-unroll-times is specified to be less than the expected unroll count, those cases would fail. The fix is to explicitly turn on loop unrolling and set max-unroll-times to 8, which is larger than the unroll counts being detected in the cases. ChangeLog: 2011-09-14 Jiangning Liu jiangning@arm.com * gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c: Explicitly turn on loop unroll and set max unroll times to 8. * gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c: Likewise. * gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c: Likewise. * gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c: Likewise. * gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c: Likewise. Thanks, -Jiangning diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c index 16bd5c9..f1e52e5 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-1.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-do run } */ -/* { dg-options "-O2 -fpredictive-commoning -fdump-tree-pcom-details" } */ +/* { dg-options "-O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fdump-tree-pcom-details" } */ void abort (void); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c index 7275f28..27e53ee 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-do run } */ -/* { dg-options "-O2 -fpredictive-commoning -fdump-tree-pcom-details" } */ +/* { dg-options "-O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fdump-tree-pcom-details" } */ void abort (void); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c index d500234..5dfe384 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c +++ 
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-3.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fpredictive-commoning -fdump-tree-pcom-details" } */ +/* { dg-options "-O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fdump-tree-pcom-details" } */ int a[1000], b[1000]; diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c index 6f06b7f..c29a46a 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-do run } */ -/* { dg-options "-O2 -fpredictive-commoning -fdump-tree-pcom-details" } */ +/* { dg-options "-O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fdump-tree-pcom-details" } */ /* Test for predictive commoning of expressions, without reassociation. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c index 134fc37..29444ab 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-do run } */ -/* { dg-options "-O2 -fpredictive-commoning -fdump-tree-pcom-details" } */ +/* { dg-options "-O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fdump-tree-pcom-details" } */ /* Test for predictive commoning of expressions, with reassociation. */
RE: [PATCH, testsuite] Add loop unrolling command line options for some test cases
Hi Mike, OK. I will wait 24 more hours. If no objections by then, I will get it checked into trunk. Thanks, -Jiangning -Original Message- From: Mike Stump [mailto:mikest...@comcast.net] Sent: Thursday, September 22, 2011 3:10 AM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org; r...@cebitec.uni-bielefeld.de Subject: Re: [PATCH, testsuite] Add loop unrolling command line options for some test cases On Sep 21, 2011, at 1:22 AM, Jiangning Liu wrote: The fix is to explicitly turn on loop unroll and set max-unroll-times to 8, which is larger than the unrolling times being detected in the cases. Sounds reasonable to me. Ok, though, do watch for any comments by people that know more than I.
RE: A case that PRE optimization hurts performance
Hi Richard, I slightly changed the case to be like below, int f(char *t) { int s=0; while (*t && s != 1) { switch (s) { case 0: /* path 1 */ s = 2; break; case 2: /* path 2 */ s = 3; /* changed */ break; default: /* path 3 */ if (*t == '-') s = 2; break; } t++; } return s; } -O2 is still worse than -O2 -fno-tree-pre. The -O2 -fno-tree-pre result is f: pushl %ebp xorl %eax, %eax movl %esp, %ebp movl 8(%ebp), %edx movzbl (%edx), %ecx jmp .L14 .p2align 4,,7 .p2align 3 .L5: movl $2, %eax .L7: addl $1, %edx cmpl $1, %eax movzbl (%edx), %ecx je .L3 .L14: testb %cl, %cl je .L3 testl %eax, %eax je .L5 cmpl $2, %eax .p2align 4,,5 je .L17 cmpb $45, %cl .p2align 4,,5 je .L5 addl $1, %edx cmpl $1, %eax movzbl (%edx), %ecx jne .L14 .p2align 4,,7 .p2align 3 .L3: popl %ebp .p2align 4,,2 ret .p2align 4,,7 .p2align 3 .L17: movb $3, %al .p2align 4,,3 jmp .L7 While the -O2 result is f: pushl %ebp xorl %eax, %eax movl %esp, %ebp movl 8(%ebp), %edx pushl %ebx movzbl (%edx), %ecx jmp .L14 .p2align 4,,7 .p2align 3 .L5: movl $1, %ebx movl $2, %eax .L7: addl $1, %edx testb %bl, %bl movzbl (%edx), %ecx je .L3 .L14: testb %cl, %cl je .L3 testl %eax, %eax je .L5 cmpl $2, %eax .p2align 4,,5 je .L16 cmpb $45, %cl .p2align 4,,5 je .L5 cmpl $1, %eax setne %bl addl $1, %edx testb %bl, %bl movzbl (%edx), %ecx jne .L14 .p2align 4,,7 .p2align 3 .L3: popl %ebx popl %ebp ret .p2align 4,,7 .p2align 3 .L16: movl $1, %ebx movb $3, %al jmp .L7 You may notice that register ebx is introduced, and some more instructions around ebx are generated as well, i.e. setne %bl and testb %bl, %bl. I agree with you that in theory PRE does the right thing to minimize the computation cost at the GIMPLE level. However, the problem is that the cost of converting a comparison result to a bool value is not considered, so it actually makes the binary code worse. 
For this case, as I summarized below, to complete the same functionality, With PRE is worse than Without PRE on all three paths: * Without PRE, Path1: movl $2, %eax cmpl $1, %eax je .L3 Path2: movb $3, %al cmpl $1, %eax je .L3 Path3: cmpl $1, %eax jne .L14 * With PRE, Path1: movl $1, %ebx movl $2, %eax testb %bl, %bl je .L3 Path2: movl $1, %ebx movb $3, %al testb %bl, %bl je .L3 Path3: cmpl $1, %eax setne %bl testb %bl, %bl jne .L14 Do you have any more thoughts? Thanks, -Jiangning -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Tuesday, August 02, 2011 5:23 PM To: Jiangning Liu Cc: gcc@gcc.gnu.org Subject: Re: A case that PRE optimization hurts performance On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu jiangning@arm.com wrote: Hi, For the following simple test case, PRE optimization hoists the computation (s != 1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs on both x86 and ARM, and I believe it is also a problem for other targets. int f(char *t) { int s=0; while (*t && s != 1) { switch (s) { case 0: s = 2; break; case 2: s = 1; break; default: if (*t == '-') s = 1; break; } t++; } return s; } Taking x86 as an example, with option -O2 you may find 52 instructions generated like below, f: 0: 55 push %ebp 1: 31 c0 xor %eax,%eax 3: 89 e5 mov %esp,%ebp 5: 57 push %edi 6: 56 push %esi 7: 53 push %ebx 8: 8b 55 08 mov 0x8(%ebp),%edx b: 0f b6 0a movzbl (%edx),%ecx e: 84 c9
[PATCH, testsuite, ARM] Change to expected failure for g++.dg/abi/local1.C on ARM target
Here comes a patch to change the case g++.dg/abi/local1.C to expected failure for ARM targets. The C++ ABI for the ARM architecture says: This runs contrary to §2.9.1 of [GC++ABI] which states: It is intended that two type_info pointers point to equivalent type descriptions if and only if the pointers are equal. An implementation must satisfy this constraint, e.g. by using symbol preemption, COMDAT sections, or other mechanisms. Fortunately, we can ignore this requirement without violating the C++ standard provided that: * type_info::operator== and type_info::operator!= compare the strings returned by type_info::name(), not just the pointers to the RTTI objects and their names. * No reliance is placed on the address returned by type_info::name(). (That is, t1.name() != t2.name() does not imply that t1 != t2). The patch is, diff --git a/gcc/testsuite/g++.dg/abi/local1.C b/gcc/testsuite/g++.dg/abi/local1.C index 518193c..7f08a8f 100644 --- a/gcc/testsuite/g++.dg/abi/local1.C +++ b/gcc/testsuite/g++.dg/abi/local1.C @@ -1,4 +1,4 @@ -// { dg-do run } +// { dg-do run { xfail { arm*-*-eabi* } } } // { dg-additional-sources local1-a.cc } #include <typeinfo> ChangeLog: 2011-09-14 Jiangning Liu jiangning@arm.com * g++.dg/abi/local1.C: Change to XFAIL for ARM EABI target. Thanks, -Jiangning
Is VRP is too conservative to identify boolean value 0 and 1?
Hi, For the following small case, int f(int i, int j) { if (i==1 && j==2) return i; else return j; } with the -O2 option, GCC has a vrp2 dump like below, == Value ranges after VRP: i_1: VARYING i_2(D): VARYING D.1249_3: [0, +INF] j_4(D): VARYING D.1250_5: [0, +INF] D.1251_6: [0, +INF] j_10: [2, 2] EQUIVALENCES: { j_4(D) } (1 elements) Removing basic block 3 f (int i, int j) { _Bool D.1251; _Bool D.1250; _Bool D.1249; <bb 2>: D.1249_3 = i_2(D) == 1; D.1250_5 = j_4(D) == 2; D.1251_6 = D.1250_5 & D.1249_3; if (D.1251_6 != 0) goto <bb 3>; else goto <bb 4>; <bb 3>: <bb 4>: # i_1 = PHI <1(3), j_4(D)(2)> return i_1; } Variables D.1249_3, D.1250_5 and D.1251_6 should be boolean values, so their value ranges should be D.1249_3: [0, 1] D.1250_5: [0, 1] D.1251_6: [0, 1] So why can't the current VRP find this value range? I'm asking this question because optimizations in the back-end need this info to do advanced optimization. Thanks, -Jiangning
RE: Is VRP too conservative to identify boolean values 0 and 1?
Andrew, I realize I don't need a back-end solution for my case at all, and in the middle end I can directly use the _Bool type info! I appreciate your reply! Thanks, -Jiangning -Original Message- From: Andrew Pinski [mailto:pins...@gmail.com] Sent: Friday, September 02, 2011 2:27 PM To: Jiangning Liu Cc: gcc@gcc.gnu.org Subject: Re: Is VRP too conservative to identify boolean values 0 and 1? On Thu, Sep 1, 2011 at 10:58 PM, Jiangning Liu jiangning@arm.com wrote: D.1249_3: [0, 1] D.1250_5: [0, 1] D.1251_6: [0, 1] Those are equivalent to [0, MAX] as _Bool only has two different values, 0 and 1 (MAX). Can you explain more about the optimization which you are working on that needs the ranges as (int)[0,1] rather than (_Bool)[0,MAX] ? Thanks, Andrew Pinski
[PATCH, testsuite, ARM] change XFAIL to pass for ARM on a case testing tree-ssa-dom
Test case gcc.dg/tree-ssa/20040204-1.c can pass at -O1 after Richard Guenther rguent...@suse.de fixed something in tree-ssa-dom. The link_error should be optimized away for ARM targets as well. The patch is: diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c b/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c index 45e44a1..470b585 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/20040204-1.c @@ -33,5 +33,5 @@ void test55 (int x, int y) that the && should be emitted (based on BRANCH_COST). Fix this by teaching dom to look through && and register all components as true. */ -/* { dg-final { scan-tree-dump-times "link_error" 0 "optimized" { xfail { ! "alpha*-*-* powerpc*-*-* cris-*-* crisv32-*-* hppa*-*-* i?86-*-* mmix-*-* mips*-*-* m68k*-*-* moxie-*-* sparc*-*-* spu-*-* x86_64-*-*" } } } } */ +/* { dg-final { scan-tree-dump-times "link_error" 0 "optimized" { xfail { ! "alpha*-*-* arm*-*-* powerpc*-*-* cris-*-* crisv32-*-* hppa*-*-* i?86-*-* mmix-*-* mips*-*-* m68k*-*-* moxie-*-* sparc*-*-* spu-*-* x86_64-*-*" } } } } */ /* { dg-final { cleanup-tree-dump "optimized" } } */ gcc/testsuite/ChangeLog: 2011-08-26 Jiangning Liu jiangning@arm.com PR tree-optimization/46021 * gcc.dg/tree-ssa/20040204-1.c: Don't XFAIL on arm*-*-*. Thanks, -Jiangning
RE: [PATCH, ARM] Generate conditional compares in Thumb2 state
Attached is the new patch file. Please review! ChangeLog: 2011-08-18 Jiangning Liu jiangning@arm.com * config/arm/arm.md (*ior_scc_scc): Enable for Thumb2 as well. (*ior_scc_scc_cmp): Likewise. (*and_scc_scc): Likewise. (*and_scc_scc_cmp): Likewise. (*and_scc_scc_nodom): Likewise. (*cmp_ite0, *cmp_ite1, *cmp_and, *cmp_ior): Handle Thumb2. The testsuite ChangeLog would be: 2011-08-18 Jiangning Liu jiangning@arm.com * gcc.target/arm/thumb2-cond-cmp-1.c: New. Make sure conditional compare can be generated. * gcc.target/arm/thumb2-cond-cmp-2.c: Likewise. * gcc.target/arm/thumb2-cond-cmp-3.c: Likewise. * gcc.target/arm/thumb2-cond-cmp-4.c: Likewise. Regression test against cortex-M0/M3/M4 profiles with the -mthumb option doesn't show any new failures. Thanks, -Jiangning fix_cond_cmp_5.patch Description: Binary data
RE: [PATCH, ARM] Generate conditional compares in Thumb2 state
Ramana, I only see the following three combinations as meaningful, * cmp l, lPy // keep cmp, and length is 2, on Thumb * cmp r, rI // keep cmp, and length is 4 * cmp r, L // convert to cmn, and length is 4 According to the ARM ARM, for a negative immediate all encodings for cmp/cmn are 4 bytes long, i.e. * CMP: encodings T2 and A1 * CMN: encodings T1 and A1 so we needn't differentiate Pw and Pv. "Length is 8 bytes if you have Pw and L" — for this case, the length should be 10 for Thumb2 instead. Finally, if we want to describe all possibilities in constraints, we would have the following 9 combinations, * cmp1 has operands (l, l, l, r, r, r, r, r, r) (lPy,lPy,lPy,rI,rI,rI,L, L, L) * cmp2 has operands (l, r, r, l, r, r, l, r, r) (lPy,rI,L, lPy,rI,L, lPy,rI,L) So the length would be like below (I don't know how to write it in the attribute section yet): if (TARGET_THUMB2) { (set_attr "length" "6,8,8,8,10,10,8,10,10")] } else { (set_attr "length" "4,6,6,6,8,8,6,8,8")] } Does it make sense? Thanks, -Jiangning -Original Message- From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org] Sent: Wednesday, August 10, 2011 6:40 PM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, ARM] Generate conditional compares in Thumb2 state On 10 August 2011 09:20, Jiangning Liu jiangning@arm.com wrote: PING... BTW, in patch fix_cond_cmp_2.patch, the file mode of thumb2.md is carelessly changed, so please check attached new patch file fix_cond_cmp_3.patch. Please do not top post. I've been away for the whole of last week and then been away from my desk for the first 2 days this week and am still catching up with email. I'm missing a ChangeLog entry for this. Do I assume it is the same? I've just noticed that the length attribute isn't correct for Thumb2 state. When I last looked at this I must have missed the case where the cmp and the cmn are both 32 bit instructions.
The length can vary between 6 and 10 bytes for the Thumb2 variant from my reading of the ARM ARM, i.e.

cmp reg, 8bitconstant
it cond
cmn reg, 8bitconstant

Length = 6 bytes, or even with

cmp reg, reg
it cond
cmncond reg, reg

All registers between r0 and r7. Length is 6 bytes if you have this for both l constraints. Length is 6 bytes - you should catch this with Pw and Pv respectively. Length is 8 bytes if you have Pw and L. Length is 8 bytes if you have I and Pv. Length is 10 bytes if you have I and L. Showing you an example with l and Py at this point of time and leaving the rest as homework. Yes it will duplicate a few cases but it helps getting the lengths absolutely right.

(define_insn "*cmp_ite0"
  [(set (match_operand 6 "dominant_cc_register" "")
        (compare
         (if_then_else:SI
          (match_operator 4 "arm_comparison_operator"
           [(match_operand:SI 0 "s_register_operand" "l,r,r,r,r")
            (match_operand:SI 1 "arm_add_operand" "lPy,rI,L,rI,L")])
          (match_operator:SI 5 "arm_comparison_operator"
           [(match_operand:SI 2 "s_register_operand" "l,r,r,r,r")
            (match_operand:SI 3 "arm_add_operand" "lPy,rI,rI,L,L")])
          (const_int 0))
         (const_int 0)))]
  "TARGET_32BIT"
  "*
  {
    static const char * const cmp1[5][2] =
    {
      {\"cmp\\t%2, %3\", \"cmp\\t%0, %1\"},
      {\"cmp\\t%2, %3\", \"cmp\\t%0, %1\"},
      {\"cmp\\t%2, %3\", \"cmn\\t%0, #%n1\"},
      {\"cmn\\t%2, #%n3\", \"cmp\\t%0, %1\"},
      {\"cmn\\t%2, #%n3\", \"cmn\\t%0, #%n1\"}
    };
    static const char * const cmp2[5][2] =
    {
      {\"cmp%d5\\t%0, %1\", \"cmp%d4\\t%2, %3\"},
      {\"cmp%d5\\t%0, %1\", \"cmp%d4\\t%2, %3\"},
      {\"cmn%d5\\t%0, #%n1\", \"cmp%d4\\t%2, %3\"},
      {\"cmp%d5\\t%0, %1\", \"cmn%d4\\t%2, #%n3\"},
      {\"cmn%d5\\t%0, #%n1\", \"cmn%d4\\t%2, #%n3\"}
    };
    static const char * const ite[2] =
    {
      \"it\\t%d5\",
      \"it\\t%d4\"
    };
    int swap =
      comparison_dominates_p (GET_CODE (operands[5]), GET_CODE (operands[4]));

    output_asm_insn (cmp1[which_alternative][swap], operands);
    if (TARGET_THUMB2)
      {
        output_asm_insn (ite[swap], operands);
      }
    output_asm_insn (cmp2[which_alternative][swap], operands);
    return \"\";
  }"
  [(set_attr "conds" "set")
   (set_attr "arch" "t2,any,any,any,any")
   (set_attr "length" "6,8,8,8,8")]
)

As for the extra problem exposed by this specific case, may we treat it as a separate fix to decouple it from this one, and I can follow up later on? I think it is a general problem, not only for the particular pattern it/op/it/op. But I'm not sure how far we can go to optimize this kind of problem introduced by IT blocks. Currently the way in which the Thumb2 backend generates conditional instructions and combines them further with other IT blocks is by running a state machine at the very end before assembly
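On Jiangning's open question of how to write the varying lengths in the attribute section: one hedged, untested possibility (my sketch, not reviewed code) is to replace the flat set_attr with set_attr_alternative and select each alternative's length with the classic symbol_ref test on TARGET_THUMB2, using the 9-combination numbers quoted earlier in the thread:

```
(set_attr_alternative "length"
  [(if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 6) (const_int 4))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 8) (const_int 6))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 8) (const_int 6))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 8) (const_int 6))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 10) (const_int 8))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 10) (const_int 8))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 8) (const_int 6))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 10) (const_int 8))
   (if_then_else (ne (symbol_ref "TARGET_THUMB2") (const_int 0))
                 (const_int 10) (const_int 8))])
```

Whether genattrtab accepts this exact spelling in arm.md would need checking against the GCC internals documentation; existing ARM patterns do use similar if_then_else length computations.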
RE: [PATCH, ARM] Generate conditional compares in Thumb2 state
PING... BTW, in patch fix_cond_cmp_2.patch, the file mode of thumb2.md is carelessly changed, so please check attached new patch file fix_cond_cmp_3.patch. Thanks, -Jiangning -Original Message- From: Jiangning Liu [mailto:jiangning@arm.com] Sent: Monday, August 08, 2011 2:01 PM To: 'Ramana Radhakrishnan' Cc: gcc-patches@gcc.gnu.org Subject: RE: [PATCH, ARM] Generate conditional compares in Thumb2 state In attached new patch, arm_arch_thumb2 is changed to TARGET_THUMB2. I tried that with my patch command line option -mcpu=armv7-a9 doesn't generate IT instruction any longer, unless option -mthumb is being added. All of my tests assume command line option -mthumb, while cortex-M0, cortex-M3 cortex-M4 are covered by options -mcpu=cortex-m0, - march=armv7-m, and -march=armv7e-m respectively. As for the extra problem exposed by this specific case, may we treat it as a separate fix to decouple it with this one, and I can give follow up later on? I think it is a general problem not only for the particular pattern it/op/it/op. But I'm not sure how far we can go to optimize this kind of problems introduced by IT block. For this specific case, I see if conversion already generates conditional move before combination pass. So basically the peephole rules may probably work for most of the general scenarios. My initial thought is go over the rules introducing IT block and try to figure out all of the combination that two of this kinds of rules can be in sequential order. Maybe we only need to handle the most common case like this one. Since I do see a bunch of rules have something to do with problem, I'd like to look into all of them to give a most reasonable solution in a separate fix. Does it make sense? 
Thanks, -Jiangning -Original Message- From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org] Sent: Friday, August 05, 2011 9:20 AM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, ARM] Generate conditional compares in Thumb2 state On 3 August 2011 08:48, Jiangning Liu jiangning@arm.com wrote: This patch is to generate more conditional compare instructions in Thumb2 state. Given an example like below, int f(int i, int j) { if ( (i == '+') || (j == '-') ) { return i; } else { return j; } } Without the patch, compiler generates the following codes, sub r2, r0, #43 rsbs r3, r2, #0 adc r3, r3, r2 cmp r1, #45 it eq orreq r3, r3, #1 cmp r3, #0 it eq moveq r0, r1 bx lr With the patch, compiler can generate conditional jump like below, cmp r0, #43 it ne cmpne r1, #45 it ne movne r0, r1 bx lr Nice improvement but there could be a single it block to handle both and thus you could make this even better with cmp r0, #43 itt ne cmpne r1 ,#45 movne r0, r1 The way to do this would be to try and split this post-reload unfortunately into the cmp instruction and the conditional compare with the appropriate instruction length - Then the backend has a chance of merging some of this into a single instruction. Unfortunately that won't be very straight-forward but that's a direction we probably ought to proceed with in this case. In a number of places: + if (arm_arch_thumb2) Ah instead of this please use if (TARGET_THUMB2) - arm_arch_thumb2 is true based on the architecture levels and not necessarily if the user wants to generate Thumb code. I don't want an unnecessary IT instruction being emitted in the ASM block in ARM state for v7-a and above. Tested against arm-none-eabi target and no regression found. Presumably for ARM and Thumb2 state ? cheers Ramana fix_cond_cmp_3.patch Description: Binary data
RE: [PATCH, ARM] Generate conditional compares in Thumb2 state
In the attached new patch, arm_arch_thumb2 is changed to TARGET_THUMB2. I tried that; with my patch, the command line option -mcpu=armv7-a9 doesn't generate IT instructions any longer, unless the option -mthumb is added. All of my tests assume the command line option -mthumb, while cortex-M0, cortex-M3 and cortex-M4 are covered by the options -mcpu=cortex-m0, -march=armv7-m, and -march=armv7e-m respectively. As for the extra problem exposed by this specific case, may we treat it as a separate fix to decouple it from this one, and I can follow up later on? I think it is a general problem, not only for the particular pattern it/op/it/op. But I'm not sure how far we can go to optimize this kind of problem introduced by IT blocks. For this specific case, I see if-conversion already generates a conditional move before the combine pass. So basically peephole rules would probably work for most of the general scenarios. My initial thought is to go over the rules introducing IT blocks and try to figure out all of the combinations in which two of these kinds of rules can be in sequential order. Maybe we only need to handle the most common case like this one. Since I do see a bunch of rules that have something to do with this problem, I'd like to look into all of them to give a most reasonable solution in a separate fix. Does it make sense? Thanks, -Jiangning -Original Message- From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org] Sent: Friday, August 05, 2011 9:20 AM To: Jiangning Liu Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, ARM] Generate conditional compares in Thumb2 state On 3 August 2011 08:48, Jiangning Liu jiangning@arm.com wrote: This patch is to generate more conditional compare instructions in Thumb2 state.
Given an example like below, int f(int i, int j) { if ( (i == '+') || (j == '-') ) { return i; } else { return j; } } Without the patch, compiler generates the following codes, sub r2, r0, #43 rsbs r3, r2, #0 adc r3, r3, r2 cmp r1, #45 it eq orreq r3, r3, #1 cmp r3, #0 it eq moveq r0, r1 bx lr With the patch, compiler can generate conditional jump like below, cmp r0, #43 it ne cmpne r1, #45 it ne movne r0, r1 bx lr Nice improvement but there could be a single it block to handle both and thus you could make this even better with cmp r0, #43 itt ne cmpne r1 ,#45 movne r0, r1 The way to do this would be to try and split this post-reload unfortunately into the cmp instruction and the conditional compare with the appropriate instruction length - Then the backend has a chance of merging some of this into a single instruction. Unfortunately that won't be very straight-forward but that's a direction we probably ought to proceed with in this case. In a number of places: + if (arm_arch_thumb2) Ah instead of this please use if (TARGET_THUMB2) - arm_arch_thumb2 is true based on the architecture levels and not necessarily if the user wants to generate Thumb code. I don't want an unnecessary IT instruction being emitted in the ASM block in ARM state for v7-a and above. Tested against arm-none-eabi target and no regression found. Presumably for ARM and Thumb2 state ? cheers Ramana fix_cond_cmp_2.patch Description: Binary data
[PATCH, ARM] Generate conditional compares in Thumb2 state
This patch is to generate more conditional compare instructions in Thumb2 state. Given an example like below, int f(int i, int j) { if ( (i == '+') || (j == '-') ) { return i; } else { return j; } } Without the patch, the compiler generates the following code, sub r2, r0, #43 rsbs r3, r2, #0 adc r3, r3, r2 cmp r1, #45 it eq orreq r3, r3, #1 cmp r3, #0 it eq moveq r0, r1 bx lr With the patch, the compiler can generate conditional compares like below, cmp r0, #43 it ne cmpne r1, #45 it ne movne r0, r1 bx lr The patch essentially inserts the *it* instruction for the following rules in arm.md, * cmp_ite0 * cmp_ite1 * cmp_and * cmp_ior Tested against the arm-none-eabi target and no regression found. The source code ChangeLog would be: 2011-07-29 Jiangning Liu jiangning@arm.com * config/arm/arm.md (*ior_scc_scc): Enable for Thumb2 as well. (*ior_scc_scc_cmp): Likewise. (*and_scc_scc): Likewise. (*and_scc_scc_cmp): Likewise. (*and_scc_scc_nodom): Likewise. (*cmp_ite0, *cmp_ite1, *cmp_and, *cmp_ior): Handle Thumb2. The testsuite ChangeLog would be: 2011-07-29 Jiangning Liu jiangning@arm.com * gcc.target/arm/thumb2-cond-cmp-1.c: New. Make sure conditional compare can be generated. * gcc.target/arm/thumb2-cond-cmp-2.c: Likewise. * gcc.target/arm/thumb2-cond-cmp-3.c: Likewise. * gcc.target/arm/thumb2-cond-cmp-4.c: Likewise. Thanks, -Jiangning fix_cond_cmp.patch Description: Binary data
RE: [RFC] Add middle end hook for stack red zone size
The answer is ARM can. However, if you look into the bugs PR30282, PR38644 and PR44199, you may find that in history there have been several different cases in different ports reporting similar failures, covering x86, PowerPC and ARM. You are right, they were all fixed in back-ends in the past, but we should fix the bug in a general way to make the GCC infrastructure stronger, rather than fixing the problem target-by-target and case-by-case! If you further look into the back-end fixes in x86 and PowerPC, you may find they look quite similar. Thanks, -Jiangning -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jakub Jelinek Sent: Monday, August 01, 2011 5:12 PM To: Jiangning Liu Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana Radhakrishnan' Subject: Re: [RFC] Add middle end hook for stack red zone size On Mon, Aug 01, 2011 at 11:44:04AM +0800, Jiangning Liu wrote: It's quite necessary to solve the general problem in middle-end rather than in back-end. That's what we disagree on. All back-ends but ARM are able to handle it right, why can't ARM too? The ABI rules for stack handling in the epilogues are simply too diverse and complex to be handled easily in the scheduler. Jakub
A case that PRE optimization hurts performance
Hi, For the following simple test case, PRE optimization hoists the computation (s != 1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs on both x86 and ARM, and I believe it is also a problem for other targets.

int f(char *t) {
    int s = 0;
    while (*t && s != 1) {
        switch (s) {
        case 0: s = 2; break;
        case 2: s = 1; break;
        default: if (*t == '-') s = 1; break;
        }
        t++;
    }
    return s;
}

Taking x86 as an example, with option -O2 you may find 52 instructions generated like below,

f:
   0:	55                      push   %ebp
   1:	31 c0                   xor    %eax,%eax
   3:	89 e5                   mov    %esp,%ebp
   5:	57                      push   %edi
   6:	56                      push   %esi
   7:	53                      push   %ebx
   8:	8b 55 08                mov    0x8(%ebp),%edx
   b:	0f b6 0a                movzbl (%edx),%ecx
   e:	84 c9                   test   %cl,%cl
  10:	74 50                   je     62 <f+0x62>
  12:	83 c2 01                add    $0x1,%edx
  15:	85 c0                   test   %eax,%eax
  17:	75 23                   jne    3c <f+0x3c>
  19:	8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
  20:	0f b6 0a                movzbl (%edx),%ecx
  23:	84 c9                   test   %cl,%cl
  25:	0f 95 c0                setne  %al
  28:	89 c7                   mov    %eax,%edi
  2a:	b8 02 00 00 00          mov    $0x2,%eax
  2f:	89 fb                   mov    %edi,%ebx
  31:	83 c2 01                add    $0x1,%edx
  34:	84 db                   test   %bl,%bl
  36:	74 2a                   je     62 <f+0x62>
  38:	85 c0                   test   %eax,%eax
  3a:	74 e4                   je     20 <f+0x20>
  3c:	83 f8 02                cmp    $0x2,%eax
  3f:	74 1f                   je     60 <f+0x60>
  41:	80 f9 2d                cmp    $0x2d,%cl
  44:	74 22                   je     68 <f+0x68>
  46:	0f b6 0a                movzbl (%edx),%ecx
  49:	83 f8 01                cmp    $0x1,%eax
  4c:	0f 95 c3                setne  %bl
  4f:	89 df                   mov    %ebx,%edi
  51:	84 c9                   test   %cl,%cl
  53:	0f 95 c3                setne  %bl
  56:	89 de                   mov    %ebx,%esi
  58:	21 f7                   and    %esi,%edi
  5a:	eb d3                   jmp    2f <f+0x2f>
  5c:	8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
  60:	b0 01                   mov    $0x1,%al
  62:	5b                      pop    %ebx
  63:	5e                      pop    %esi
  64:	5f                      pop    %edi
  65:	5d                      pop    %ebp
  66:	c3                      ret
  67:	90                      nop
  68:	b8 01 00 00 00          mov    $0x1,%eax
  6d:	5b                      pop    %ebx
  6e:	5e                      pop    %esi
  6f:	5f                      pop    %edi
  70:	5d                      pop    %ebp
  71:	c3                      ret

But with the command line options -O2 -fno-tree-pre, there are only 12 instructions generated, and the code would be very clean like below,

f:
   0:	55                      push   %ebp
   1:	31 c0                   xor    %eax,%eax
   3:	89 e5                   mov    %esp,%ebp
   5:	8b 55 08                mov    0x8(%ebp),%edx
   8:	80 3a 00                cmpb   $0x0,(%edx)
   b:	74 0e                   je     1b <f+0x1b>
   d:	80 7a 01 00             cmpb   $0x0,0x1(%edx)
  11:	b0 02                   mov    $0x2,%al
  13:	ba 01 00 00 00          mov    $0x1,%edx
  18:	0f 45 c2                cmovne %edx,%eax
  1b:	5d                      pop    %ebp
  1c:	c3                      ret

Do you have any idea about this? Thanks, -Jiangning
RE: [RFC] Add middle end hook for stack red zone size
Hi Jakub, I appreciate your valuable comments! I think the SPARC V9 ABI doesn't have a red zone defined, right? So stack_red_zone_size would be defined as zero by default, and the scheduler would block moving memory accesses across a stack adjustment no matter what the offset is. I don't see any risk here. Also, in my patch the function *abs* is being used to avoid the opposite-stack-direction issue you mentioned. You and others insist on the ABI diversity point, and actually I agree with you on this. But part of the ABI definition is general for all targets. The point here is that memory accesses beyond the stack red zone should be avoided, which is the general part of the ABI that the compiler should guarantee. For this general part, the middle end should take the responsibility. Thanks, -Jiangning -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Monday, August 01, 2011 6:31 PM To: Jiangning Liu Cc: 'Joern Rennecke'; gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana Radhakrishnan' Subject: Re: [RFC] Add middle end hook for stack red zone size On Mon, Aug 01, 2011 at 06:14:27PM +0800, Jiangning Liu wrote: ARM. You are right, they were all fixed in back-ends in the past, but we should fix the bug in a general way to make GCC infrastructure stronger, rather than fixing the problem target-by-target and case-by-case! If you further look into the back-end fixes in x86 and PowerPC, you may find they looks quite similar in back-ends. Red zone is only one difficulty, your patch is e.g. completely ignoring existence of biased stack pointers (e.g. SPARC -m64 has them). Some targets have stack growing in opposite direction, etc. We have really a huge amount of very diverse ABIs and making the middle-end grok what is an invalid stack access is difficult. Jakub
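To make the "general part of the ABI" argument concrete, here is a hypothetical, target-neutral epilogue fragment (my illustration, not from any port) showing the hazard the proposed hook is meant to block:

```
; valid order                     ; after an invalid scheduler move
ldr   r4, [sp, #8]                add   sp, sp, #16     ; deallocate frame
add   sp, sp, #16                 ldr   r4, [sp, #-8]   ; now reads below sp:
bx    lr                          bx    lr              ; with no (or a small)
                                                        ; red zone, an interrupt
                                                        ; can clobber this slot
```

If the target's stack_red_zone_size is at least 8, the moved load is still safe; if it is 0, the scheduler must keep the load above the adjustment, which is exactly the dependence the proposed hook adds.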
RE: [RFC] Add middle end hook for stack red zone size
Joern, Thanks for your valuable feedback! This is only an RFC, and I will send out a formal patch along with a ChangeLog later on. Basically, my patch only adds a new dependence in the scheduler, and it only blocks some instruction movements, so there is NO RISK to compiler correctness. For whatever stack pointer changes you gave in different scenarios, the current code base should already work. My patch intends neither to replace the old dependences, nor to maximize the scheduler capability given the existence of the red zone in the stack. It is only to block a memory access moving over a stack pointer adjustment if the distance is beyond the red zone size, which is an OS requirement due to the existence of interrupts. Stack adjustment in the epilogue is a very general usage of the stack frame. It's quite necessary to solve the general problem in the middle-end rather than in the back-end. Also, that old patch you attached solves the data dependence between two memory accesses, but the stack pointer doesn't really have a data dependence with a memory access that doesn't use the stack pointer, so they are different stories. An alternative solution without adding a blunt scheduling barrier is to insert an independent pass before the scheduler to create an RTL barrier using the same interface stack_red_zone_size, but it would really be over-design to add a new pass only for this *small* functionality. In my patch, the *abs* of the offset is being used, so you are right that it's possible to get false positives and be too conservative, but there won't be false negatives, because my code only adds new dependences. Since compilation is per function, it would be OK if the red zone size varies due to different ABIs. Could you please tell me on exactly what system supported by GCC the red zone size can be different for incoming and outgoing? And also how does the scheduler guarantee correctness in the current code base? Anyway, I don't think my patch will break the original solution.
Thanks, -Jiangning -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Joern Rennecke Sent: Tuesday, July 26, 2011 10:33 AM To: Jiangning Liu Cc: gcc@gcc.gnu.org; gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana Radhakrishnan' Subject: RE: [RFC] Add middle end hook for stack red zone size Quoting Jiangning Liu jiangning@arm.com: Hi, One month ago, I sent out this RFC to *gcc-patches* mail list, but I didn't receive any response yet. So I'm forwarding this mail to *gcc* mail list. Can anybody here really give feedback to me? Well, I couldn't approve any patch, but I can point out some issues with your patch. First, it's missing a ChangeLog, and you don't state how you have tested it. And regarding the code in sched_analyze_1, I think you'll get false positives with alloca, and false negatives when registers are involved to compute offsets or to restore the stack pointer from. FWIW, I think generally blunt scheduling barriers should be avoided, and instead the dependencies made visible to the scheduler. E.g., I've been working with another architecture with a redzone, where at -fno- omit-frame-pointer, the prologue can put pretend_args into the redzone, then after stack adjustment and frame allocation, these arguments are accessed via the frame pointer. With the attached patchlet, alias analysis works for this situation too, so no blunt scheduling block is required. Likewise, with stack adjustments, they should not affect scheduling in general, but be considered to clobber the part of the frame that is being exposed to interrupt writes either before or after the adjustment. At the moment, each port that wants to have such selective scheduling blockages has to define a stack_adjust pattern with a memory clobber in a parallel, with a memref that shows the high water mark of possible interrupt stack writes. 
Prima facie it would seem convenient if you only had to tell the middle-end about the redzone size, and it could figure out the implicit clobbers when the stack is changed. However, when a big stack adjustment is being made, or the stack is realigned, or restored from the frame pointer / another register where it was saved due to realignment, the adjustment is not so obvious. I'm not sure if you can actually create a robust interface that's simpler to use than putting the right memory clobber in the stack adjust pattern. Note also that the redzone size can vary from function to function depending on ABI-altering attributes, in particular for interrupt functions, which can also have different incoming and outgoing redzone sizes. Plus, you can have an NMI / reset handler which can use the stack like an ordinary address register.
RE: [RFC] Add middle end hook for stack red zone size
Joern, Thanks for your valuable feedback! This is only a RFC, and I will send out formal patch along with ChangLog later on. Basically, my patch is only to add new dependence in scheduler, and it only blocks some instruction movements, so it is NO RISK to compiler correctness. For whatever stack pointer changes you gave in different scenarios, the current code base should already work. My patch intends neither to replace old dependences, nor maximize the scheduler capability due to the existence of red zone in stack. It is only to block the memory access moving over stack pointer adjustment if distance is beyond red zone size, which is an OS requirement due to interruption existence. Stack adjustment in epilogue is a very general usage in stack frame. It's quite necessary to solve the general problem in middle-end rather than in back-end. Also, that old patch you attached is to solve the data dependence between two memory accesses, but stack pointer doesn't really have data dependence with memory access without using stack pointer, so they have different stories. Alternative solution of without adding blunt scheduling barrier is we insert an independent pass before scheduler to create RTL barrier by using the same interface stack_red_zone_size, but it would really be an over-design, if we add a new pass only for this *small* functionality. In my patch, *abs* of offset is being used, so you are right that it's possible to get false positive to be too conservative, but there won't exist false negative, because my code would only add new dependences. Since the compilation is based on function, it would be OK if red zone size varies due to different ABI. Could you please tell me exactly on what system being supported by GCC red zone size can be different for incoming and outgoing? And also how scheduler guarantee the correctness in current code base? Anyway, I don't think my patch will break the original solution. 
Thanks, -Jiangning -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Joern Rennecke Sent: Tuesday, July 26, 2011 10:33 AM To: Jiangning Liu Cc: g...@gcc.gnu.org; gcc-patches@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana Radhakrishnan; 'Ramana Radhakrishnan' Subject: RE: [RFC] Add middle end hook for stack red zone size Quoting Jiangning Liu jiangning@arm.com: Hi, One month ago, I sent out this RFC to *gcc-patches* mail list, but I didn't receive any response yet. So I'm forwarding this mail to *gcc* mail list. Can anybody here really give feedback to me? Well, I couldn't approve any patch, but I can point out some issues with your patch. First, it's missing a ChangeLog, and you don't state how you have tested it. And regarding the code in sched_analyze_1, I think you'll get false positives with alloca, and false negatives when registers are involved to compute offsets or to restore the stack pointer from. FWIW, I think generally blunt scheduling barriers should be avoided, and instead the dependencies made visible to the scheduler. E.g., I've been working with another architecture with a redzone, where at -fno- omit-frame-pointer, the prologue can put pretend_args into the redzone, then after stack adjustment and frame allocation, these arguments are accessed via the frame pointer. With the attached patchlet, alias analysis works for this situation too, so no blunt scheduling block is required. Likewise, with stack adjustments, they should not affect scheduling in general, but be considered to clobber the part of the frame that is being exposed to interrupt writes either before or after the adjustment. At the moment, each port that wants to have such selective scheduling blockages has to define a stack_adjust pattern with a memory clobber in a parallel, with a memref that shows the high water mark of possible interrupt stack writes. 
Prima facie, it would seem convenient if you only had to tell the middle end about the red zone size and it could figure out the implicit clobbers whenever the stack is changed. However, when a big stack adjustment is being made, or the stack is realigned, or restored from the frame pointer / another register where it was saved due to realignment, the adjustment is not so obvious. I'm not sure you can actually create a robust interface that's simpler to use than putting the right memory clobber in the stack adjust pattern.

Note also that the red zone size can vary from function to function depending on ABI-altering attributes, in particular for interrupt functions, which can also have different incoming and outgoing red zone sizes. Plus, you can have an NMI / reset handler which can use the stack like an ordinary address register.
RE: [RFC] Add middle end hook for stack red zone size
Hi, One month ago I sent this RFC to the *gcc-patches* mailing list, but I haven't received any response yet, so I'm forwarding it to the *gcc* mailing list. Can anybody here give me feedback? Appreciate your help in advance! -Jiangning

-----Original Message-----
From: Ramana Radhakrishnan [mailto:ramana.radhakrish...@linaro.org]
Sent: Tuesday, July 19, 2011 6:18 PM
To: Jiangning Liu
Cc: gcc-patc...@gcc.gnu.org; vmaka...@redhat.com; dje@gmail.com; Richard Henderson; Ramana Radhakrishnan
Subject: Re: [RFC] Add middle end hook for stack red zone size

2011/7/19 Jiangning Liu jiangning@arm.com:
I see a lot of feedback on other posts, but mine still has ZERO responses after three weeks, so I'm wondering if I made any mistake in my process. Who can help me?

It would be worth CC'ing the other relevant target maintainers as well to get some feedback, since the patch touches ARM, x86 and PowerPC. I've added the maintainers for i386 and PPC to the CC list using the email addresses from the MAINTAINERS file.

Thanks, Ramana

Thanks, -Jiangning

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: Tuesday, July 05, 2011 8:32 AM
To: gcc-patc...@gcc.gnu.org; rgue...@gcc.gnu.org
Subject: RE: [RFC] Add middle end hook for stack red zone size

PING... I just merged with the latest code base and generated a new patch, as attached.

Thanks, -Jiangning

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jiangning Liu
Sent: 28 June 2011, 4:38 PM
To: gcc-patc...@gcc.gnu.org
Subject: [RFC] Add middle end hook for stack red zone size

This patch fixes PR38644, a long-standing bug concerning stack red zone access; PR30282 is related. Originally, the red zone concept was not exposed to the middle end, and each back end used special logic to add extra memory-barrier RTL to ensure correct dependences in the middle end.
This way, different back ends must handle the red zone problem by themselves. For example, the x86 target introduced the function ix86_using_red_zone() to judge red zone accesses, while POWER introduced offset_below_red_zone_p(). Note that they have different semantics, but the logic at the caller sites in each back end uses them to decide whether to add memory-barrier RTL. If a back end handles this incorrectly, a bug is introduced. Therefore, the correct approach is for the middle end to handle the red-zone-related logic, avoiding the burden on each back end. To be specific, for PR38644 this middle-end problem causes incorrect behavior on the ARM target.

This patch exposes the red zone concept to the middle end by introducing a middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE, defined in target.def, whose default value is 0. A back end may redefine this function to provide its concrete red zone size according to its ABI requirements. In the middle end, scheduling dependence computation is modified to use this hook, plus a check for stack/frame pointer adjustment instructions, to decide whether all memory references need to be flushed out. In theory, if TARGET_STACK_RED_ZONE_SIZE is defined correctly, a back end no longer needs to handle this scheduling dependence issue specially by introducing extra memory-barrier RTL.

In the back ends, the following changes are made to define the hook:
1) For x86, TARGET_STACK_RED_ZONE_SIZE is redefined to ix86_stack_red_zone_size() in i386.c, a newly introduced function.
2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to rs6000_stack_red_zone_size() in rs6000.c, also a newly defined function.
3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to default_stack_red_zone_size in targhooks.c, which returns 0, meaning ARM EABI and the other targets don't support red zone access at all.
In summary, the relationship between ABI and red zone access is as below:

 ------------------------------------------------------------------
 |     ARCH      |  ARM  |      X86      |      POWER     | others |
 |---------------|-------|-------|-------|---------|------|--------|
 |     ABI       | EABI  | MS_64 | other |   AIX   |  V4  |        |
 |---------------|-------|-------|-------|---------|------|--------|
 |   RED ZONE    |  No   |  YES  |  No   |   YES   |  No  |   No   |
 |---------------|-------|-------|-------|---------|------|--------|
 | RED ZONE SIZE |   0   |  128  |   0   | 220/288 |  0   |   0    |
 ------------------------------------------------------------------

Thanks, -Jiangning
[RFC] Add middle end hook for stack red zone size
This patch fixes PR38644, a long-standing bug concerning stack red zone access; PR30282 is related. Originally, the red zone concept was not exposed to the middle end, and each back end used special logic to add extra memory-barrier RTL to ensure correct dependences in the middle end. This way, different back ends must handle the red zone problem by themselves. For example, the x86 target introduced the function ix86_using_red_zone() to judge red zone accesses, while POWER introduced offset_below_red_zone_p(). Note that they have different semantics, but the logic at the caller sites in each back end uses them to decide whether to add memory-barrier RTL. If a back end handles this incorrectly, a bug is introduced. Therefore, the correct approach is for the middle end to handle the red-zone-related logic, avoiding the burden on each back end. To be specific, for PR38644 this middle-end problem causes incorrect behavior on the ARM target.

This patch exposes the red zone concept to the middle end by introducing a middle-end/back-end hook TARGET_STACK_RED_ZONE_SIZE, defined in target.def, whose default value is 0. A back end may redefine this function to provide its concrete red zone size according to its ABI requirements. In the middle end, scheduling dependence computation is modified to use this hook, plus a check for stack/frame pointer adjustment instructions, to decide whether all memory references need to be flushed out. In theory, if TARGET_STACK_RED_ZONE_SIZE is defined correctly, a back end no longer needs to handle this scheduling dependence issue specially by introducing extra memory-barrier RTL.

In the back ends, the following changes are made to define the hook:
1) For x86, TARGET_STACK_RED_ZONE_SIZE is redefined to ix86_stack_red_zone_size() in i386.c, a newly introduced function.
2) For POWER, TARGET_STACK_RED_ZONE_SIZE is redefined to rs6000_stack_red_zone_size() in rs6000.c, also a newly defined function.
3) For ARM and others, TARGET_STACK_RED_ZONE_SIZE is defined to default_stack_red_zone_size in targhooks.c, which returns 0, meaning ARM EABI and the other targets don't support red zone access at all.

In summary, the relationship between ABI and red zone access is as below:

 ------------------------------------------------------------------
 |     ARCH      |  ARM  |      X86      |      POWER     | others |
 |---------------|-------|-------|-------|---------|------|--------|
 |     ABI       | EABI  | MS_64 | other |   AIX   |  V4  |        |
 |---------------|-------|-------|-------|---------|------|--------|
 |   RED ZONE    |  No   |  YES  |  No   |   YES   |  No  |   No   |
 |---------------|-------|-------|-------|---------|------|--------|
 | RED ZONE SIZE |   0   |  128  |   0   | 220/288 |  0   |   0    |
 ------------------------------------------------------------------

Thanks, -Jiangning

stack-red-zone-patch-38644-3.patch Description: Binary data