RE: missing conditional propagation in cprop.c pass
insn 882: cc <- compare (r684, 0)
jump_insn 883: if (cc != 0) goto insn 46
insn 49: r291 <- r684
...
insn 46

cc contains the result of subtracting 0 from r684; control flow falls through to insn 49 only if (cc == 0), which implies (r684 == 0). Then at insn 49 we can conditionally propagate the constant, giving r291 <- 0. Is that right?

I believe the optimization you may be referring to is value range propagation, which predicates values on the conditions that guard them. GCC definitely applies VRP at the tree stage; I am not sure whether there is an RTL pass that does the same.
RE: missing conditional propagation in cprop.c pass
On 09/29/11 17:36, Jeff Law wrote:
> On 09/29/11 09:26, Bernd Schmidt wrote:
>> ISTR cse.c has some support for this.
> cprop.c -- see references to implicit_sets.

cse too: record_jump_equiv.

Interesting. Are the two approaches subtly different, or do they apply precisely the same predication?
RE: Clustering switch cases
I have been working on your patch but I didn't manage to get it working yet. Unfortunately our current stable GCC version is 4.3.4, and our 4.4.x still has some issues. I tried backporting your patch to GCC 4.3.4 but it introduced several regressions. My guess is this is due to my incorrect manual patching of 4.3. Before I go through the regressions on our GCC 4.3 and try to fix the patching, do you have by any chance a patch against 4.3.x?

Can you try the attached patch again with GCC 4.4.1 and let me know if it fixes your regressions? There are related changes to how basic blocks are split that may need changing when using switch clustering. The changes to stmt.c are the same as before. The additional changes are to cfgrtl.c and cfgbuild.c. Cheers, Rahul

switch_case_clusters.patch Description: switch_case_clusters.patch
RE: Clustering switch cases
I will be looking at the patch Rahul posted and will try to see if I can improve on it.

See attached patch (again) that Paulo is referring to. Sending to GCC failed due to email client issues. I have another patch for http://gcc.gnu.org/ml/gcc/2010-08/msg00413.html, which I will send out shortly. Rahul

switch_case_clusters.patch Description: switch_case_clusters.patch
RE: branch probabilities on multiway branches
Hi Edmar, I have been testing your patch on our port of GCC 4.4.1. The patch works well, although I stumbled on an issue. Many of our interesting cases have large case values, so profiling the default range of values 0 - 255, or even as large as 0 - 2047, is insufficient. Profiling for larger ranges is impractical. I am attempting to use edge frequencies instead of value profiling to effect the same transformation. We also use an adapted form of the patch below to form sparse clusters of contiguous large case values and shrink the switch-case jump table. I have merged your work into this patch so we can form a probability/frequency-based balanced decision tree for the more likely cases, and jump tables with clusters otherwise.

http://gcc.gnu.org/ml/gcc-patches/2004-07/msg02479.html

Here's a modified test case from your patch to demonstrate the need for clusters in addition to a cost-balanced decision tree.

enum { NEVER=1000, HARDLY=10001, FOO=1002, FOO1=1003, FOO2=1004, FOO3=1005,
       FOO4=1006, OFTEN=2500, BAR=2501, DUMMY=2502, DUMMY1=2503, DUMMY2=2504,
       DUMMY3=2505 };

int seq[] = { OFTEN, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY, BAR, FOO, OFTEN, FOO,
              OFTEN, HARDLY, OFTEN, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY, OFTEN,
              OFTEN, OFTEN, OFTEN, OFTEN, HARDLY, OFTEN, OFTEN, OFTEN, OFTEN,
              OFTEN, HARDLY, 300, -1, OFTEN, OFTEN, OFTEN, OFTEN, HARDLY, };

int result;

int test_it (int k, int i)
{
  int j;
  j = k;
  switch (seq[i])
    {
    case NEVER:  j = j * i; break;
    case FOO:    j = j + k; break;
    case FOO1:   j = j + (k << 2); break;
    case FOO2:   j = j + (k << 5); break;
    case FOO3:   j = j + (k << 6); break;
    case FOO4:   j = j + (k << 7); break;
    case BAR:    j = j + i + k;
    case DUMMY:  j = j * (k + i); break;
    case DUMMY1: j = j * (k + (i << 2)); break;
    case OFTEN:  j = j + i; break;
    case DUMMY2: j = j * (k + (i << 5)); break;
    case DUMMY3: j = j * (k + (i << 6)); break;
    default:     j = j * k; break;
    }
  return j;
}

int main (int argc, char **argv)
{
  int i;
  for (i = 0; i < sizeof(seq)/sizeof(seq[0]); i++)
    result = test_it (1, i);
  return 0;
}

Cheers, Rahul

-----Original Message-----
From: Rahul Kharche Sent: 22 April 2010 11:24 To: Edmar Wienskoski Cc: Steven Bosscher; Jan Hubicka; gcc@gcc.gnu.org; sdkteam-gnu Subject: RE: branch probabilities on multiway branches Thanks Edmar! I will try and work your patch into our GCC 4.4.1 port and get some results. Cheers, Rahul
RE: branch probabilities on multiway branches
Thanks Edmar! I will try and work your patch into our GCC 4.4.1 port and get some results. Cheers, Rahul
RE: branch probabilities on multiway branches
Hi Steven,

> There is a transformation implemented in GCC's value profiling to put the most-frequently taken case-label of a switch-expr at the top of the switch

Could you point me to the bit of code that does this? I'm exploring the idea of implementing source hints, much like __builtin_expect, which would assign a probability to a case range in a switch construct. I will have to tweak the case expansion phase to implement a transformation similar to the one you mentioned earlier. Cheers, Rahul
RE: branch probabilities on multiway branches
The calculate-branch-probabilities algorithm (algorithm 1) in the Wu-Larus paper also evenly distributes branch probabilities when the number of outgoing edges is > 2, e.g. switch cases implemented as jump tables. Are there any known heuristics to generate better branch probabilities in this case?

-----Original Message-----
From: Jan Hubicka [mailto:hubi...@ucw.cz]
Sent: 14 April 2010 00:44
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-gnu
Subject: Re: branch probabilities on multiway branches

> Hi All, The following bit of code in predict.c implies branch probabilities are strictly evenly distributed for multiway branches at present. The comment suggests it is possible to generate better estimates for more generic cases, apart from being involved. Could anyone point me to the reference and/or if an implementation exists already.

There is the Wu-Larus paper cited in the comment at the beginning of the file. Honza
RE: branch probabilities on multiway branches
> What is the problem you're trying to solve?

Generally speaking, I was looking for better logic, based on estimated branch probabilities, to decide between the binary search tree and jump table implementations of a switch case. One interesting test case is where the gross structure of a function is a switch case. The function is parameterized on the input to the switch, i.e. of the following type:

void foo (int val, int arg1, ...)
{
  switch (val)
    {
    case 100: /* blah1 */ break;
    case 101: /* blah2 */ break;
    .
    .
    .
    case 1: /* blah1 */ break;
    default: ;
    }
}

On the basis of the static call graph, the estimated frequencies of the calls to foo (), and assuming argument val is constant at each call site, is it possible to optimize the switch case in foo ()? Perhaps along the lines of pulling out the most frequent case, like you mentioned in your example, or specializing foo () on frequent values of val. In this case inlining and VRP would have automatically done what I am suggesting; inlining, however, is not attempted because foo () is estimated to be large. Many Thanks, Rahul
RE: branch probabilities on multiway branches
The other case I'm working on is to selectively apply tail-call optimization when optimizing for size. Clearly tail-call optimization is desirable along frequently executed edges. Otherwise, we found, tail-call optimization generates a sibcall_epilogue for each tail call, which has a significant impact on size; this is especially true if the generated epilogue is large. I have a patch which prunes tail calls along edges determined from profile information to be infrequently executed. When no profile information is available we are reduced to using a heuristic for whether the optimization will increase code size; however, it would be useful to also use estimated branch probabilities. The cases I'm looking at use switch cases heavily, where branch probability information is reduced to even probabilities across all cases. We're keen on improving the estimated branch probabilities in switch cases because we cannot always gather profile information on all code paths through a program. Cheers, Rahul
branch probabilities on multiway branches
Hi All, The following bit of code in predict.c implies branch probabilities are strictly evenly distributed for multiway branches at present. The comment suggests it is possible to generate better estimates for more generic cases, apart from being involved. Could anyone point me to the reference and/or if an implementation exists already.

  /* When there is no successor or only one choice, prediction is easy.

     We are lazy for now and predict only basic blocks with two outgoing
     edges.  It is possible to predict generic case too, but we have to
     ignore first match heuristics and do more involved combining.  Implement
     this later.  */
  if (nedges != 2)
    {
      if (!bb->count)
        set_even_probabilities (bb);
      clear_bb_predictions (bb);
      if (dump_file)
        fprintf (dump_file, "%i edges in bb %i predicted to even probabilities\n",
                 nedges, bb->index);
      return;
    }

Many Thanks, Rahul
RE: Failure to combine SHIFT with ZERO_EXTEND
Hi Jeff, Many thanks for the pointers. I will make the changes and attach the patch to the bugzilla soon. Cheers, Rahul

-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: 09 February 2010 00:45
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-gnu
Subject: Re: Failure to combine SHIFT with ZERO_EXTEND

On 02/04/10 08:39, Rahul Kharche wrote:

Hi All, On our private port of GCC 4.4.1 we fail to combine successive SHIFT operations like in the following case

#include <stdlib.h>
#include <stdio.h>

void f1 ()
{
  unsigned short t1;
  unsigned short t2;
  t1 = rand();
  t2 = rand();
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  printf("%d\n", (t1+t2));
}

This is a ZERO_EXTEND problem, because combining SHIFTs with whole integers works correctly, and so do signed values. The problem seems to arise in the RTL combiner, which combines the ZERO_EXTEND with the SHIFT to generate a SHIFT and an AND. Our architecture does not support AND with large constants and hence does not have a matching insn pattern (we prefer not doing this, because large constants remain hanging at the end of all RTL optimisations and cause needless reloads). Fixing the combiner to convert masking AND operations to ZERO_EXTRACT fixes this issue without any obvious regressions. I'm adding the patch here against GCC 4.4.1 for any comments and/or suggestions.

Good catch. However, note we are in a regression-bugfix-only phase of development right now in preparation for branching for GCC 4.5. As a result the patch can't be checked in at this time; I would recommend you update the patch to the current sources and attach it to bug #41998, which contains queued patches for after GCC 4.5 branches.
Cheers, Rahul

--- combine.c	2009-04-01 21:47:37.0 +0100
+++ combine.c	2010-02-04 15:04:41.0 +
@@ -446,6 +446,7 @@
 static void record_truncated_values (rtx *, void *);
 static bool reg_truncated_to_mode (enum machine_mode, const_rtx);
 static rtx gen_lowpart_or_truncate (enum machine_mode, rtx);
+static bool can_zero_extract_p (rtx, rtx, enum machine_mode);

 /* It is not safe to use ordinary gen_lowpart in combine.
@@ -6973,6 +6974,16 @@
				   make_compound_operation (XEXP (x, 0),
							    next_code),
				   i, NULL_RTX, 1, 1, 0, 1);
+      else if (can_zero_extract_p (XEXP (x, 0), XEXP (x, 1), mode))
+	{
+	  unsigned HOST_WIDE_INT len = HOST_BITS_PER_WIDE_INT
+				       - CLZ_HWI (UINTVAL (XEXP (x, 1)));
+	  new_rtx = make_extraction (mode,
+				     make_compound_operation (XEXP (x, 0),
+							      next_code),
+				     0, NULL_RTX, len, 1, 0,
+				     in_code == COMPARE);

There should be a comment prior to this code fragment describing the transformation being performed. Something like:

  /* Convert (and (shift X Y) MASK) into ... when ... */

That will make it clear in the future when your transformation applies, rather than forcing someone scanning the code to read it in detail.

+	}
       break;
@@ -7245,6 +7256,25 @@
   return simplify_gen_unary (TRUNCATE, mode, x, GET_MODE (x));
 }

+static bool
+can_zero_extract_p (rtx x, rtx mask_rtx, enum machine_mode mode)

There should be a comment prior to this function which briefly describes what the function does, the parameters and return value. Use the comments prior to other functions to guide you.

@@ -8957,7 +8987,6 @@
       op0 = UNKNOWN;

   *pop0 = op0;
-
   /* ??? Slightly redundant with the above mask, but not entirely.
      Moving this above means we'd have to sign-extend the mode mask
      for the final test.  */

Useless diff fragment. Remove this change as it's completely unrelated and useless.

You should also write a ChangeLog entry for your patch. ChangeLogs describe what changed, not why something changed.
So a suitable entry might look something like:

date  Your Name  yourem...@somewhere

	* combine.c (make_compound_operation): Convert shifts and masks
	into zero_extract in certain cases.
	(can_zero_extract_p): New function.

If you could make those changes and attach the result to PR 41998 they should be able to go in once 4.5 branches from the mainline. Jeff
Failure to combine SHIFT with ZERO_EXTEND
Hi All, On our private port of GCC 4.4.1 we fail to combine successive SHIFT operations like in the following case

#include <stdlib.h>
#include <stdio.h>

void f1 ()
{
  unsigned short t1;
  unsigned short t2;
  t1 = rand();
  t2 = rand();
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  t1 <<= 1; t2 <<= 1;
  printf("%d\n", (t1+t2));
}

This is a ZERO_EXTEND problem, because combining SHIFTs with whole integers works correctly, and so do signed values. The problem seems to arise in the RTL combiner, which combines the ZERO_EXTEND with the SHIFT to generate a SHIFT and an AND. Our architecture does not support AND with large constants and hence does not have a matching insn pattern (we prefer not doing this, because large constants remain hanging at the end of all RTL optimisations and cause needless reloads). Fixing the combiner to convert masking AND operations to ZERO_EXTRACT fixes this issue without any obvious regressions. I'm adding the patch here against GCC 4.4.1 for any comments and/or suggestions. Cheers, Rahul

--- combine.c	2009-04-01 21:47:37.0 +0100
+++ combine.c	2010-02-04 15:04:41.0 +
@@ -446,6 +446,7 @@
 static void record_truncated_values (rtx *, void *);
 static bool reg_truncated_to_mode (enum machine_mode, const_rtx);
 static rtx gen_lowpart_or_truncate (enum machine_mode, rtx);
+static bool can_zero_extract_p (rtx, rtx, enum machine_mode);

 /* It is not safe to use ordinary gen_lowpart in combine.
@@ -6973,6 +6974,16 @@
				   make_compound_operation (XEXP (x, 0),
							    next_code),
				   i, NULL_RTX, 1, 1, 0, 1);
+      else if (can_zero_extract_p (XEXP (x, 0), XEXP (x, 1), mode))
+	{
+	  unsigned HOST_WIDE_INT len = HOST_BITS_PER_WIDE_INT
+				       - CLZ_HWI (UINTVAL (XEXP (x, 1)));
+	  new_rtx = make_extraction (mode,
+				     make_compound_operation (XEXP (x, 0),
+							      next_code),
+				     0, NULL_RTX, len, 1, 0,
+				     in_code == COMPARE);
+	}
       break;
@@ -7245,6 +7256,25 @@
   return simplify_gen_unary (TRUNCATE, mode, x, GET_MODE (x));
 }

+static bool
+can_zero_extract_p (rtx x, rtx mask_rtx, enum machine_mode mode)
+{
+  unsigned HOST_WIDE_INT count_lz, count_tz;
+  unsigned HOST_WIDE_INT nonzero, mask_all;
+  unsigned HOST_WIDE_INT mask_value = UINTVAL (mask_rtx);
+
+  mask_all = (unsigned HOST_WIDE_INT) -1;
+  nonzero = nonzero_bits (x, mode);
+  count_lz = CLZ_HWI (mask_value);
+  count_tz = CTZ_HWI (mask_value);
+
+  if (count_tz <= (unsigned HOST_WIDE_INT) CTZ_HWI (nonzero)
+      && ((mask_all >> (count_lz + count_tz)) << count_tz) == mask_value)
+    return true;
+
+  return false;
+}
+
 /* See if X can be simplified knowing that we will only refer to it in
    MODE and will only refer to those bits that are nonzero in MASK.  If
    other bits are being computed or if masking operations are done
@@ -8957,7 +8987,6 @@
       op0 = UNKNOWN;

   *pop0 = op0;
-
   /* ??? Slightly redundant with the above mask, but not entirely.
      Moving this above means we'd have to sign-extend the mode mask
      for the final test.  */
RE: RFC PRE-ing REFERENCE expressions
Hi Richard, We would like to hang on to 4.4 for a little while. I was hoping I could grab only the alias analysis improvements from the trunk. Do you suspect this would be troublesome? Cheers, Rahul

-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: 30 October 2009 14:50
To: Rahul Kharche
Cc: rgue...@gcc.gnu.org; gcc@gcc.gnu.org; sdkteam-gnu
Subject: Re: RFC PRE-ing REFERENCE expressions

On Fri, Oct 30, 2009 at 3:22 PM, Rahul Kharche ra...@icerasemi.com wrote:

Hi Richi, Following up your suggestion on PR41488, I am continuing to test with loop header copy before FRE in GCC 4.4.1. One regression I am trying to tackle is from the test case below.

typedef struct { int degree; int c[(256)]; } swbcht_Polynomial;

typedef struct
{
  int m;
  int n;
  const swbcht_Polynomial *primPoly;
  unsigned short alpha[(1 << (13))];
} swbcht_GF;

typedef struct { swbcht_GF gf; } swbcht_Handle;

static const swbcht_Polynomial _primPoly13;

void _ComputeGFLookupTables (swbcht_Handle *bchH, int m)
{
  int i, j;
  swbcht_GF *gf = &bchH->gf;

  gf->m = m;
  gf->n = (1 << m) - 1;
  gf->primPoly = &_primPoly13;

  for (i = 0; i < gf->m; i++)
    {
      gf->alpha[gf->m] ^= (gf->primPoly->c[i] * gf->alpha[i]);
    }
}

The dump after CH -> FRE looks like

_ComputeGFLookupTables (struct swbcht_Handle * bchH, int m)
{
  int i;
  short unsigned int D.1241;
  short int D.1240;
  short int D.1239;
  short unsigned int D.1238;
  short unsigned int D.1237;
  short unsigned int D.1236;
  const int D.1235;
  const struct swbcht_Polynomial * D.1233;
  short int D.1232;
  short unsigned int D.1231;
  int D.1230;
  int D.1229;
  int D.1228;

  bb 2:
  bchH_2(D)->gf.m = m_4(D);
  D.1228_5 = 1 << m_4(D);
  D.1229_6 = D.1228_5 + -1;
  bchH_2(D)->gf.n = D.1229_6;
  bchH_2(D)->gf.primPoly = &_primPoly13;
  if (m_4(D) > 0) goto <bb 5>; else goto <bb 6>;

  bb 6: goto <bb 4>;

  bb 5:

  bb 3:
  # i_35 = PHI <0(5), i_23(7)>
  D.1230_10 = bchH_2(D)->gf.m;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = bchH_2(D)->gf.primPoly;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (i_23 < D.1230_8) goto <bb 7>; else goto <bb 8>;

  bb 7: goto <bb 3>;

  bb 8:

  bb 4:
  return;
}

1. Why does FRE miss eliminating the expression bchH_2(D)->gf.primPoly in bb 3 with &_primPoly13? This is normally the case if we run FRE then CH.

Further, PRE identifies bchH_2(D)->gf.primPoly as a partially redundant expression and hoists it into predecessor bb 7, and we get

  bb 3:
  # i_35 = PHI <0(5), i_23(7)>
  # prephitmp.25_49 = PHI <m_4(D)(5), D.1230_8(7)>
  # prephitmp.27_51 = PHI <&_primPoly13(5), pretmp.26_50(7)>
  D.1230_10 = prephitmp.25_49;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = prephitmp.27_51;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (D.1230_8 > i_23) goto <bb 7>; else goto <bb 8>;

  bb 7:
  pretmp.26_50 = bchH_2(D)->gf.primPoly;
  goto <bb 3>;

Clearly prephitmp.27_51 will make a redundant induction variable. Stepping through insert_into_preds_of_block in tree-ssa-pre.c I can see the value numbers for the expression bchH_2(D)->gf.primPoly available through bb 3 and through bb 2 are different.

2. Is this because alias analysis cannot determine that bchH_2(D)->gf has a unique target?

You should move to GCC 4.5 - the alias-oracle in GCC 4.4 is too weak to discover all this, and virtual operands are not helpful because these are all indirect accesses through pointers without points-to information. Richard.

Many Thanks, Rahul
RFC PRE-ing REFERENCE expressions
Hi Richi, Following up your suggestion on PR41488, I am continuing to test with loop header copy before FRE in GCC 4.4.1. One regression I am trying to tackle is from the test case below.

typedef struct { int degree; int c[(256)]; } swbcht_Polynomial;

typedef struct
{
  int m;
  int n;
  const swbcht_Polynomial *primPoly;
  unsigned short alpha[(1 << (13))];
} swbcht_GF;

typedef struct { swbcht_GF gf; } swbcht_Handle;

static const swbcht_Polynomial _primPoly13;

void _ComputeGFLookupTables (swbcht_Handle *bchH, int m)
{
  int i, j;
  swbcht_GF *gf = &bchH->gf;

  gf->m = m;
  gf->n = (1 << m) - 1;
  gf->primPoly = &_primPoly13;

  for (i = 0; i < gf->m; i++)
    {
      gf->alpha[gf->m] ^= (gf->primPoly->c[i] * gf->alpha[i]);
    }
}

The dump after CH -> FRE looks like

_ComputeGFLookupTables (struct swbcht_Handle * bchH, int m)
{
  int i;
  short unsigned int D.1241;
  short int D.1240;
  short int D.1239;
  short unsigned int D.1238;
  short unsigned int D.1237;
  short unsigned int D.1236;
  const int D.1235;
  const struct swbcht_Polynomial * D.1233;
  short int D.1232;
  short unsigned int D.1231;
  int D.1230;
  int D.1229;
  int D.1228;

  bb 2:
  bchH_2(D)->gf.m = m_4(D);
  D.1228_5 = 1 << m_4(D);
  D.1229_6 = D.1228_5 + -1;
  bchH_2(D)->gf.n = D.1229_6;
  bchH_2(D)->gf.primPoly = &_primPoly13;
  if (m_4(D) > 0) goto <bb 5>; else goto <bb 6>;

  bb 6: goto <bb 4>;

  bb 5:

  bb 3:
  # i_35 = PHI <0(5), i_23(7)>
  D.1230_10 = bchH_2(D)->gf.m;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = bchH_2(D)->gf.primPoly;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (i_23 < D.1230_8) goto <bb 7>; else goto <bb 8>;

  bb 7: goto <bb 3>;

  bb 8:

  bb 4:
  return;
}

1. Why does FRE miss eliminating the expression bchH_2(D)->gf.primPoly in bb 3 with &_primPoly13? This is normally the case if we run FRE then CH.

Further, PRE identifies bchH_2(D)->gf.primPoly as a partially redundant expression and hoists it into predecessor bb 7, and we get

  bb 3:
  # i_35 = PHI <0(5), i_23(7)>
  # prephitmp.25_49 = PHI <m_4(D)(5), D.1230_8(7)>
  # prephitmp.27_51 = PHI <&_primPoly13(5), pretmp.26_50(7)>
  D.1230_10 = prephitmp.25_49;
  D.1231_11 = bchH_2(D)->gf.alpha[D.1230_10];
  D.1232_12 = (short int) D.1231_11;
  D.1233_13 = prephitmp.27_51;
  D.1235_15 = D.1233_13->c[i_35];
  D.1236_16 = (short unsigned int) D.1235_15;
  D.1237_18 = bchH_2(D)->gf.alpha[i_35];
  D.1238_19 = D.1236_16 * D.1237_18;
  D.1239_20 = (short int) D.1238_19;
  D.1240_21 = D.1239_20 ^ D.1232_12;
  D.1241_22 = (short unsigned int) D.1240_21;
  bchH_2(D)->gf.alpha[D.1230_10] = D.1241_22;
  i_23 = i_35 + 1;
  D.1230_8 = bchH_2(D)->gf.m;
  if (D.1230_8 > i_23) goto <bb 7>; else goto <bb 8>;

  bb 7:
  pretmp.26_50 = bchH_2(D)->gf.primPoly;
  goto <bb 3>;

Clearly prephitmp.27_51 will make a redundant induction variable. Stepping through insert_into_preds_of_block in tree-ssa-pre.c I can see the value numbers for the expression bchH_2(D)->gf.primPoly available through bb 3 and through bb 2 are different.

2. Is this because alias analysis cannot determine that bchH_2(D)->gf has a unique target?

Many Thanks, Rahul
RFC: missed loop optimizations from loop induction variable copies
The following causes missed loop optimizations at O2 by creating unnecessary loop induction variables; or rather, it is a case of IVOpts not being able to coalesce copies of induction variables. A previous post related to this was made in PR41026, which had a type-promoted loop index variable copied. I believe this example makes the problem more obvious.

struct struct_t
{
  int* data;
};

void testAutoIncStruct (struct struct_t* sp, int start, int end)
{
  int i;
  for (i = 0; i + start < end; i++)
    {
      sp->data[i + start] = 0;
    }
}

With GCC v4.4.1 (release) and gcc -O2 -fdump-tree-all on the above case we get the following dump from IVOpts

testAutoIncStruct (struct struct_t * sp, int start, int end)
{
  unsigned int D.1283;
  unsigned int D.1284;
  int D.1282;
  unsigned int ivtmp.32;
  int * pretmp.17;
  int i;
  int * D.1245;
  unsigned int D.1244;
  unsigned int D.1243;

  bb 2:
  if (start_3(D) < end_5(D)) goto <bb 3>; else goto <bb 6>;

  bb 3:
  pretmp.17_22 = sp_6(D)->data;
  D.1282_23 = start_3(D) + 1;
  ivtmp.32_25 = (unsigned int) D.1282_23;
  D.1283_27 = (unsigned int) end_5(D);
  D.1284_28 = D.1283_27 + 1;

  bb 4:
  # start_20 = PHI <start_4(5), start_3(D)(3)>
  # ivtmp.32_7 = PHI <ivtmp.32_24(5), ivtmp.32_25(3)>
  D.1243_9 = (unsigned int) start_20;
  D.1244_10 = D.1243_9 * 4;
  D.1245_11 = pretmp.17_22 + D.1244_10;
  *D.1245_11 = 0;
  start_26 = (int) ivtmp.32_7;
  start_4 = start_26;
  ivtmp.32_24 = ivtmp.32_7 + 1;
  if (ivtmp.32_24 != D.1284_28) goto <bb 5>; else goto <bb 6>;

  bb 5: goto <bb 4>;

  bb 6:
  return;
}

IVOpts cannot identify start_26, start_4 and ivtmp.32_7 to be copies. The root cause is that the expression 'i + start' is identified as a common expression between the test in the header and the index operation in the latch. This is unified by copy propagation or FRE prior to the loop optimizations and creates a new induction variable.
If we disable tree copy propagation and FRE with gcc -O2 -fno-tree-copy-prop -fno-tree-fre -fdump-tree-all we get

testAutoIncStruct (struct struct_t * sp, int start, int end)
{
  unsigned int D.1287;
  unsigned int D.1288;
  unsigned int D.1289;
  int D.1290;
  unsigned int D.1284;
  unsigned int D.1285;
  unsigned int D.1286;
  int * pretmp.17;
  int i;
  int * D.1245;
  unsigned int D.1244;
  unsigned int D.1243;
  int D.1242;
  int * D.1241;

  bb 2:
  if (start_3(D) < end_5(D)) goto <bb 3>; else goto <bb 6>;

  bb 3:
  pretmp.17_23 = sp_6(D)->data;
  D.1287_27 = (unsigned int) end_5(D);
  D.1288_28 = (unsigned int) start_3(D);
  D.1289_29 = D.1287_27 - D.1288_28;
  D.1290_30 = (int) D.1289_29;

  bb 4:
  # i_20 = PHI <i_12(5), 0(3)>
  D.1241_7 = pretmp.17_23;
  D.1284_26 = (unsigned int) start_3(D);
  D.1285_25 = (unsigned int) i_20;
  D.1286_24 = D.1284_26 + D.1285_25;
  MEM[base: pretmp.17_23, index: D.1286_24, step: 4] = 0;
  i_12 = i_20 + 1;
  if (i_12 != D.1290_30) goto <bb 5>; else goto <bb 6>;

  bb 5: goto <bb 4>;

  bb 6:
  return;
}

The correct single induction variable has been identified here. This is not a loop-header-copying problem either: if we disable loop header copying, we still get multiple induction variables created. In fact, in the above case loop header copying correctly enables post-increment mode on our port.
testAutoIncStruct (struct struct_t * sp, int start, int end)
{
  unsigned int D.1282;
  unsigned int ivtmp.31;
  unsigned int ivtmp.29;
  int i;
  int * D.1245;
  unsigned int D.1244;
  unsigned int D.1243;
  int D.1242;
  int * D.1241;

  bb 2:
  ivtmp.29_18 = (unsigned int) start_3(D);
  D.1282_21 = (unsigned int) start_3(D);
  ivtmp.31_22 = D.1282_21 * 4;
  goto <bb 4>;

  bb 3:
  D.1241_7 = sp_6(D)->data;
  D.1244_10 = ivtmp.31_19;
  D.1245_11 = D.1241_7 + D.1244_10;
  *D.1245_11 = 0;
  ivtmp.29_17 = ivtmp.29_8 + 1;
  ivtmp.31_20 = ivtmp.31_19 + 4;

  bb 4:
  # ivtmp.29_8 = PHI <ivtmp.29_18(2), ivtmp.29_17(3)>
  # ivtmp.31_19 = PHI <ivtmp.31_22(2), ivtmp.31_20(3)>
  D.1242_23 = (int) ivtmp.29_8;
  if (D.1242_23 < end_5(D)) goto <bb 3>; else goto <bb 5>;

  bb 5:
  return;
}

Does this imply we should try not to copy-propagate or FRE potential induction variables? Or is this simply a missed case in IVOpts? Rahul
RE: [4.5] Find more autoinc addressing for induction variables
Hi, I am trialing this patch on a private GCC port that I'm working on. The patch works well with several test cases we have. However, it fails on the following

int main ()
{
  const int arr[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
  int i;
  for (i = 0; i < 10; i++)
    {
      printf("arr[%d] : %d\n", i, arr[i]);
    }
}

Looking at the pass dump, it seems the SSA re-scan of the symbol 'arr' marks its use inside the loop as volatile

  bb 2:
  arr[0] = 1;
  arr[1] = 2;
  arr[2] = 3;
  arr[3] = 4;
  arr[4] = 5;
  arr[5] = 6;
  arr[6] = 7;
  arr[7] = 8;
  arr[8] = 9;
  arr[9] = 10;
  __builtin_loop_start (1, 9);
  ivtmp.42_23 = (long unsigned int) &arr[1];

  bb 3:
  # i_19 = PHI <i_6(4), 0(2)>
  # prephitmp.14_18 = PHI <pretmp.13_1(4), 1(2)>
  # ivtmp.42_21 = PHI <ivtmp.42_22(4), ivtmp.42_23(2)>
  __builtin_loop_iteration (1);
  printf (&"arr[%d] == var : %d\n"[0], i_19, prephitmp.14_18);
  i_6 = i_19 + 1;
  if (i_6 != 10) goto <bb 4>; else goto <bb 5>;

  bb 4:
  pretmp.13_1 ={v} MEM[index: ivtmp.42_21];
  ivtmp.42_22 = ivtmp.42_21 + 4;
  goto <bb 3>;

  bb 5:
  return;

And subsequently dead code elimination removes the array initialization completely.

  bb 2:
  __builtin_loop_start (1, 9);
  ivtmp.42_23 = (long unsigned int) &arr[1];

  bb 3:
  # i_19 = PHI <i_6(4), 0(2)>
  # prephitmp.14_18 = PHI <pretmp.13_1(4), 1(2)>
  # ivtmp.42_21 = PHI <ivtmp.42_22(4), ivtmp.42_23(2)>
  __builtin_loop_iteration (1);
  printf (&"arr[%d] == var : %d\n"[0], i_19, prephitmp.14_18);
  i_6 = i_19 + 1;
  if (i_6 != 10) goto <bb 4>; else goto <bb 5>;

  bb 4:
  pretmp.13_1 ={v} MEM[index: ivtmp.42_21];
  ivtmp.42_22 = ivtmp.42_21 + 4;
  goto <bb 3>;

  bb 5:
  return;
}

Rahul

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Bernd Schmidt
Sent: 16 May 2009 01:05
To: Steven Bosscher
Cc: GCC Patches; F. Kenneth Zadeck; Zdenek Dvorak
Subject: Re: [4.5] Find more autoinc addressing for induction variables

I wrote: I'll see whether I can achieve a similar effect by modifying tree-ssa-loop-ivopts instead.
Maybe I can add new candidates that get incremented in suitable basic blocks other than just the last one. So here's a draft of what that would look like. For every use/cand pair for which autoinc is possible (which now requires that the use's block is not in a nested loop, and dominates the latch), we create a new candidate which is incremented at the point of the use, and compute its cost without including the cost of the step. This also gets rid of the special handling in iv_ca_recount_cost. That could be extended later to have candidates that are incremented several times, e.g. to create two 16-bit postinc addressing modes for a candidate that's incremented by 4. Does this look like an approach that everyone can live with? Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
RE: [4.5] Find more autoinc addressing for induction variables
4.4.1, GCC 4.4 branch. Will I need dependent changes from the 4.5 branch? Many Thanks, Rahul

-----Original Message-----
From: Bernd Schmidt [mailto:bernds_...@t-online.de]
Sent: 22 May 2009 16:56
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-all
Subject: Re: [4.5] Find more autoinc addressing for induction variables

Hi, Rahul Kharche wrote:

I am trialing this patch on a private GCC port that I'm working on. The patch works well with several test cases we have. However, it fails on the following

int main ()
{
  const int arr[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
  int i;
  for (i = 0; i < 10; i++)
    {
      printf("arr[%d] : %d\n", i, arr[i]);
    }
}

I'm not getting anything that looks like the dump you sent. What version of gcc is your port based on? Bernd
RE: [4.5] Find more autoinc addressing for induction variables
Hi, It's fixed for me. I was missing get_ref_tag () in copy_ref_info () when I patched against 4.5. Thanks again, Rahul

-----Original Message-----
From: Bernd Schmidt [mailto:bernds_...@t-online.de]
Sent: 22 May 2009 16:56
To: Rahul Kharche
Cc: gcc@gcc.gnu.org; sdkteam-all
Subject: Re: [4.5] Find more autoinc addressing for induction variables

Hi, Rahul Kharche wrote:

I am trialing this patch on a private GCC port that I'm working on. The patch works well with several test cases we have. However, it fails on the following

int main ()
{
  const int arr[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
  int i;
  for (i = 0; i < 10; i++)
    {
      printf("arr[%d] : %d\n", i, arr[i]);
    }
}

I'm not getting anything that looks like the dump you sent. What version of gcc is your port based on? Bernd
RE: Constant folding and Constant propagation
- If I patch in this code, I actually get the same results I did before, where the constants are propagated. It seems that in 4.3.2 every part of the compiler is trying to do that. There are at least two forward propagation passes, one before and another after GCSE. I haven't tried to tackle those passes, but I believe they already have a cost model in place. This patch will prevent partial sums involving registers and constants from being combined during GCSE.

- Just for testing I turned off the GCSE pass and it still propagated and folded the constants...

GCSE won't help with your trimmed-down example:

int main (void)
{
  long a = 0xcafecafe;
  printf ("Final: %lx %lx %lx\n", a, a + 5, a + 15);
  return EXIT_SUCCESS;
}

I believe Paolo's canon_reg solution, together with tweaking the rtx_cost of constants with outer code PLUS, might help. Any luck with that pass?
RE: Constant folding and Constant propagation
Hi Jean,

  Do you have a link to that patch? So that I can see if it applies for me?

See patch below. I put in some comments to try and explain what I have attempted to do.

The hash SET generation, to track potential candidates for replacing a register USE, needed some tweaking. This function was not tracking register copies if the associated INSN had a NOTE attached that associated the register with a constant. Also, only the last register or constant in an assignment chain was being added to the hash SET. It may be profitable to have a closer register copy in the assignment chain as a potential replacement.

Before the actual replacing operation, the cost of the tentative replacement is determined using rtx_cost. A cost cannot be reliably formed if we come across jump INSNs because they may alter jumps or successors. The default replacement is used in this case.

Rahul

Index: gcse.c
===================================================================
--- gcse.c	(revision 141156)
+++ gcse.c	(working copy)
@@ -520,7 +520,7 @@
 static void record_set_info (rtx, const_rtx, void *);
 static void compute_sets (void);
 static void hash_scan_insn (rtx, struct hash_table *, int);
-static void hash_scan_set (rtx, rtx, struct hash_table *);
+static void hash_scan_set (rtx, rtx, struct hash_table *, bool);
 static void hash_scan_clobber (rtx, rtx, struct hash_table *);
 static void hash_scan_call (rtx, rtx, struct hash_table *);
 static int want_to_gcse_p (rtx);
@@ -560,7 +560,7 @@
 static void compute_cprop_data (void);
 static void find_used_regs (rtx *, void *);
 static int try_replace_reg (rtx, rtx, rtx);
-static struct expr *find_avail_set (int, rtx);
+static struct expr *find_avail_set (int, rtx, bool);
 static int cprop_jump (basic_block, rtx, rtx, rtx, rtx);
 static void mems_conflict_for_gcse_p (rtx, const_rtx, void *);
 static int load_killed_in_block_p (const_basic_block, int, const_rtx, int);
@@ -1671,11 +1671,11 @@
    expression one).  */

 static void
-hash_scan_set (rtx pat, rtx insn, struct hash_table *table)
+hash_scan_set (rtx pat, rtx insn, struct hash_table *table, bool use_note)
 {
   rtx src = SET_SRC (pat);
   rtx dest = SET_DEST (pat);
-  rtx note;
+  rtx note = NULL_RTX;

   if (GET_CODE (src) == CALL)
     hash_scan_call (src, insn, table);
@@ -1685,16 +1685,21 @@
       unsigned int regno = REGNO (dest);
       rtx tmp;

-      /* See if a REG_NOTE shows this equivalent to a simpler expression.
-	 This allows us to do a single GCSE pass and still eliminate
-	 redundant constants, addresses or other expressions that are
-	 constructed with multiple instructions.  */
-      note = find_reg_equal_equiv_note (insn);
-      if (note != 0
-	  && (table->set_p
-	      ? gcse_constant_p (XEXP (note, 0))
-	      : want_to_gcse_p (XEXP (note, 0))))
-	src = XEXP (note, 0), pat = gen_rtx_SET (VOIDmode, dest, src);
+      /* Do not substitute insn SET_SRC for note if caller passes
+	 use_note = false.  */
+      if (use_note)
+	{
+	  /* See if a REG_NOTE shows this equivalent to a simpler expression.
+	     This allows us to do a single GCSE pass and still eliminate
+	     redundant constants, addresses or other expressions that are
+	     constructed with multiple instructions.  */
+	  note = find_reg_equal_equiv_note (insn);
+	  if (note != 0
+	      && (table->set_p
+		  ? gcse_constant_p (XEXP (note, 0))
+		  : want_to_gcse_p (XEXP (note, 0))))
+	    src = XEXP (note, 0), pat = gen_rtx_SET (VOIDmode, dest, src);
+	}

       /* Only record sets of pseudo-regs in the hash table.  */
       if (! table->set_p
@@ -1835,14 +1840,30 @@
      what's been modified.  */

   if (GET_CODE (pat) == SET)
-    hash_scan_set (pat, insn, table);
+    {
+      /* If we're hashing SETs for constant/copy propagation
+	 also hash a SET ignoring any insn notes.  */
+      if (table->set_p)
+	{
+	  hash_scan_set (pat, insn, table, false);
+	}
+      hash_scan_set (pat, insn, table, true);
+    }
   else if (GET_CODE (pat) == PARALLEL)
     for (i = 0; i < XVECLEN (pat, 0); i++)
       {
	rtx x = XVECEXP (pat, 0, i);

	if (GET_CODE (x) == SET)
-	  hash_scan_set (x, insn, table);
+	  {
+	    /* If we're hashing SETs for constant/copy propagation
+	       also hash a SET ignoring any insn notes.  */
+	    if (table->set_p)
+	      {
+		hash_scan_set (x, insn, table, false);
+	      }
+	    hash_scan_set (x, insn, table, true);
+	  }
	else if (GET_CODE (x) == CLOBBER)
	  hash_scan_clobber (x, insn, table);
	else if (GET_CODE (x) == CALL)
@@ -2071,7 +2092,7 @@
       if (table->set_p
	  && implicit_sets[current_bb->index] != NULL_RTX)
	hash_scan_set (implicit_sets[current_bb->index],
-		       BB_HEAD (current_bb), table);
+
Re: Constant folding and Constant propagation
I am looking at a related problem in GCSE in GCC 4.3, whereby constants are propagated to their uses irrespective of the final cost of generating them (machine cycles or instruction count). Global constant propagation effectively voids common expressions that form large constants, identified by global CSE. This is especially true of SYMBOL_REF constants like the section anchors generated while indexing global arrays. For the GCC port I work on, I have fixed this by weighing the rtx_cost of propagating a register copy against that of propagating the constant into an insn. I have an initial patch for this problem.

Rahul Vijay Kharche