Re: Help understand the may_be_zero field in loop niter information
Hi,

> > I noticed there is below code/comments about the may_be_zero field in
> > loop niter desc:
> >
> >   tree may_be_zero;  /* The boolean expression.  If it evaluates to true,
> >                         the loop will exit in the first iteration (i.e.
> >                         its latch will not be executed), even if the niter
> >                         field says otherwise.  */
> >
> > I had difficulty in understanding this because I ran into some cases
> > in which it didn't behave as said.

actually, in all the examples below, the field behaves as described, i.e.,
the number of iterations = may_be_zero ? 0 : niter;  In particular, the fact
that may_be_zero is false *does not imply* that the number of iterations as
described by niter is non-zero.

> > Example1, the dump of loop before sccp is like:
> >
> >   :
> >   bnd_4 = len_3(D) + 1;
> >
> >   :
> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
> >   _6 = ivtmp_1 + len_3(D);
> >   _7 = a[ivtmp_1];
> >   _8 = b[ivtmp_1];
> >   _9 = _7 + _8;
> >   a[_6] = _9;
> >   ivtmp_11 = ivtmp_1 + 1;
> >   if (bnd_4 > ivtmp_11)
> >     goto ;
> >   else
> >     goto ;
> >
> >   :
> >   goto ;
> >
> > The loop niter information analyzed in sccp is like:
> >
> >   Analyzing # of iterations of loop 1
> >     exit condition [1, + , 1] < len_3(D) + 1
> >     bounds on difference of bases: -1 ... 4294967294
> >     result:
> >       zero if len_3(D) == 4294967295
> >       # of iterations len_3(D), bounded by 4294967294
> >
> > Question1, shouldn't it be like "len_3 + 1 <= 1" because the latch
> > won't be executed when "len_3 == 0", right?

the analysis determines the number of iterations as len_3, that is 0 if
len_3 == 0.  So, the information is computed correctly here.

> > But when the boundary condition is the only case in which the latch gets
> > executed ZERO times, the may_be_zero info will not be computed.  See
> > example2, with dump of loop before sccp like:
> >
> >   foo (int M)
> >
> >   :
> >   if (M_4(D) > 0)
> >     goto ;
> >   else
> >     goto ;
> >
> >   :
> >   return;
> >
> >   :
> >
> >   :
> >   # i_13 = PHI <0(4), i_10(6)>
> >   _5 = i_13 + M_4(D);
> >   _6 = a[i_13];
> >   _7 = b[i_13];
> >   _8 = _6 + _7;
> >   a[_5] = _8;
> >   i_10 = i_13 + 1;
> >   if (M_4(D) > i_10)
> >     goto ;
> >   else
> >     goto ;
> >
> >   :
> >   goto ;
> >
> > The niter information analyzed in sccp is like:
> >
> >   Analyzing # of iterations of loop 1
> >     exit condition [1, + , 1](no_overflow) < M_4(D)
> >     bounds on difference of bases: 0 ... 2147483646
> >     result:
> >       # of iterations (unsigned int) M_4(D) + 4294967295, bounded by
> >       2147483646
> >
> > So may_be_zero is always false here, but the latch may be executed ZERO
> > times when "M_4 == 1".

Again, this is correct, since then ((unsigned int) M_4) + 4294967295 == 0.

> > Starting from Example1, we can create Example3, which makes no sense to
> > me.  Again, the dump of loop is like:
> >
> >   :
> >   bnd_4 = len_3(D) + 1;
> >
> >   :
> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
> >   _6 = ivtmp_1 + len_3(D);
> >   _7 = a[ivtmp_1];
> >   _8 = b[ivtmp_1];
> >   _9 = _7 + _8;
> >   a[_6] = _9;
> >   ivtmp_11 = ivtmp_1 + 4;
> >   if (bnd_4 > ivtmp_11)
> >     goto ;
> >   else
> >     goto ;
> >
> >   :
> >   goto ;
> >
> >   :
> >   return 0;
> >
> > The niter info is like:
> >
> >   Analyzing # of iterations of loop 1
> >     exit condition [4, + , 4] < len_3(D) + 1
> >     bounds on difference of bases: -4 ... 4294967291
> >     result:
> >       under assumptions len_3(D) + 1 <= 4294967292
> >       zero if len_3(D) == 4294967295
> >       # of iterations len_3(D) / 4, bounded by 1073741823
> >
> > The problem is: won't the latch be executed ZERO times when
> > "len_3 == 0/1/2/3"?

Again, in all these cases the number of iterations is len_3 / 4 == 0.

Zdenek
Re: How do I modify SSA and copy basic blocks?
Hi,

> On Tue, 2013-04-23 at 15:24 -0600, Jeff Law wrote:
> > Well, you have to copy the blocks, adjust the edges and rewrite the SSA
> > graph.  I'd use duplicate_block to help.
> >
> > You really want to look at tree-ssa-threadupdate.c.  There's a nice big
> > block comment which gives the high level view of what needs to happen
> > when you copy a block for this kind of optimization.  Feel free to
> > ignore the implementation which has to be fairly efficient when there's
> > a large number of edges to update.
> >
> > Jeff
>
> I think I understand the high level work, it is mapping that high level
> description to the low level calls that I am having trouble with.

I'd suggest looking at gimple_duplicate_sese_region in tree-cfg.c.  It does
not do exactly what you need, but it deals with a somewhat similar
situation,

Zdenek
Re: A question about loop ivopt
Hi,

> > > > Why can't we replace function force_expr_to_var_cost directly with
> > > > function computation_cost in tree-ssa-loop-ivopt.c?
> > > >
> > > > Actually I think it is inaccurate for the current recursive
> > > > algorithm in force_expr_to_var_cost to estimate expr cost.  Instead,
> > > > computation_cost can count some back-end factors and make the
> > > > estimation more accurate.
> > > >
> > > > For example, using computation_cost, we may check whether back-ends
> > > > can tie some mode transformations expressed by SUBREG or not.  If we
> > > > use force_expr_to_var_cost, some more computations around type
> > > > promotion/demotion would increase the cost estimated.
> > > >
> > > > Looking at the algorithm in force_expr_to_var_cost, it just analyzes
> > > > the operators in the expression and gives an estimation.  Shouldn't
> > > > it be the same as expanding EXPR to RTX and estimating like in
> > > > computation_cost?
> > > >
> > > > Any thoughts?
> > >
> > > I suppose Zdenek may remember.
> >
> > computation_cost actually expands the expression to RTL.  Since cost
> > estimates are computed a lot in ivopts, using it directly would require
> > a huge amount of memory,
>
> Zdenek,
>
> Do you know how huge it is?  Any data on this?  For GCC, is this "huge"
> memory consumption critical enough, and isn't there anything else
> consuming even more memory?

no, not really (I haven't worked on this for a few years now), although of
course I did some measurements when ivopts were created.  Feel free to
experiment with that; if it turned out that the memory consumption and
extra time spent by it are negligible, it would be a nice cleanup,

Zdenek
Re: A question about loop ivopt
Hi, > > Why can't we replace function force_expr_to_var_cost directly with function > > computation_cost in tree-ssa-loop-ivopt.c? > > > > Actually I think it is inaccurate for the current recursive algorithm in > > force_expr_to_var_cost to estimate expr cost. Instead computation_cost can > > count some back-end factors and make the estimation more accurate. > > > > For example, using computation_cost, we may check whether back-ends can tie > > some modes transformation expressed by SUBREG or not. If we use > > force_expr_to_var_cost, some more computations around type > > promotion/demotion would increase the cost estimated. > > > > Looking at the algorithm in force_expr_to_var_cost, it is just to analyze > > the operator in the expression and give estimation. Should it be the same as > > expanding EXPR to RTX and give estimation like in computation_cost? > > > > Any thoughts? > > I suppose Zdenek may remember. computation_cost actually expands the expression to RTL. Since cost estimates are computed a lot in ivopts, using it directly would require a huge amount of memory, Zdenek
Re: reverse conditionnal jump
Hi,

> I'm still developing a new private target backend (gcc 4.5.2) and I
> noticed something strange in the assembler generated for a conditional
> jump.
>
> The compiled C source code is:
>
> int funct (int c) {
>   int a;
>   a = 7;
>   if (c < 0)
>     a = 4;
>   return a;
> }
>
> The assembler generated is:
>
> [...]
> mov 7,a
> cmp 0,c       #set the CC status
> jmpif LT .L2  #conditional jump using CC status
> .L1
> ret a         #return to caller
> .L2
> mov 4,a
> jmp .L1       #unconditional jump

this could actually be intentional.  gcc tries to lay out the code so that
forward conditional jumps are usually not taken, and it has a heuristic
that numbers are usually non-negative.  I am just guessing, though; there
is really no way to tell what is happening without debugging the issue.

To check whether this theory could be true, you can try changing the
condition to "if (c > 0)", in which case the code should be generated the
way you prefer,

Zdenek
Re: Simplification of relational operations (was [patch for PR18942])
Hi,

> I'm looking at a missed optimization in combine and it is similar to the
> one you've fixed in PR18942
> (http://thread.gmane.org/gmane.comp.gcc.patches/81504).
>
> I'm trying to make GCC optimize
>
> (leu:SI
>   (plus:SI (reg:SI) (const_int -1))
>   (const_int 1))
>
> into
>
> (leu:SI
>   (reg:SI)
>   (const_int 2))
>
> Your patch for PR18942 handles only EQ/NE comparisons, and I wonder if
> there is a reason not to handle LEU/GEU, LTU/GTU comparisons as well.
> I'm a bit fuzzy on whether signed comparisons can be optimized here as
> well, but I can't see the problem with unsigned comparisons.
>
> Any reason why this optimization would be unsafe?

the reason I handled only EQ/NE patterns is that you do not need to worry
about overflows there.  For ordering predicates, x - 1 <= 1 is not
equivalent to x <= 2 if x = 0 (the former becomes MAX_UNSIGNED_INT <= 1,
which is false),

Zdenek
Re: Loop-iv.c ICEs on subregs
Hi,

> I'm investigating an ICE in loop-iv.c:get_biv_step().  I hope you can
> shed some light on what the correct fix would be.
>
> The ICE happens when processing:
> ==
> (insn 111 (set (reg:SI 304)
>                (plus (subreg:SI (reg:DI 251) 4)
>                      (const_int 1))))
>
> (insn 177 (set (subreg:SI (reg:DI 251))
>                (reg:SI 304)))
> ==
>
> Code like the above does not occur on current mainline early enough for
> loop-iv.c to catch it.  The subregs above are produced by Tom's (CC'ed)
> extension elimination pass (scheduled before fwprop1) which is not in
> mainline yet [*].
>
> The failure is due to an assert in loop-iv.c:get_biv_step():
> ==
> gcc_assert ((*inner_mode == *outer_mode) != (*extend != UNKNOWN));
> ==
> i.e., inner and outer modes can differ iff there's an extend in the
> chain.
>
> Get_biv_step_1() starts with insn 177, then gets to insn 111, then loops
> back to insn 177, at which point it stops and returns GRD_MAYBE_BIV and
> sets:
>
> * outer_mode == DImode == natural mode of (reg A);
>
> * inner_mode == SImode == mode of (subreg (reg A)), set in
>   get_biv_step_1:
> ==
> if (GET_CODE (next) == SUBREG)
>   {
>     enum machine_mode amode = GET_MODE (next);
>
>     if (GET_MODE_SIZE (amode) > GET_MODE_SIZE (*inner_mode))
>       return false;
>
>     *inner_mode = amode;
>     *inner_step = simplify_gen_binary (PLUS, outer_mode,
>                                        *inner_step, *outer_step);
>     *outer_step = const0_rtx;
>     *extend = UNKNOWN;
>   }
> ==
>
> * extend == UNKNOWN as there are no extensions in the chain.
>
> It seems to me that the computations of outer_mode and extend are
> correct; I'm not sure about inner_mode.
>
> Zdenek, what do you think is the right way to handle the above case in
> loop analysis?

loop iv analysis currently does not handle assignments to subregs; we test
for that in iv_get_reaching_def, but the corresponding test is missing in
latch_dominating_def.  I.e., a simple fix is to add

  if (single_rd && (DF_REF_FLAGS (single_rd) & DF_REF_READ_WRITE))
    return false;

at the end of latch_dominating_def.
On the other hand, this means that the loops using the code like above will not get optimized. So, if such a code gets produced consistently for a large fraction of the loops, it would be necessary to teach loop-iv to analyze induction variables represented in subregs. This would mean a more involved rewrite of loop-iv, though, Zdenek
Re: non-algorithmic maintainers
> On Mon, Nov 15, 2010 at 10:00 PM, Paolo Bonzini wrote:
> > We currently have 3 non-algorithmic maintainers:
> >
> > loop optimizer  Zdenek Dvorak  o...@ucw.cz
> > loop optimizer  Daniel Berlin  dber...@dberlin.org
> > libcpp          Tom Tromey     tro...@redhat.com
> >
> > Especially for the loop optimizer, the situation is a bit strange.
> > There are no other maintainers for this specific area (only
> > middle-end/GIMPLE/RTL maintainers); Daniel has never really been
> > active in this area (except for the old lambda code and scev which has
> > separate maintainers); Zdenek, while not very active lately, has
> > basically written the loop optimizer and has always been helpful with
> > reviews.
> >
> > I don't think there is a particular reason to keep this category, whose
> > boundaries have actually never been very clear.
>
> I thought that all non-algorithmic maintainers were demoted to reviewers
> automagically.  It would be indeed nice to get rid of the strange
> category.
>
> Consider demoting yourself voluntarily (by applying a patch moving
> yourself to the reviewers section).

done; I am not likely to use the non-algorithmic maintainer privileges
(whatever they actually used to be) anytime soon, anyway (unfortunately),

Zdenek
Re: Question about Doloop
Hi,

> Doloop optimization fails to be applied on the following inner loop
> when compiling for PowerPC (GCC -r162294) due to:
>
> Doloop: number of iterations too costly to compute.

strength reduction is performed in ivopts, introducing a new variable:

  for (p = inptr; p < something; p += 3)
    ...

So the number of iterations is (something - p) / 3, which doloop considers
too costly to compute.

Zdenek

> I do not understand why, as the number of iterations is max_cols, and I
> would appreciate an explanation.
>
> Thanks,
> Revital
>
> 11  while (--max_rows >= 0)
> 12    {
> 13      inptr = *inbuf++;
> 14      outp = outbuf[0][rows];
> 15      rows++;
> 16
> 17      for (y = 0; y < max_cols; y++)
> 18        {
> 19          k = ((int) (inptr[0]));
> 20          inptr += 3;
> 21
> 22          outp[y] = (unsigned char) ((inarr[k]) >> 16);
> 23        }
> 24    }
>
> From Doloop dump:
>
> Analyzing operand (reg/f:DI 246 [ D.2082 ]) of insn (insn 118 116 119 5
> test1.c:17 (set (reg:CC 272)
>         (compare:CC (reg/v/f:DI 199 [ inptr ])
>             (reg/f:DI 246 [ D.2082 ]))) 535 {*cmpdi_internal1}
>      (expr_list:REG_DEAD (reg/f:DI 246 [ D.2082 ])
>         (nil)))
> invariant (reg/f:DI 246 [ D.2082 ]) (in DI)
> Loop 2 is simple:
>   simple exit 5 -> 6
>   number of iterations:
>     (mult:DI (plus:DI (minus:DI (reg/f:DI 246 [ D.2082 ])
>                                 (reg/v/f:DI 199 [ inptr ]))
>                       (const_int -3 [0xfffd]))
>              (const_int -6148914691236517205 [0xaaab]))
>   upper bound: -1
> Doloop: number of iterations too costly to compute.
>
> (See attached file: test1.c)
Re: Irreducible loops in generated code
Hi, > > I'm working on decompiling x86-64 binary programs, using branches to rebuild > > a control-flow graph and looking for loops. I've found a significant number > > of irreducible loops in gcc-produced code (irreducible loops are loops with > > more than one entry point), especially in -O3 optimized binaries, even when > > the source code is "well" structured. My experiments are done mainly on the > > SPEC CPU-2006 benchmarks. > > > > My question is: where do these irreducible loops come from? Which > > optimization pass leads to irreducible regions? Thanks in advance for any > > pointer. > > Questions right back at you: What compiler version are you using? Got > a specific exampe (test case)? > > In older releases of GCC4, jump threading sometimes resulted in > irreducible regions, but more recent GCCs carefully try to avoid them. I am not sure that is actually true. Afaik, we only avoid this on gimple, while rtl jump threading does not care, Zdenek
Re: A question about loop-unroll
Hi,

> > Is there a way to pass to the unroller the maximum number of iterations
> > of the loop, such that it can decide to avoid unrolling if the maximum
> > number is small?
> >
> > To be more specific, I am referring to the following case:
> > After the vectorizer decides to peel for alignment, it creates three
> > loops:
> > [1] scalar loop - the prologue to align memory access.
> > [2] the vectorized loop
> > [3] scalar loop - the remaining scalar computations.
> >
> > If the unroller does not know the number of iterations at compile time,
> > it unrolls loops with run-time checks in the following way (taken from
> > loop-unroll.c):
> >
> > for (i = 0; i < n; i++)
> >   body;
> >
> > ==>
> >
> > i = 0;
> > mod = n % 4;
> >
> > switch (mod)
> >   {
> >   case 3:
> >     body; i++;
> >   case 2:
> >     body; i++;
> >   case 1:
> >     body; i++;
> >   case 0: ;
> >   }
> >
> > while (i < n)
> >   {
> >     body; i++;
> >     body; i++;
> >     body; i++;
> >     body; i++;
> >   }
> >
> > The vectorizer knows at compile time the maximum number of iterations
> > that will be needed for the prologue and the epilogue.  In some cases
> > it seems there is no need to unroll and create redundant loops.
>
> You can set niter_max in the niter_desc of simple loops.  There is also
> nb_iter_bound for all loops.  Of course the issue is that loop
> information is destroyed sometimes.  It also looks like RTL loop
> analysis may not re-use this information.
>
> Maybe Zdenek knows a better answer.

currently, there is no reliable way how to pass this information to RTL.
The best I can come up with (without a significant amount of changes to
other parts of the compiler) would be to insert code like

  if (n > 5)
    special_abort ();

before the loop in the vectorizer if you know for sure that the loop will
iterate at most 5 times, use these hints to bound the number of iterations
in the unroller (we do not do this at the moment, but it should be easy),
and remove the calls to special_abort and the corresponding branches after
the unroller.

Zdenek
Re: Turning off unrolling to certain loops
Hi,

> I faced a similar issue a while ago.  I filed a bug report
> (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712).  In the end,
> I implemented a simple tree-level unrolling pass in our port
> which uses all the existing infrastructure.  It works quite well for
> our purpose, but I hesitated to submit the patch because it contains
> our not-very-elegant #pragma unroll implementation.

could you please do so anyway?  Even if there are some issues with the
#pragma unroll implementation, it could serve as a basis of a usable
patch.

> /* Perfect unrolling of a loop */
> static void tree_unroll_perfect_loop (struct loop *loop, unsigned factor,
>                                       edge exit)
> {
>   ...
> }
>
> /* Go through all the loops:
>    1. Determine unrolling factor
>    2. Unroll loops in different conditions
>       -- perfect loop: no extra copy of original loop
>       -- other loops: the original version of loops to execute the
>          remaining iterations  */
> static unsigned int rest_of_tree_unroll (void)
> {
>   ...
>   tree niters = number_of_exit_cond_executions (loop);
>
>   bool perfect_unrolling = false;
>   if (niters != NULL_TREE && niters != chrec_dont_know
>       && TREE_CODE (niters) == INTEGER_CST)
>     {
>       int num_iters = tree_low_cst (niters, 1);
>       if ((num_iters % unroll_factor) == 0)
>         perfect_unrolling = true;
>     }
>
>   /* If the no. of iterations can be divided by the unrolling factor, we
>      have perfect unrolling.  */
>   if (perfect_unrolling)
>     tree_unroll_perfect_loop (loop, unroll_factor, single_dom_exit (loop));
>   else
>     tree_unroll_loop (loop, unroll_factor, single_dom_exit (loop), &desc);

It would be better to move this test to tree_unroll_loop, and not duplicate
its code in tree_unroll_perfect_loop.

Zdenek
Re: Turning off unrolling to certain loops
Hi,

> Ok, I've actually gone a different route.  Instead of waiting for the
> middle end to perform this, I've directly modified the parser stage to
> unroll the loop directly there.

I think this is a very bad idea.  First of all, getting the information
needed to decide at this stage whether unrolling is possible at all is
difficult; for instance, what happens for a loop of the form

  for (...)
    {
      something;
label:
      something else;
    }
  ...
  goto label;

?  Are you sure that you handle correctly loops with several exits from
the loop body, loops whose control variables may overflow, exception
handling, ...?  And if so, what is the benefit of having the code to deal
with all these complexities twice in the compiler?

Furthermore, unrolling the loops this early may increase compile time with
little or no gain in code quality.

Zdenek
Re: Turning off unrolling to certain loops
Hi,

> 2) I was using a simple example:
>
> #pragma unroll 2
> for (i = 0; i < 6; i++)
>   {
>     printf ("Hello world\n");
>   }
>
> If I do this, instead of transforming the code into:
>
> for (i = 0; i < 3; i++)
>   {
>     printf ("Hello world\n");
>     printf ("Hello world\n");
>   }
>
> as we could expect, it is transformed into:
>
> for (i = 0; i < 2; i++)
>   {
>     printf ("Hello world\n");
>     printf ("Hello world\n");
>   }
> for (i = 0; i < 2; i++)
>   {
>     printf ("Hello world\n");
>   }
>
> (I am using 4.3.2 currently)
>
> I am using the tree_unroll_loop function to perform the unrolling and
> it seems to always want to keep that epilogue.  Is there a reason for
> this?  Or is this a bug of some sorts?

such an epilogue is needed when the # of iterations is not known at
compile time; it should be fairly easy to modify the unrolling not to
emit it when it is not necessary,

Zdenek
Re: Scev analysing number of loop iterations returns (2^32-1) instead of -1
> On Wed, Oct 7, 2009 at 7:21 PM, Tobias Grosser > wrote: > > On Wed, 2009-10-07 at 18:30 +0200, Tobias Grosser wrote: > >> On Wed, 2009-10-07 at 17:44 +0200, Richard Guenther wrote: > >> > On Wed, Oct 7, 2009 at 5:35 PM, Tobias Grosser > >> > wrote: > >> > > On Wed, 2009-10-07 at 17:23 +0200, Richard Guenther wrote: > >> > >> On Wed, Oct 7, 2009 at 5:16 PM, Tobias Grosser > >> > >> wrote: > >> > >> > On Wed, 2009-10-07 at 16:42 +0200, Richard Guenther wrote: > >> > >> >> On Wed, Oct 7, 2009 at 4:37 PM, Tobias Grosser > >> > >> >> wrote: > >> > >> >> > I try to analyse this code: > >> > >> >> > -- > >> > >> >> > int foo (int N) > >> > >> >> > { > >> > >> >> > int ftab[257]; > >> > >> >> > int i, j; > >> > >> >> > > >> > >> >> > for (i = 0; i < N - 7488645; i++) > >> > >> >> > j = ftab [i]; > >> > >> >> > > >> > >> >> > return j; > >> > >> >> > } > >> > >> >> > -- > >> > >> >> > > >> > >> >> > The number of iterations I get is: > >> > >> >> > > >> > >> >> > (unsigned int) N_5(D) + 0x0 > >> > >> >> > > >> > >> >> > However I expect it to be > >> > >> >> > > >> > >> >> > (unsigned int) N_5(D) + (-1) > >> > >> >> > >> > >> >> No, that would be (unsigned int) (N_5(D) + -1) instead. > >> > >> >> > >> > >> >> It's fold that canonicalizes this to the above form - you > >> > >> >> simply have to deal with it (unsigned arithmetic, that is). > >> > >> > > >> > >> > OK. So I need to understand this better. > >> > >> > > >> > >> > E.g: > >> > >> > > >> > >> > int foo (int N) > >> > >> > { > >> > >> > int ftab[257]; > >> > >> > int i, j; > >> > >> > > >> > >> > for (i = 0; i < N - 50; i++) > >> > >> > j = ftab [i]; > >> > >> > > >> > >> > return j; > >> > >> > } > >> > >> > > >> > >> > Number of latch executions: (unsigned int) N_5(D) + 0x0ffcd > >> > >> > > >> > >> > What happens if N == 5? I would expect the number of latch > >> > >> > iterations to > >> > >> > be 0 as i < 5 - 50 is always false. 
However using the upper
> >> > expression I get something like
> >> >
> >> >   5 + 0x0ffcd == 0x0ffd2
> >> >
> >> > which is a lot bigger than 0.
> >>
> >> It's undefined if N == 5 because the loop counter would overflow.
> >
> > Why?
> >
> > The loop
> >
> >   for (i = 0; i < 5 - 50; i++)
> >
> > is equivalent to
> >
> >   for (i = 0; i < -45; i++)
> >
> > which just evaluates to false and will not be executed.  How can the
> > loop counter overflow?
>
> Ah, indeed.  Sorry for being confused.  Is tree-niter-desc->assumptions
> or ->may_be_zero non-NULL?

Yes both.  I attached the gdb content for both.

> (gdb) call debug_generic_expr (ndesc.assumptions)
> 1
> (gdb) call debug_generic_expr (ndesc.may_be_zero)
> 0
> (gdb) call debug_tree (ndesc.assumptions)
> constant
> 1>
> (gdb) call debug_tree (ndesc.may_be_zero)
> constant
> 0>
>
> So it seems the "niter" expression should contain the correct solution
> for every N.  The cases where "niter" is not valid are not fulfilled
> following the description in tree-flow.h.

Yes, I would think so.  Maybe Zdenek knows more.

this is because the header of the loop was copied; i.e., the code looks
like

  if (N >= 50)
    {
      the loop
    }

so # of iterations analysis knows that N >= 50, and hence the number of
iterations is N-50,

Zdenek
Re: Scev analysing number of loop iterations returns (2^32-1) instead of -1
Hi, > > Ah, indeed. Sorry for being confused. Is tree-niter-desc->assumptions > > or ->may_be_zero non-NULL? > > Yes both. I attached the gdb content for both. you need to check may_be_zero, which in your case should contain something like N <= 49. If this condition is true, the number of iterations of the loop is zero. Please also check the comments in tree-flow.h, where the possible outcomes of # of iteration analysis are explained, Zdenek
Re: Turning off unrolling to certain loops
Hi, > I was wondering if it was possible to turn off the unrolling to > certain loops. Basically, I'd like the compiler not to consider > certain loops for unrolling but fail to see how exactly I can achieve > that. > > I've traced the unrolling code to multiple places in the code (I'm > working with the 4.3.2 version) and, for the moment, I'm trying to > figure out if I can add something in the loop such as a note that I > can later find in the FOR_EACH_LOOP loops in order to turn the > unrolling for that particular loop off. > > Have you got any ideas of what I could use like "note" or even a > better idea all together? the normal way of doing this is using a #pragma. For instance, icc implements #pragma nounroll for this purpose. I have some prototype implementation for gcc, I can send it to you if you were interested in finishing it, Zdenek
Re: RFC: missed loop optimizations from loop induction variable copies
Hi,

> IVOpts cannot identify start_26, start_4 and ivtmp_32_7 to be copies.
> The root cause is that the expression 'i + start' is identified as a
> common expression between the test in the header and the index operation
> in the latch.  This is unified by copy propagation or FRE prior to loop
> optimizations and creates a new induction variable.
>
> Does this imply we should try not to copy propagate or FRE potential
> induction variables?  Or is this simply a missed case in IVOpts?

IIRC, at some point maybe a year or two ago Sebastian worked on enhancing
scev to analyze such induction variables (thus enabling IVopts to handle
them).  But it seems the code did not make it to mainline,

Zdenek
Re: M32C vs PR tree-optimization/39233
Hi, > Can we somehow make this fix contingent on ports that have suitable > integral modes? yes; however, maybe it would be easier to wait till Richard finishes the work on not representing the overflow semantics in types (assuming that's going to happen say in a few weeks?), which should make the fix unnecessary, Zdenek
Re: New no-undefined-overflow branch
Hi, > > I obviously thought about this. The issue with using a flag is > > that there is no convenient place to stick it and that it makes > > the distinction between the two variants less visible. Consider > > the folding routines that take split trees for a start. > > > > IMHO using new tree-codes is much less complicated than carrying > > around flags. I did consider putting flags on the tree code > > itself, but that isn't going to make the changes easier either. > > I think it's probably a far safer design too. > > If you suddenly introduce a new semantic for an existing code, suddenly > every current check for that tree code becomes a potential bug site that has > to be manually inspected to see if it's overflow-sensitive or not and > therefore whether it needs to test the new flag or not; people who don't know > about the new flag will carry on adding code that processes those tree codes > without knowing to test the flag; and the bugs that do arise will be hard to > find and result in silent bad codegen. > > If OTOH you add new tree codes, the meaning will be explicit, nobody using > them can be under any misapprehension about the overflow semantics, nobody can > forget to check the flag, and any places where they should be handled but > aren't will very quickly draw themselves to our attention by ICEing, or if > they don't will simply fall back to the safe option of not doing any > processing on a tree code that they don't recognize; that might lead to > pessimal code, but it shouldn't generate incorrect code. you are way too optimistic. Regardless of whether you introduce the new codes or represent the new information in another way, the fact is that this changes semantics of PLUS_EXPR in some cases, and it will lead to a wave of bugs from places that you forgot to update or updated incorrectly. 
Also, adding new codes usually will not lead to ICEs on places that do not handle them, just to missed optimizations (most places that handle expressions have some kind of "default" -- if this is not any of the operations that we handle specially, do something generic). Using special codes is in no way safer than having an extra flag -- even with the extra flag, the default behavior for the newly created operations would be that they may wrap, so there is really no difference, Zdenek
Re: New no-undefined-overflow branch
Hi,

> > introducing new codes seems like a bad idea to me.  There are many
> > places that do not care about the distinction between PLUS_EXPR and
> > PLUSV_EXPR, and handling both cases will complicate the code (see e.g.
> > the problems caused by introducing the POINTER_PLUS_EXPR vs PLUS_EXPR
> > distinction).  Why not just use a flag to mark the operation as
> > non-overflowing?
>
> I obviously thought about this.  The issue with using a flag is
> that there is no convenient place to stick it and that it makes
> the distinction between the two variants less visible.  Consider
> the folding routines that take split trees for a start.
>
> IMHO using new tree-codes is much less complicated than carrying
> around flags.  I did consider putting flags on the tree code
> itself, but that isn't going to make the changes easier either.

OK, then what about this: introduce accessor functions like

  tree_code get_operation_semantics (tree_code)
    -- returns PLUS_EXPR for PLUS_EXPR and PLUSNV_EXPR, etc.

  bool get_operation_overflow (tree_code)
    -- obvious

  tree_code operation_code (tree_code, bool)
    -- PLUS_EXPR, false -> PLUS_EXPR
    -- PLUS_EXPR, true -> PLUSNV_EXPR
    -- PLUSNV_EXPR, * -> abort
    etc.

(possibly with more clear names), and change the code to always use them?

Zdenek
Re: New no-undefined-overflow branch
Hi, in general, I like this proposal a lot. However, > As a start there will be no-overflow variants of NEGATE_EXPR, > PLUS_EXPR, MINUS_EXPR, MULT_EXPR and POINTER_PLUS_EXPR. > > The sizetypes will simply be operated on in no-overflow variants > by default (by size_binop and friends). > > Naming suggestions welcome, at the moment I consider NEGATEV_EXPR > (thus appending V to the first part). introducing new codes seems like a bad idea to me. There are many places that do not care about the distinction between PLUS_EXPR and PLUSV_EXPR, and handling both cases will complicate the code (see eg. the problems caused by introducing POINTER_PLUS_EXPR vs PLUS_EXPR distinction). Why not just use a flag to mark the operation as non-overflowing? Zdenek
Re: Warning when compiling dwarf2out.c with -ftree-parallelize-loops=4
Hi,

> As far as I get it, there is no real failure here.
> Parloop, unaware of the array's upper bound, inserts the 'enough
> iterations' condition (i > 400-1), and thereby makes the last iteration
> range from 400 upwards.  VRP now has a constant it can compare to the
> array's upper bound.  Correct programs that do not exceed the array
> bound will now fail because VRP is not aware of the fact that the
> parallel code (and the last iteration it is looking at) is actually not
> executed in these cases.
>
> I'm not sure how to proceed, in order to avoid this warning.
> Any ideas?

I would propose changing the test

  expected_loop_iterations (loop) <= n_threads

in tree-parloops.c to

  ((nit = estimated_loop_iterations_int (loop, false)) >= 0
   && nit < MIN_PER_THREAD * n_threads)

which should prevent the parallelization for this loop, as we should be
able to derive that the loop iterates at most 4 times due to the array
access.

Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi,

> > but you only take the hash of the argument of the phi node (i.e.,
> > the ssa name), not the computations it is based on
>
> Is this something like what you had in mind?
>
> gen_hash (stmt)
> {
>   if (stmt == NULL)
>     return 0;
>
>   use_operand_p use_p;
>   ssa_op_iter iter;
>   val += get_changed_stmt_hash (stmt);
>   FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
>     {
>       tree use = USE_FROM_PTR (use_p);
>       val += get_stmt_hash (use);
>       val += gen_hash (SSA_NAME_DEF_STMT (use));
>     }
> }
> else
>   val += generate_phi_node_hashcode (stmt);
>
> return val;
> }
>
> and one calls this in continuation to my earlier email by
>
> arg_p = PHI_ARG_DEF_PTR (phi_node, 0);
> op0 = USE_FROM_PTR (arg_p);
> val = iterative_hash_expr (op0, val);
> if (TREE_CODE (op0) == SSA_NAME)
>   {
>     val = iterative_hash_expr (SSA_NAME_VAR (op0), val);
>     val += SSA_NAME_VERSION (op0);
>
>     val += gen_hash (SSA_NAME_DEF_STMT (op0));

otoh, there seem to be at least four problems with this:

1) The fact that the statement is identical does not mean that it also
   returns the same value (calls, memory accesses, ...).  I guess one
   could argue that if FVR succeeded, we should not run into such
   problems, which is a bit shaky reasoning, but might actually be the
   case.
2) You would be hashing everything reachable through the uses, which
   as written will lead to an infinite loop in any program with a
   cycle.  Even if this were fixed, hashing (potentially) the whole
   program would be extremely slow.
3) Two different expressions may have the same hash value.
4) SSA_NAME_DEF_STMT (op0) may have been removed, so this would be
   accessing garbage.

Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi,

> >> >> So if the ssa names are in fact reused, they won't be the same
> >> >> computations.
> >> >
> >> > do you also check this for ssa names inside the loop (in your
> >> > example, D.10_1)?
> >>
> >> If we have to reinsert for a = phi (B), we do the following checks.
> >>
> >> 1. If the edge information in the phi node has remained the same,
> >> then we don't insert the phi node back in.  This check is done by
> >> applying iterative_hash_expr to the phi node's PHI_ARG_DEF_PTR.
> >>
> >> Something like this is done for checking the consistency of hash
> >> values:
> >>
> >> arg_p = PHI_ARG_DEF_PTR (phi_node, 0);
> >> op0 = USE_FROM_PTR (arg_p);
> >> val = iterative_hash_expr (op0, val);
> >> if (TREE_CODE (op0) == SSA_NAME)
> >>   {
> >>     val = iterative_hash_expr (SSA_NAME_VAR (op0), val);
> >>     val += SSA_NAME_VERSION (op0);
> >>   }
> >>
> >> I suspect that this should catch the case that you mention here.
> >
> > now you have lost me completely -- this does not seem to check
> > anything regarding the computations inside the loop, just about the
> > phi node,
>
> Well, since the phi node with a single operand depends on statements /
> operations within the loop, the hash codes computed initially and
> finally need to match.  If the computations within the loop change,
> then clearly the hash values at the time of storing and restoring
> would be different.  So if the assignment to D.10_1 has changed in
> some form, the hash values would be different.

but you only take the hash of the argument of the phi node (i.e., the
ssa name), not the computations it is based on,

Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi, > [Sorry about dropping the ball on this. I've had some trouble with > internet connectivity and was on vacation for a few days. ] > > On Thu, Oct 2, 2008 at 2:56 AM, Zdenek Dvorak <[EMAIL PROTECTED]> wrote: > > Hi, > > > >> >> b) If any PHI node has count zero it can be inserted back and its > >> >> corresponding computations removed, iff the argument of the > >> >> PHI node > >> >> still exists as an SSA variable. This means that we can insert > >> >> a_1 = PHI if D.10_1 still exists and hasnt been removed by > >> >> any of the passes between the scalar evolution pass and the > >> >> loopdone pass. > >> > > >> > this does not work: > >> > -- we reuse ssa names, so it can happen that the argument of the PHI node > >> > is eliminated, then reused for a different purpose > >> > >> I wasn't sure if from the proposal strong enough to catch this case ? i.e. > >> if > >> > >> > >> So if the ssa_names are infact reused they won't be the same > >> computations. > > > > do you also check this for ssa names inside the loop (in your example, > > D.10_1? > > If we have to reinsert for a = phi (B) . We do the following checks. > > 1. If the edge information in the phi node has remained the same then > we don't insert the phi node back in. This check is done by running > the phi node applying iterative_hash_expr on its PHI_ARG_DEF_PTR . > > Something like this is done for checking the consistency of hash values. > > arg_p = PHI_ARG_DEF_PTR (phi_node , 0); > op0 = USE_FROM_PTR (arg_p); > val = iterative_hash_expr (op0, val); > if (TREE_CODE (op0) == SSA_NAME) > { > val = iterative_hash_expr (SSA_NAME_VAR (op0), val); > val += SSA_NAME_VERSION (op0); > > } > > I suspect that this should catch the case that you mention here. now you have lost me completely -- this does not seem to check anything regarding the computations inside the loop, just about the phi node, Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi, > > I would disagree on that. Whether a final value replacement is > > profitable or not largely depends on whether it makes further > > optimization of the loop possible or not; this makes it difficult > > to find a good cost model. I think undoing FVR is a good approach > > to solve this problem (unfortunately, the proposed implementation > > does not work), > > Ok, fair enough. Ideally we would then be able to retain the PHI nodes > and somehow record an equivalency in the IL from which we later could > remove either of the definitions. Something like > > def_1 = PHI < ... > > > def_2 = compute > > def_3 = EQUIV > (def_3 = ASSERT_EXPR ?) > > much similar to REG_EQUAL notes. This means that both def_1 and def_2 > are conditionally dead if the EQUIV is the only remaining use. > > No idea if this is feasible and useful enough in general though. > > Do you remember what kind of missed optimizations you saw (apart from > missed dead loop removal)? vectorization and linear loop transformations did not like values used outside of the loop; I am not sure whether (our implementation of) graphite handles them or not, Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi,

> >> b) If any PHI node has count zero, it can be inserted back and its
> >> corresponding computations removed, iff the argument of the PHI
> >> node still exists as an SSA variable.  This means that we can
> >> insert a_1 = PHI if D.10_1 still exists and hasn't been removed by
> >> any of the passes between the scalar evolution pass and the
> >> loopdone pass.
> >
> > this does not work:
> > -- we reuse ssa names, so it can happen that the argument of the
> > PHI node is eliminated, then reused for a different purpose
>
> I wasn't sure if the proposal was strong enough to catch this case?
> i.e. if
>
> So if the ssa names are in fact reused, they won't be the same
> computations.

do you also check this for ssa names inside the loop (in your example,
D.10_1)?

> > -- in case more complex loop transformations were performed
> > (e.g., loop reversal), the final value of the ssa name might have
> > changed.
>
> Could you give an example for this?

  for (i = 100; i > 0; i--)
    a[i] = i;

transformed to

  for (i = 1; i <= 100; i++)
    a[i] = i;

the final value of i was originally 0, now it is 101.

> Is there anything else you might suggest in terms of undoing the
> transformations from scalar cprop?

I would probably try to somehow pass the information from scev analysis
to value numbering, and let PRE take care of the issue,

Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi, > On Wed, Oct 1, 2008 at 3:59 PM, Richard Guenther > <[EMAIL PROTECTED]> wrote: > > On Wed, Oct 1, 2008 at 3:22 PM, Ramana Radhakrishnan <[EMAIL PROTECTED]> > > wrote: > >> Hi , > >> > >> Based on the conversation in the thread at > >> http://gcc.gnu.org/ml/gcc/2008-03/msg00513.html , we've tried to get a > >> pass trying to undo final value replacement going. The initial > >> implementation was done by Pranav Bhandarkar when he was employed at > >> Azingo as part of work sponsored by Icera Semiconductor. I've been > >> trying to get this working with my private port over here. We intend > >> to contribute this back once our copyright assignments are sorted and > >> if this will be acceptable to all. I've been getting a few compile > >> time ICEs with this approach and haven't been able to resolve them > >> well enough yet. Whilst doing so, I wanted to check on the approach as > >> outlined below and ask if there's anything that we might have missed > >> or any problem that one can see with us going along these lines. > >> Thanks for your time and patience. > > > > Some quick comments. First, do you have a non-pseudo-code testcase > > that exposes the extra computations? Second, I think rather than > > trying to undo what SCEV const(!)-prop is doing adjust its cost > > model (maybe there isn't one) to not create the costly substitutions. > > Indeed the comment on scev_const_prop says > > "Also perform final value replacement in loops, >in case the replacement expressions are cheap." > > but no such check for cheapness is done. sorry for the leftover comment -- there used to be a test for the cost of the computation, but it caused so many (missed optimization) problems that I removed it in the end, Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi, > > Based on the conversation in the thread at > > http://gcc.gnu.org/ml/gcc/2008-03/msg00513.html , we've tried to get a > > pass trying to undo final value replacement going. The initial > > implementation was done by Pranav Bhandarkar when he was employed at > > Azingo as part of work sponsored by Icera Semiconductor. I've been > > trying to get this working with my private port over here. We intend > > to contribute this back once our copyright assignments are sorted and > > if this will be acceptable to all. I've been getting a few compile > > time ICEs with this approach and haven't been able to resolve them > > well enough yet. Whilst doing so, I wanted to check on the approach as > > outlined below and ask if there's anything that we might have missed > > or any problem that one can see with us going along these lines. > > Thanks for your time and patience. > > Some quick comments. First, do you have a non-pseudo-code testcase > that exposes the extra computations? Second, I think rather than > trying to undo what SCEV const(!)-prop is doing adjust its cost > model (maybe there isn't one) to not create the costly substitutions. I would disagree on that. Whether a final value replacement is profitable or not largely depends on whether it makes further optimization of the loop possible or not; this makes it difficult to find a good cost model. I think undoing FVR is a good approach to solve this problem (unfortunately, the proposed implementation does not work), Zdenek
Re: query regarding adding a pass to undo final value replacement.
Hi,

> b) If any PHI node has count zero, it can be inserted back and its
> corresponding computations removed, iff the argument of the PHI node
> still exists as an SSA variable.  This means that we can insert
> a_1 = PHI if D.10_1 still exists and hasn't been removed by any of
> the passes between the scalar evolution pass and the loopdone pass.

this does not work:

-- we reuse ssa names, so it can happen that the argument of the PHI
   node is eliminated, then reused for a different purpose
-- in case more complex loop transformations were performed
   (e.g., loop reversal), the final value of the ssa name might have
   changed.

Zdenek
Re: Using cfglayout mode in the selective scheduler
Hi, > > I am probably missing something: > > > >> The basic idea is enabling cfglayout mode and then ensuring that insn > >> stream and control flow are in sync with each other at all times. This > >> is required because e.g. on Itanium the final bundling happens right > >> after scheduling, and any extra jumps emitted by cfg_layout_finalize > >> will spoil the schedule. > > > > what is the difference between this idea and the cfgrtl mode? > > In cfgrtl mode, the functions to manipulate the cfg ensure that the > insn stream and the CFG match. For cfglayout mode you have to do it by > hand. I must say that I do not like this idea at all. As cfgrtl mode routines show, this is nontrivial (and consequently, error-prone), and you would be duplicating much of cfgrtl functionality. I fail to see the advantage over simply using cfgrtl mode and handling the created jump insns (by checking the last insn of altered and newly created blocks, no changes to cfgrtl routines or new hooks necessary), Zdenek
Re: Using cfglayout mode in the selective scheduler
Hi, I am probably missing something: > The basic idea is enabling cfglayout mode and then ensuring that insn > stream and control flow are in sync with each other at all times. This > is required because e.g. on Itanium the final bundling happens right > after scheduling, and any extra jumps emitted by cfg_layout_finalize > will spoil the schedule. what is the difference between this idea and the cfgrtl mode? Zdenek
Re: Out of ssa form only for some basic block
Hi, > I'm trying to add a simple function to the callgraph using > cgraph_add_new_function() ( new function body is obtained by function > actually processed) . > I put my pass in pass_tree_loop.sub as first pass just after > pass_tree_loop_init pass, but I have some problems because the code > that I put into the new function ( using move_sese_region_to_fn() ) is > already in ssa form. > Unfortunately I have to use gcc version 4.2.2 and so I think that the > only solution is to put the code out of the ssa form. if at all possible, try to use the mainline for new development (it will usually make your life much easier, and it will also make it possible to contribute any bugfixes/improvements you may make during your project, whatever it is). That being said, if for whatever reason you cannot affect this decision, have a look at parloop branch (tree-outof-ssa.c:go_out_of_ssa and its users) -- the early version of loop parallelization had to solve the same problem, Zdenek
Re: Moving statements from one BB to other BB.
Hi, > > > To clarify what Richard means, your assertion that "you have updated > > > SSA information" is false. > > > If you had updated the SSA information, the error would not occur :). > > > > > > How exactly are you updating the ssa information? > > > > > > The general way to update SSA for this case would be: > > > > > > For each statement you have moved: > > > Call update_stmt (t); > > > > > > Then call update_ssa (TODO_update_ssa) (or instead use > > > rewrite_into_loop_closed_ssa if this is a loop pass). > > > > > > If you do not call update_stmt in this case, update_ssa won't actually > > > do anything. > > > > actually, it will not do anything even if he calls update_stmt, as the > > statements that he is moving are already in ssa form, so update_stmt > > will not mark anything for renaming. > > > You are partially right (and right where it matters, that it won't fix > the problem) > That said, update_stmt on ssa form statements will at least reset the > vuse/vdef back to original variables for renaming. I don't think so; as long as you do not introduce new vops (which you should not by just moving the statements), update_stmt is a no-op on vops too (unless something has changed since the last time I needed this), Zdenek
Re: Moving statements from one BB to other BB.
Hi, > To clarify what Richard means, your assertion that "you have updated > SSA information" is false. > If you had updated the SSA information, the error would not occur :). > > How exactly are you updating the ssa information? > > The general way to update SSA for this case would be: > > For each statement you have moved: > Call update_stmt (t); > > Then call update_ssa (TODO_update_ssa) (or instead use > rewrite_into_loop_closed_ssa if this is a loop pass). > > If you do not call update_stmt in this case, update_ssa won't actually > do anything. actually, it will not do anything even if he calls update_stmt, as the statements that he is moving are already in ssa form, so update_stmt will not mark anything for renaming. IIRC what he tries to do is loop fusion, and according to the error message that he gets, he probably needs to add the calculations of the induction variable(s) of the original loop to the new one, and replace their uses (or maybe just move phi nodes), Zdenek
Re: Fusing two loops
Hi,

> The error is rectified.  The bug is in the function that calls
> fuse_loops().  Now I am trying to transfer all the statements, using
> code -
>
> /* The following function fuses two loops.  */
>
> void
> fuse_loops (struct loop *loop_a, struct loop *loop_b)
> {
>   debug_loop (loop_a, 10);
>   debug_loop (loop_b, 10);
>   block_stmt_iterator bsi_a = bsi_start (loop_a->header);
>   block_stmt_iterator bsi_a_last = bsi_last (loop_a->header);
>   block_stmt_iterator bsi_b = bsi_last (loop_b->header);
>   while (&bsi_a != &bsi_a_last)
>     {
>       bsi_move_before (&bsi_a, &bsi_b);
>       fprintf (stderr, " transferred one statement from loop %d to "
>                "loop %d ", loop_a->num, loop_b->num);
>       bsi_next (&bsi_a);
>     }

try

  bsi_b = bsi_last (loop_b->header);
  for (bsi_a = bsi_start (loop_a->header); !bsi_end_p (bsi_a); )
    {
      if (some condition) /* you probably want to skip labels and cond_exprs */
        bsi_move_before (&bsi_a, &bsi_b);
      else
        bsi_next (&bsi_a);
    }

Zdenek
Re: Fusing two loops
Hi, > I am trying to fuse two loops in tree level. For that, I am trying to > transfer statements in the header of one loop to the header of the > other one. > The code " http://rafb.net/p/fha0IG57.html " contains the 2 loops. > After moving a statement from one BB to another BB, do I need to > update any other info? you will need to update SSA form; other than that, just using bsi_move_after to move the relevant statements should work. > I need to transfer all the statements of bb_6 to bb_3. Can any one > please tell me what is the exact procedure. > Can i simply use add_bb_to_loop() and add bb_6 to loop_1 ? No; in the case that you describe, moving the statements one by one is probably the simplest way (you could also move whole basic block, but it is more complicated, and since you still need to update SSA form, you would need to process the statements anyway). Zdenek
Re: [PATCH][RFC] Statistics "infrastructure"
Hi,

> > > A statistics event consists of a function (optional), a statement
> > > (optional) and the counter ID.  I converted the counters from
> > > tree-ssa-propagate.c as an example; instead of
> > >
> > >   prop_stats.num_copy_prop++;
> > >
> > > you now write
> > >
> > >   statistics_add ("copy propagations");
> > >
> > > (function and statement omitted; you might prefer #defines for
> > > strings that you use multiple times).
> >
> > it would perhaps be better to use #defines with integer values?
> > Also, it would be more consistent to have statistics.def similar to
> > timevar.def for this.  It would make creation of new counters a bit
> > more difficult, but on the other hand, it would make it possible to
> > classify the counters (by type of the counted operation/its
> > expensiveness/...),
>
> The difficulty of adding new counters is exactly why I didn't go down
> that route.  I expect this to be mainly used for experimentation,
> where it is IMHO inconvenient to go the .def route.

I thought of it more as an aid in debugging performance problems, as
in, checking the dumps without introducing new statistics counters; in
which case, having some description of what the counters mean and the
metadata from the .def file would be useful.  On the other hand, I
agree that for the purpose that you suggest avoiding .def is better.
Perhaps we could require that all the statistics strings are #defined
and documented (and of course you can ignore this rule for the counters
that you use for experimentation)?

Zdenek
Re: [PATCH][RFC] Statistics "infrastructure"
Hi, > This is an attempt to provide (pass) statistics collection. The > goal is to provide infrastructure to handle the current (pass specific) > statistics dumping that is done per function and per pass along the > regular tree/rtl dumps as well as to allow CU wide "fancy" analysis. > > The most important aspect I think is simplicity to use it and especially > add new "counters". Thus, we simply associate a counter with a string ID. > > The patch is a rough implementation of the current features of > pass specific statistics plus a global "log" with statistics events. > In the end you can do any postprocessing / globbing / summing of > such global log. > > A statistics event consists of a function (optional), a statement > (optional) and the counter ID. I converted the counters from > tree-ssa-propagate.c as an example, instead of > > prop_stats.num_copy_prop++; > > you now write > > statistics_add ("copy propagations"); > > (function and statement omitted, you might prefer #defines for strings > that you use multiple times). it would perhaps be better to use #defines with integer values? Also, it would be more consistent to have statistics.def similar to timevar.def for this. It would make creation of new counters a bit more difficult, but on the other hand, it would make it possible to classify the counters (by type of the counted operation/its expensiveness/...), Zdenek
Re: [tuples] gimple_assign_subcode for GIMPLE_SINGLE_RHS
Hi,

> On 03/10/08 08:24, Richard Guenther wrote:
>
> > You could either do
> >
> > GIMPLE_ASSIGN
>
> But 'cond' would be an unflattened tree expression.  I'm trying to
> avoid that.
>
> > or invent COND_GT_EXPR, COND_GE_EXPR, etc. (at least in GIMPLE
> > we always have a comparison in COND_EXPR_COND, never a plain
> > boolean variable).
>
> Yeah, that would mean adding 5 more tree codes, though.  Unless we
> gave up and invented a new set of subcodes exclusively for gimple.
> That seems like a waste, since tree.def neatly defines all the
> subcodes we want already.

another possibility would be to represent

  a = b < c ? d : e   as   GIMPLE_ASSIGN (LT_EXPR, a, b, c, d, e)

and

  a = (b < c)         as   GIMPLE_ASSIGN (LT_EXPR, a, b, c, true, false)

Zdenek
Re: The effects of closed loop SSA and Scalar Evolution Const Prop.
Hi,

> Now tree scalar evolution goes over PHI nodes and realises that
> aligned_src_35 has a scalar evolution {aligned_src_22 + 16, +, 16}_1,
> where aligned_src_22 is (const long int *) src0_12(D), i.e. the
> original src pointer.  Therefore, to calculate aligned_src_62 before
> the second loop, computations are introduced based on aligned_src_22.
>
> My question is, shouldn't scalar evolution ignore PHI nodes with one
> argument (implying a copy)

no, it should not (scev_cprop only deals with phi nodes with one
argument).

> or if not, at least pay heed to the cost of additional computations
> introduced.

Yes, it should, in some form; however, it would probably not help this
testcase anyway, as computing x + 16 * y is too cheap.  Final value
replacement is often profitable even if it introduces some additional
computation, as performing it may make other loop optimizations
(vectorization, loop nest optimizations) possible.

One solution would be to add a pass that would replace the computations
with final values in a loop, undoing this transformation, after the
mentioned optimizations are performed (FRE could do this if value
numbering were strong enough, but that might not be feasible).

Zdenek
Re: [tuples] gimple_assign_subcode for GIMPLE_SINGLE_RHS
Hi, > On 3/9/08 3:24 PM, Zdenek Dvorak wrote: > > >however, it would make things simpler. Now, we need to distiguish > >three cases -- SINGLE, UNARY and BINARY; if we pretended that > >GIMPLE_COPY is an unary operator, this would be reduced just > >to UNARY and BINARY. Of course, GIMPLE_COPY would never be used > >in a tree expression. > > That's not the problem, the problem is in functions like > extract_ops_from_tree. extract_ops_from_tree would return GIMPLE_COPY as subcode and the whole expression as op1, where's the problem? > I need to introduce GIMPLE_TERNARY_RHS (for ASSERT_EXPR) and > GIMPLE_QUATERNARY_RHS (for COND_EXPR), How are you going to represent the COND_EXPRs (i.e., where are you going to put their comparison operator)? > so we will have at least four > classes to deal with in the future. I could add GIMPLE_COPY then, > unless you want to work on it now. I think it can wait till then, Zdenek
Re: [tuples] gimple_assign_subcode for GIMPLE_SINGLE_RHS
Hi, > >>So, what about adding a GIMPLE_COPY code? The code would have 0 > >>operands and used only for its numeric value. > > > >another possibility would be to make GIMPLE_COPY an unary operator, and > >get rid of the SINGLE_RHS case altogether (of course, unlike any other > >unary operator, it would not require its operand to be gimple_val, so > >special handling might still be necessary at some places. but it might > >be less cumbersome), > > Remember that GIMPLE_COPY *never* needs to be built as a tree node. > When gimplifying something MODIFY_EXPR , CONST_INT> we do > not need to create a GIMPLE_COPY tree node. > > We merely need to set the subcode for GIMPLE_ASSIGN to be GIMPLE_COPY: > > GIMPLE_ASSIGN , CONST_INT> > > So, making it a one operand operator is not really necessary. however, it would make things simpler. Now, we need to distiguish three cases -- SINGLE, UNARY and BINARY; if we pretended that GIMPLE_COPY is an unary operator, this would be reduced just to UNARY and BINARY. Of course, GIMPLE_COPY would never be used in a tree expression. Zdenek
Re: [tuples] gimple_assign_subcode for GIMPLE_SINGLE_RHS
Hi, > So, what about adding a GIMPLE_COPY code? The code would have 0 > operands and used only for its numeric value. another possibility would be to make GIMPLE_COPY an unary operator, and get rid of the SINGLE_RHS case altogether (of course, unlike any other unary operator, it would not require its operand to be gimple_val, so special handling might still be necessary at some places. but it might be less cumbersome), Zdenek
Re: [tuples] gimple_assign_subcode for GIMPLE_SINGLE_RHS
Hi,

> On Sun, Mar 9, 2008 at 2:17 PM, Diego Novillo <[EMAIL PROTECTED]> wrote:
> > On Sun, Mar 9, 2008 at 08:15, Richard Guenther
> > <[EMAIL PROTECTED]> wrote:
> >
> > > What is GIMPLE_SINGLE_RHS after all?
> >
> > Represents a "copy" operation, an operand with no operator (e.g.,
> > a = 3, b = c)
> >
> > '3' and 'c' are "single" operands.  There is no operator involved
> > in the assignment.
>
> So as opposed to a unary operation which would look exactly the same
> apart from having a subcode?  So what does gimple_assign_subcode ()
> return for the GIMPLE_SINGLE_RHS case?  Some random garbage?

just now, it will return the TREE_CODE of the rhs of the assignment
(e.g. VAR_DECL and INTEGER_CST in the cases mentioned above).  The
problem we discuss in this thread is that the code is actually the
TREE_CODE of the rhs at the moment the statement was built; so after
ssa form creation, the gimple_assign_subcode of b_1 = c_2 will still be
VAR_DECL; and after constant propagation of c_2 = 4, the
gimple_assign_subcode of b_1 = 4 will still be VAR_DECL.  While this is
essentially harmless (only the fact that the code is in the
GIMPLE_SINGLE_RHS class is important for the semantics), it seems
likely to cause some confusion,

Zdenek
[tuples] gimple_assign_subcode for GIMPLE_SINGLE_RHS
Hi,

I just noticed an error in a part of the code that I converted, that
looks this way:

  switch (gimple_assign_subcode (stmt))
    {
    case SSA_NAME:
      handle_ssa_name ();
      break;

    case PLUS_EXPR:
      handle_plus ();
      break;

    default:
      something ();
    }

The problem of course is that for GIMPLE_SINGLE_RHS, we do not maintain
the invariant that

  gimple_assign_subcode (stmt) == TREE_CODE (gimple_assign_rhs1 (stmt)),

so gimple_assign_subcode typically will not be SSA_NAME, but VAR_DECL.
Enforcing this invariant might be hard and probably asking for more
trouble than it is worth.  However, perhaps it would make sense to use
some special tree code to indicate GIMPLE_SINGLE_RHS, in order to avoid
confusion?

Zdenek
Re: GCC loop optimizations
Hi,

> I'd like to know your experiences with the gcc loop optimizations.
>
> What loop optimizations (in your opinion) can be applied to a large
> number of programs and yield a (significant) improvement of the
> program run-time?

in general, I would say invariant motion, load/store motion, strength
reduction and loop unrolling.  Most of the other loop optimizations
tend to apply only in special situations.  Note however that the
usefulness also heavily depends on the type of the compiled code; e.g.,
a wide class of programs (simulations, optimization) spend most of
their time doing simple calculations on large arrays, and there other
optimizations (loop interchange/reordering/..., prefetching) become
important.

> I've tested the gcc compiler optimization "loop unswitching"
> (introduced in gcc 3.4) but going through a large number of different
> benchmarks I could find very few cases where the optimization could
> be applied.  Do you share my experience?

yes, the situations where loop unswitching can be applied are not very
common.

Zdenek
Re: Assigning a value to a temp var in GIMPLE SSA
Hi,

> I'm trying to add a simple statement to GIMPLE code adding a new
> pass, that I put in pass_tree_loop.sub as last pass just before the
> pass_tree_loop_done pass.  Just as a test I'd like to add a call
> like:
>
> .palp = shmalloc (16);
>
> This is the code I'm using:
>
> t = build_function_type_list (ptr_type_node,
>                               integer_type_node, NULL_TREE);
> decl = build_fn_decl ("shmalloc", t);
> t = build_int_cst (integer_type_node, 16);
> args = tree_cons (NULL, t, NULL);
> t = build_function_call_expr (decl, args);
> TREE_SIDE_EFFECTS (t) = 1;
>
> palpstruct = create_tmp_var (ptr_type_node, ".palp");
> add_referenced_var (palpstruct);
> t1 = build2 (MODIFY_EXPR, ptr_type_node, palpstruct, t);

depending on what version of gcc you use, you might need to use

  t1 = build_gimple_modify_stmt (palpstruct, t);

> t2 = make_ssa_name (palpstruct, NULL);

  t2 = make_ssa_name (palpstruct, t1);

> DECL_ARTIFICIAL (t1) = 1;
> TREE_SIDE_EFFECTS (t1) = 1;
> DECL_EXTERNAL (t1) = 1;

remove these three statements (t1 is not a decl, and setting
TREE_SIDE_EFFECTS of t1 is not necessary).

> TREE_OPERAND (t1, 0) = t2;
> bsi_insert_after (&si, t1, BSI_CONTINUE_LINKING);
>
> If I try to add just the shmalloc(16); statement it works, but if I
> add the code to assign the result to a temporary variable, gcc
> segfaults while executing the vrp2 pass in tree-vrp.c.  Could someone
> point me to the right direction?  What am I doing wrong?

It may also turn out to be necessary to ensure that virtual operands
are updated,

Zdenek
Re: [tuples] tree-tailcall.c
Hi,

> Zdenek, you committed changes to tree-tailcall.c but you didn't fully
> convert the file.  Was that a mis-commit?  The file does not compile
> and uses PHI_RESULT instead of gimple_phi_result.

the file compiles for me; it indeed uses PHI_RESULT, but since that is
equivalent to DEF_FROM_PTR (gimple_phi_result_ptr (PHI)), it does not
cause any problems (I was not aware that using gimple_phi_result is
preferable; I will fix that).

Zdenek
Re: [tuples] Call for help converting passes
Hi, > Everything else should work well enough for passes to be converted. > If anyone has some free cycles and are willing to put up with various > broken bits, would you be willing to help converting passes? There is > a list of the passes that need conversion in the tuples wiki > (http://gcc.gnu.org/wiki/tuples). Passes that have already been > converted or are being converted are marked. I will take care of the loop optimizer passes. Zdenek
Re: Tree-SSA and POST_INC address mode inompatible in GCC4?
Hi, > >> I believe that this is something new and is most likely fallout from > >> diego's reworking of the tree to rtl converter. > >> > >> To fix this will require a round of copy propagation, most likely in > >> concert with some induction variable detection, since the most > >> profitable place for this will be in loops. > >> > >> I wonder if any of this effects the rtl level induction variable > >> discovery? > >> > > > > it should not (iv analysis is able to deal with this kind of ivs). > > > does the iv analysis canonize them in a way that we should perhaps > consider moving the auto-inc detection after the iv analysis? no, iv analysis does not change the program; also, since the code in this particular example is not in any loop, iv analysis is somewhat irrelevant for it. Btw. I would have actually expected this code to be folded to *a_3(D) = D.1543_2; a_4 = a_3(D) + 1; b_5 = b_1(D) + 1; D.1543_6 = *b_5; *a_4 = D.1543_6; a_7 = a_3 + 2; b_8 = b_1 + 2; D.1543_9 = *b_8; *a_7 = D.1543_9; a_10 = a_3 + 3; b_11 = b_1 + 3; D.1543_12 = *b_11; *a_10 = D.1543_12; a_13 = a_3 + 4; b_14 = b_1 + 4; D.1543_15 = *b_14; *a_13 = D.1543_15; etc.; I am fairly sure we used to do this. Zdenek
Re: Tree-SSA and POST_INC address mode inompatible in GCC4?
Hi, > I believe that this is something new and is most likely fallout from > diego's reworking of the tree to rtl converter. > > To fix this will require a round of copy propagation, most likely in > concert with some induction variable detection, since the most > profitable place for this will be in loops. > > I wonder if any of this effects the rtl level induction variable > discovery? it should not (iv analysis is able to deal with this kind of ivs). Zdenek > > Hi, Ramana, > > I tried the trunk version with/without your patch. It still produces > > the same code as gcc4.2.2 does. In auto-inc-dec.c, the comments say > > > > *a > >... > >a <- a + c > > > > becomes > > > >*(a += c) post > > > > But the problem is after Tree-SSA pass, there is no > >a <- a + c > > But something like > >a_1 <- a + c > > > > Unless the auto-inc-dec.c can reverse a_1 <- a + c to a <- a + c. I > > don't see this transformation is applicable in most scenarios. Any > > comments? > > > > Cheers, > > Bingfeng > >
Re: optimising recursive functions
Hi, > > > So I am guessing the Felix version is lucky there are > > > no gratuitous temporaries to be saved when this happens, > > > and the C code is unlucky and there are. > > > > > > Maybe someone who knows how the optimiser works can comment? > > > > One problem with departing from the ABI even on a local level > > like this is that it wipes out lots of tools that depend on > > ABI compliance for the entire call chain. I suspect the overall > > gain is too small to be worth this hit. > > If you make the function static then gcc can chose ABI-incompatible > calling conventions. alternatively, you can use -fwhole-program to get the same effect. Zdenek
Re: problem with iv folding
Hi, > traceback, tt, and ops follow. Why is this going wrong? > [ gdb ] call debug_tree(arg0) > type > [ gdb ] call debug_tree(arg1) > type
Re: From SSA back to GIMPLE.
Dear Mr. Pizarro, > Is not it easy to write 3 stages GENERIC->GIMPLE->RTL instead of 5 stages? > > Is meaningful the optimization of the complex bi-transformation > GIMPLE->SSA->GIMPLE? > > Is more powerful GENERIC->GIMPLE->RTL + "trial-and-error" local optimization? > >Sincerely, J.C. Pizarro everyone else here is too polite to tell you this, but could you please shut up, until: -- you learn at least the basics of English grammar (so that we can actually understand what you are saying), and -- you learn at least something about gcc and compilers in general (so that what you say makes some sense)? While I was mildly annoyed by your previous "contributions" to the discussion on the gcc mailing list, I could tolerate those. But answering a seriously meant question of a beginner with this confusing and completely irrelevant drivel is another thing. Sincerely, Zdenek Dvorak
Re: Question on GGC
Hello, > I have several global variables which are of type rtx. They are used > in flow.c ia64.c and final.c. As stated in the internal doc with > types. I add GTY(()) marker after the keyword 'extern'. for example: > extern GTY(()) rtx a; > these 'extern's are added in regs.h which is included in flow.c ia64.c > and final.c > > However, I init 'a' at ia64_compute_frame which is defined in ia64.c > but found 'a' incorrectly collected by ggc_collect. (I watch the > memory location which is allocated for a, and found it is collected by > GGC. > > Is there any thing I forget to do ? you need to add regs.h to GTFILES in Makefile.in. Zdenek
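A sketch of the change being described follows. The surrounding entries in GTFILES differ between gcc versions, so this fragment is only illustrative (the `...` stands for the existing list); the point is that any header with GTY(()) markers must be listed so that gengtype generates GGC roots for it.

```make
# gcc/Makefile.in -- illustrative fragment only.  Files containing
# GTY(()) markers must appear in GTFILES so that gengtype sees them
# and registers their variables as garbage-collection roots.
GTFILES = ... \
  $(srcdir)/regs.h \
  ...
```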
Re: Re[2]: [GSoC: DDG export][RFC] Current status
Hello, > An important missing piece is correction of exported information for > loop unrolling. As far as I can tell, for loop unrolled by factor N we > need to clone MEM_ORIG_EXPRs and datarefs for newly-created MEMs, create > no-dependence DDRs for those pairs, for which original DDR was > no-dependence, and for DDRs with recorded distance vectors we will be > able to recompute new distances from old distance(D) and N (but only if > N % D == 0 || D % N == 0). Is that right, Zdenek? yes, that looks correct to me. >Where should I start to implement this? In loop-unroll.c, look for apply_opt_in_copies; that is used to post-process the unrolled loop body. Zdenek
Re: question about rtl loop-iv analysis
Hello, > And finally at the stage of rtl unrolling it looks like this: > [6] r186 = r2 + C; > r318 = r186 + 160; > loop: > r186 = r186 + 16 > if (r186 != r318) then goto loop else exit > > Then, in loop-unroll.c we call iv_number_of_iterations, which eventually > calls iv_analyze_biv (with r258/r186), which in turn calls > latch_dominating_def. > There, when processing the vectorized case, it encounters the def in the > loop ('r186+16'), and then the def outside the loop ('r2+C'), at which > point it fails cause it found two defs (and so we get that this is not a > simple iv, and not a simple loop, and unrolling fails: "Unable to prove > that the loop iterates constant times"). > When processing the unvectorized case, we also first encounter the def in > the loop ('r258+16'), and then the def outside the loop ('0'), but this > def > succeeds the test "if (!bitmapset (bb_info->out,))", and so we don't > fail when we encounter the second def, and all is well. > > So one question I have is what is that bitmap exactly, and why does loop > [6] fail rtl iv-analysis? > > > >>> the bitmap bb_info->out should contain reaching definitions at the end > >>> of the latch block; so the definition of r186 outside of the loop > >>> should not be contained in it. This seems to be a dataflow analysis > >>> problem. > >>> > >>> > >> It could be, but it is hard to see how. There is nothing magic about > >> the code, i goes over the set of blocks looking for defs and that is > >> what it uses. Each call to df_set_blocks resets the table of defs that > >> can be considered. > >> > >> In particular, you need to first verify that you are passing exactly the > >> correct set of blocks to df_set_blocks. > >> > > > > regardless of the set of blocks considered, there is no way how both the > > definition outside of the loop and the definition inside the loop could > > both be reaching definitions at the same point in the program. > > > sure it can. if the def is not a killing def, i.e. 
the def is a partial > or conditional, then the prior defs reach right in. that obviously is not the case here, though. Do you (or someone else responsible for df) have time to have a look at this problem? Otherwise, we may discuss it forever, but we will not get anywhere. Zdenek
Re: question about rtl loop-iv analysis
Hello, > >> And finally at the stage of rtl unrolling it looks like this: > >> [6] r186 = r2 + C; > >> r318 = r186 + 160; > >> loop: > >> r186 = r186 + 16 > >> if (r186 != r318) then goto loop else exit > >> > >> Then, in loop-unroll.c we call iv_number_of_iterations, which eventually > >> calls iv_analyze_biv (with r258/r186), which in turn calls > >> latch_dominating_def. > >> There, when processing the vectorized case, it encounters the def in the > >> loop ('r186+16'), and then the def outside the loop ('r2+C'), at which > >> point it fails cause it found two defs (and so we get that this is not a > >> simple iv, and not a simple loop, and unrolling fails: "Unable to prove > >> that the loop iterates constant times"). > >> When processing the unvectorized case, we also first encounter the def in > >> the loop ('r258+16'), and then the def outside the loop ('0'), but this def > >> succeeds the test "if (!bitmapset (bb_info->out,))", and so we don't > >> fail when we encounter the second def, and all is well. > >> > >> So one question I have is what is that bitmap exactly, and why does loop > >> [6] fail rtl iv-analysis? > >> > > > > the bitmap bb_info->out should contain reaching definitions at the end > > of the latch block; so the definition of r186 outside of the loop > > should not be contained in it. This seems to be a dataflow analysis > > problem. > > > It could be, but it is hard to see how. There is nothing magic about > the code, i goes over the set of blocks looking for defs and that is > what it uses. Each call to df_set_blocks resets the table of defs that > can be considered. > > In particular, you need to first verify that you are passing exactly the > correct set of blocks to df_set_blocks. regardless of the set of blocks considered, there is no way how both the definition outside of the loop and the definition inside the loop could both be reaching definitions at the same point in the program. Zdenek
Re: question about rtl loop-iv analysis
Hello, > And finally at the stage of rtl unrolling it looks like this: > [6] r186 = r2 + C; > r318 = r186 + 160; > loop: > r186 = r186 + 16 > if (r186 != r318) then goto loop else exit > > Then, in loop-unroll.c we call iv_number_of_iterations, which eventually > calls iv_analyze_biv (with r258/r186), which in turn calls > latch_dominating_def. > There, when processing the vectorized case, it encounters the def in the > loop ('r186+16'), and then the def outside the loop ('r2+C'), at which > point it fails cause it found two defs (and so we get that this is not a > simple iv, and not a simple loop, and unrolling fails: "Unable to prove > that the loop iterates constant times"). > When processing the unvectorized case, we also first encounter the def in > the loop ('r258+16'), and then the def outside the loop ('0'), but this def > succeeds the test "if (!bitmapset (bb_info->out,))", and so we don't > fail when we encounter the second def, and all is well. > > So one question I have is what is that bitmap exactly, and why does loop > [6] fail rtl iv-analysis? the bitmap bb_info->out should contain reaching definitions at the end of the latch block; so the definition of r186 outside of the loop should not be contained in it. This seems to be a dataflow analysis problem. Zdenek
Re: GCC 4.3.0 Status Report (2007-08-09)
Hello, > > Are there any folks out there who have projects for Stage 1 or Stage 2 > > that they are having trouble getting reviewed? Any comments > > re. timing for Stage 3? > > Zadeck has the parloop branch patches, which I've been reviewing. I am > not sure how many other patches are left, but at least a couple. Zdenek > are the remaining patches submitted already? I have one in my review > list, but I don't know if there are others. I could go over them next week. not yet, I just returned from vacation and I should send the remaining two or three patches for the parloop branch merge this week. Zdenek
Re: RFC: Rename Non-Autpoiesis maintainers category
Hello, > I liked the idea of 'Reviewers' more than any of the other options. > I would like to go with this patch, unless we find a much better > option? How about cancelling this category of maintainers completely? I guess this was probably discussed before (I am too lazy to check), but the existence of non-self-approving maintainers does not make much sense to me. It definitely is a good idea to have someone other than you read your patch before committing it, but this general guideline should apply to everyone, not just to a special group of maintainers (and IMHO sending your patch to the mailing list and waiting for people to comment on it before committing it is enough to achieve this). Let me state my reasons for disliking the existence of this category of maintainers. We have non-global maintainers for the following reasons: 1) to reduce the workload of the global maintainers, who would otherwise have to approve every single patch, and to do so, understand thoroughly every single component of the compiler. 2) to speed up the rate at which changes can be made to some specific components, by allowing the local maintainers to commit them without waiting for an external review. 3) to indicate which persons are responsible for bugs/features of the specific component. I will not mention this reason in further discussion, as it does not seem to be all that important (there are other ways to share this information). Item 2) obviously does not apply to non-self-approving maintainers. Also, it does not make sense to have only one non-self-approving maintainer for some component, as most of the patches for the component would usually be written by him/her and would have to be approved by a global maintainer, thus making 1) inapplicable as well. So the only situation in which having non-self-approving maintainers makes at least some sense is when you have several co-maintainers for one component.
However, in this case they usually work together and would discuss any significant changes to the component anyway, so making them as non-self-approving does not really change anything under normal circumstances. That being the case, I would prefer not to place formal restrictions on their maintainership and let them work out whatever mode of cooperation suits them the best (if it ever happened that co-maintainers of some component will not be willing to work together and cooperate, no artificial restriction will make this work, anyway). And finally, we will avoid the need to find weird words to describe this category of maintainers :-) Zdenek > Thanks. > > > > Index: MAINTAINERS > === > --- MAINTAINERS (revision 126951) > +++ MAINTAINERS (working copy) > @@ -231,7 +231,7 @@ > maintainers need approval to check in algorithmic changes or changes > outside of the parts of the compiler they maintain. > > - Non-Autopoiesis Maintainers > + Reviewers > > dataflow Daniel Berlin [EMAIL PROTECTED] > dataflow Paolo Bonzini [EMAIL PROTECTED] > @@ -251,10 +251,9 @@ > FortranPaul Thomas [EMAIL PROTECTED] > > > -Note that individuals who maintain parts of the compiler as > -non-autopoiesis maintainers need approval changes outside of the parts > -of the compiler they maintain and also need approval for their own > -patches. > +Note that individuals who maintain parts of the compiler as reviewers > +need approval changes outside of the parts of the compiler they > +maintain and also need approval for their own patches. > > Write After Approval(last name alphabetical order) >
Re: Loop optimizations cheatsheet
Hello, > Can you send out your presentation too? the slides and the example code are at http://kam.mff.cuni.cz/~rakdver/slides-gcc2007.pdf http://kam.mff.cuni.cz/~rakdver/diff_reverse.diff Zdenek
Loop optimizations cheatsheet
Hello, you can find the cheatsheet I used during my loop optimizations tutorial on gccsummit at http://kam.mff.cuni.cz/~rakdver/loopcheat.ps Zdenek
Re: Re[2]: [GSoC: DDG export][RFC] Current status
Hello, > Testing on tree-vectorizer testsuite and some of the GCC source files > showed that frequent source of apparent loss of exported information > were passes that performed basic block reordering or jump threading. > The verifier asserted that number of loops was constant and the order > they are visited by FOR_EACH_LOOP remained the same throughout the > compilation. you definitely cannot rely on that, the number and order of loops may change due to many different optimizations (for example those you mention, but also others). > Zdenek, could you please clarify whether and how your LoopPreserving > project is going to influence the mentioned problems? Once the patches for the project are merged, the information about loops will be preserved across most of the gimple optimizations; i.e., the loops may still be removed or their order may be changed by different optimizations, but any info attached to "struct loop" corresponding to any given loop will survive (some amount of work may still be needed to keep it correct through some transformations, but in particular for ddg this should not be an issue, as it should stay conservatively correct after almost all optimizations). However, I do not preserve loops on RTL, in particular during tree->rtl expansion; this would need some more work. > That said, I do not see the mentioned concerns as serious problems, > because the verifier can be made more permissive, and one can search > exported data references in all loops. Even if order and count of > loop changes, if we are able to find data dependency relation for two > given MEMs later, it will still make sense, with this notable > exception: if RTL loop unrolling (or reversal) is performed, We do not do loop reversal on RTL. > then the > found ddr will contain incorrect distance (because the increment of > the index changed). Is that correct? What would be the preferred > approach: fixing up the exported data, or discarding the irrelevant > information?
Loop optimizations definitely need to know about ddg; it should be fairly easy to keep it up-to-date (invalidating the affected part is also a possibility; however, the loops we unroll are most likely exactly the loops for which we later need a better ddg in scheduling, so this would be a bit counterproductive). > I mentioned before that I would need to take care of basic block > duplication on tree level, but I have seen no example of that > happening after iv-opts so far. Does anyone know offhand whether it > is possible? Some code duplication may happen in jump threading, performed in dom and vrp, both of which are also run after ivopts; they probably do not have that many opportunities to do so at that time, but you cannot rely on that. Zdenek
Re: Does unrolling prevents doloop optimizations?
Hello, > > > It doesn't seem that the number of iterations analysis from loop-iv.c > deals > > > with EQ closing branches. > > > > loop-iv works just fine for EQ closing branches. > > > > Thanks for the clarification (I didn't see EQ in iv_number_of_iterations's > switch (cond)). that is because we canonicalize the condition before that to a NE condition. Zdenek
Re: Does unrolling prevents doloop optimizations?
ribes the number of iterations of the loop. */ > Index: modulo-sched.c > === > --- modulo-sched.c (revision 840) > +++ modulo-sched.c (working copy) > @@ -288,7 +288,7 @@ > if (!INSN_P (PREV_INSN (insn))) > return NULL_RTX; > > - condition = doloop_condition_get (insn); > + condition = sms_condition_get (insn); > if (! condition) > return NULL_RTX; > > > > > > > On 6/30/07, Zdenek Dvorak <[EMAIL PROTECTED]> wrote: > > > >Hello, > > > >> It doesn't seem that the number of iterations analysis from > >loop-iv.cdeals > >> with EQ closing branches. > > > >loop-iv works just fine for EQ closing branches. > > > >Zdenek > > > >> One option is for sms to use > >> doloop_condition_get/loop-iv analysis in their current form, and if > >failed > >> check (on our own) for reversed doloop patterns. Another is to try and > >> reverse the closing branch before calling doloop_condition_get/loop-iv > >> analysis; any suitable branch reversal facility available? > >> > >> Ayal. > >
Re: Does unrolling prevents doloop optimizations?
Hello, > It doesn't seem that the number of iterations analysis from loop-iv.c deals > with EQ closing branches. loop-iv works just fine for EQ closing branches. Zdenek > One option is for sms to use > doloop_condition_get/loop-iv analysis in their current form, and if failed > check (on our own) for reversed doloop patterns. Another is to try and > reverse the closing branch before calling doloop_condition_get/loop-iv > analysis; any suitable branch reversal facility available? > > Ayal.
Re: Does unrolling prevents doloop optimizations?
Hello, > By "this change" I mean just commenting out the check in > doloop_condition_get. After applying the patch that introduced DOLOOP > patterns for SPU (http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01470.html) > we needed this hack in order to be able to use the doloop_condition_get to > return the register decremented by the branch instruction for any unrolled > loop (The unroller changed originally GE loops to EQ ). Where can this check > be required? Note that we did not touched the similar check in > doloop_modify. We tested this on our SPU branch and saw no regressions. Hmm, I see now that modulo-sched.c:doloop_register_get uses doloop_condition_get, which is why this may affect something. Anyway, changing doloop_condition_get is wrong. Just teach modulo-sched to use the number of iterations analysis from loop-iv.c instead of the doloop_register_get hack, Zdenek
Re: Does unrolling prevents doloop optimizations?
Hello, > We did just this change > (on top of SMS patches we intend to send in the next > couple of weeks) and it did helped a lot - we succeeded on all unrolled > loops that could not be SMSed before. Do you think it is safe to remove this > check? There was no regressions found in our testing. For note, we did our > work on gcc 4.1 based branch. what change? Could you please send me the patch? If you mean just removing the check in doloop_condition_get, that is definitely wrong and cannot change anything (at least on mainline, I am not sure about 4.1 branch -- but anyway, you cannot submit new changes for 4.1). Zdenek > Thanks, > Vladimir > > On 6/12/07, Zdenek Dvorak <[EMAIL PROTECTED]> wrote: > > > >Hello, > > > >> To make sure I understood you correctly, does it mean that the change > >> (below in /* */) in doloop_condition_get is safe? > >> > >> /* We expect a GE or NE comparison with 0 or 1. */ > >> if (/*(GET_CODE (condition) != GE > >> && GET_CODE (condition) != NE) > >> ||*/ (XEXP (condition, 1) != const0_rtx > >> && XEXP (condition, 1) != const1_rtx)) > >>return 0; > > > >no; that there is nothing wrong with doloop_condition_get -- > >changing it will not help, as it is not applied to the > >exit condition of the loop at all. The problem must be somewhere > >else. > > > >Zdenek > > > >> Thanks, > >> Vladimir > >> > >> > >> On 6/12/07, Zdenek Dvorak <[EMAIL PROTECTED]> wrote: > >> >Hello, > >> > > >> >> In file loop_doloop.c function doloop_condition_get makes sure that > >> >> the condition is GE or NE > >> >> otherwise it prevents doloop optimizations. This caused a problem for > >> >> a loop which had NE condition without unrolling and EQ if unrolling > >> >> was run. > >> > > >> >actually, doloop_condition_get is not applied to the code of the > >> >program, so this change is irrelevant (doloop_condition_get is applied > >> >to the doloop pattern from the machine description). 
So there must be > >> >some other reason why doloop transformation is not applied for your > >> >loop. > >> > > >> >Zdenek > >> > > >> >> Can I make doloop work after the unroller? > >> >> > >> >> Thanks, > >> >> Vladimir > >> >> > >> >> > >> > >> > >> >> Without unrolling: > >> >> (insn 135 80 136 4 (set (reg:SI 204 [ LastIndex ]) > >> >>(plus:SI (reg:SI 204 [ LastIndex ]) > >> >>(const_int -1 [0x]))) 51 {addsi3} (nil) > >> >>(nil)) > >> >> > >> >> (jump_insn 136 135 84 4 (set (pc) > >> >>(if_then_else (ne:SI (reg:SI 204 [ LastIndex ]) > >> >>(const_int 0 [0x0])) > >> >>(label_ref:SI 69) > >> >>(pc))) 368 {*spu.md:3288} (insn_list:REG_DEP_TRUE 135 > >(nil)) > >> >>(expr_list:REG_BR_PROB (const_int 9000 [0x2328]) > >> >>(nil))) > >> >> > >> >> > >> >> After unrolling: > >> >> (insn 445 421 446 21 (set (reg:SI 213) > >> >>(plus:SI (reg:SI 213) > >> >>(const_int -1 [0x]))) 51 {addsi3} (nil) > >> >>(nil)) > >> >> > >> >> (jump_insn 446 445 667 21 (set (pc) > >> >>(if_then_else (eq:SI (reg:SI 213) > >> >>(const_int 0 [0x0])) > >> >>(label_ref:SI 465) > >> >>(pc))) 368 {*spu.md:3288} (insn_list:REG_DEP_TRUE 445 > >(nil)) > >> >>(expr_list:REG_BR_PROB (const_int 1000 [0x3e8]) > >> >>(nil))) > >> > > >
Re: [tuples] Accessors for RHS of assignments
Hello, > So, I think I am still not convinced which way we want to access the RHS > of a GS_ASSIGN. > > Since GS_ASSIGN can have various types of RHS, we originally had: > > gs_assign_unary_rhs (gs) <- Access the only operand on RHS > gs_assign_binary_rhs1 (gs)<- Access the 1st RHS operand > gs_assign_binary_rhs2 (gs)<- Access the 2nd RHS operand > > And the corresponding _set functions. > > I then managed to half convince myself that it'd be better to have a > single gs_assign_rhs() accessor with a 'which' parameter. After > implementing that, I think I hate it. Particularly since this 'which' > parameter is just a number (0 or 1). It could be a mnemonic, but it > would still be annoying. > > So, I'm thinking of going back to the way it was before, but it is not > optimal. Do people feel strongly over one or the other? I may be missing something, but surely having the accessors uniform would be better? So that I can write things like /* Process all operands. */ for (i = 0; i < n_operands (gs); i++) process (gs_assign_rhs (gs, i)); rather than if (is_unary (gs)) process (gs_assign_unary_rhs (gs)); else if (is_binary (gs)) { process (gs_assign_binary_rhs1 (gs)); process (gs_assign_binary_rhs2 (gs)); } else if (is_ternary (gs)) ... Anyway, you can always #define gs_assign_unary_rhs(X) gs_assign_rhs(X, 0) #define gs_assign_binary_rhs1(X) gs_assign_rhs(X, 0) #define gs_assign_binary_rhs2(X) gs_assign_rhs(X, 1) as well, and use these in cases where you know with which arity you are working. Zdenek
Re: machine learning for loop unrolling
Hello, > Of course, instead of clock(), I'd like to use a non-intrusive > mechanism. However, my research on this topic didn't lead to anything > but perfsuite, which doesn't work very well for me (should it?). > > So here are the questions > > - how can I actually insert the code (I need to do this during the > loop-unrolling phase, when the code is already in RTL form)? that is a bit difficult; you will run into all sorts of problems, especially if you want to emit new calls. It would be simpler to emit the measurement code on gimple (see e.g. value-prof.c where something similar is done), although it would then also be nontrivial to match the gimple loops to those later found on RTL. In some older versions of gcc (I think 4.0) we used to have value profiling on RTL, you may check that to see how we used to do that. Zdenek
Re: Does unrolling prevents doloop optimizations?
Hello, > To make sure I understood you correctly, does it mean that the change > (below in /* */) in doloop_condition_get is safe? > > /* We expect a GE or NE comparison with 0 or 1. */ > if (/*(GET_CODE (condition) != GE > && GET_CODE (condition) != NE) > ||*/ (XEXP (condition, 1) != const0_rtx > && XEXP (condition, 1) != const1_rtx)) >return 0; no: there is nothing wrong with doloop_condition_get -- changing it will not help, as it is not applied to the exit condition of the loop at all. The problem must be somewhere else. Zdenek > Thanks, > Vladimir > > > On 6/12/07, Zdenek Dvorak <[EMAIL PROTECTED]> wrote: > >Hello, > > > >> In file loop_doloop.c function doloop_condition_get makes sure that > >> the condition is GE or NE > >> otherwise it prevents doloop optimizations. This caused a problem for > >> a loop which had NE condition without unrolling and EQ if unrolling > >> was run. > > > >actually, doloop_condition_get is not applied to the code of the > >program, so this change is irrelevant (doloop_condition_get is applied > >to the doloop pattern from the machine description). So there must be > >some other reason why doloop transformation is not applied for your > >loop. > > > >Zdenek > > > >> Can I make doloop work after the unroller?
> >> > >> Thanks, > >> Vladimir > >> > >> > > > >> Without unrolling: > >> (insn 135 80 136 4 (set (reg:SI 204 [ LastIndex ]) > >>(plus:SI (reg:SI 204 [ LastIndex ]) > >>(const_int -1 [0x]))) 51 {addsi3} (nil) > >>(nil)) > >> > >> (jump_insn 136 135 84 4 (set (pc) > >>(if_then_else (ne:SI (reg:SI 204 [ LastIndex ]) > >>(const_int 0 [0x0])) > >>(label_ref:SI 69) > >>(pc))) 368 {*spu.md:3288} (insn_list:REG_DEP_TRUE 135 (nil)) > >>(expr_list:REG_BR_PROB (const_int 9000 [0x2328]) > >>(nil))) > >> > >> > >> After unrolling: > >> (insn 445 421 446 21 (set (reg:SI 213) > >>(plus:SI (reg:SI 213) > >>(const_int -1 [0x]))) 51 {addsi3} (nil) > >>(nil)) > >> > >> (jump_insn 446 445 667 21 (set (pc) > >>(if_then_else (eq:SI (reg:SI 213) > >>(const_int 0 [0x0])) > >>(label_ref:SI 465) > >>(pc))) 368 {*spu.md:3288} (insn_list:REG_DEP_TRUE 445 (nil)) > >>(expr_list:REG_BR_PROB (const_int 1000 [0x3e8]) > >>(nil))) > >
Re: Does unrolling prevents doloop optimizations?
Hello, > In file loop_doloop.c function doloop_condition_get makes sure that > the condition is GE or NE > otherwise it prevents doloop optimizations. This caused a problem for > a loop which had NE condition without unrolling and EQ if unrolling > was run. actually, doloop_condition_get is not applied to the code of the program, so this change is irrelevant (doloop_condition_get is applied to the doloop pattern from the machine description). So there must be some other reason why doloop transformation is not applied for your loop. Zdenek > Can I make doloop work after the unroller? > > Thanks, > Vladimir > > > Without unrolling: > (insn 135 80 136 4 (set (reg:SI 204 [ LastIndex ]) >(plus:SI (reg:SI 204 [ LastIndex ]) >(const_int -1 [0x]))) 51 {addsi3} (nil) >(nil)) > > (jump_insn 136 135 84 4 (set (pc) >(if_then_else (ne:SI (reg:SI 204 [ LastIndex ]) >(const_int 0 [0x0])) >(label_ref:SI 69) >(pc))) 368 {*spu.md:3288} (insn_list:REG_DEP_TRUE 135 (nil)) >(expr_list:REG_BR_PROB (const_int 9000 [0x2328]) >(nil))) > > > After unrolling: > (insn 445 421 446 21 (set (reg:SI 213) >(plus:SI (reg:SI 213) >(const_int -1 [0x]))) 51 {addsi3} (nil) >(nil)) > > (jump_insn 446 445 667 21 (set (pc) >(if_then_else (eq:SI (reg:SI 213) >(const_int 0 [0x0])) >(label_ref:SI 465) >(pc))) 368 {*spu.md:3288} (insn_list:REG_DEP_TRUE 445 (nil)) >(expr_list:REG_BR_PROB (const_int 1000 [0x3e8]) >(nil)))
Re: Help understanding tree-affine.c
Hello, > I am trying to understand the usage of some functions in tree-affine.c > file and I appreciate your help. > > For example; for the two memory accesses > arr[b+8].X and arr[b+9].X, how does their affine combinations > will look like after executing the following sequence of operation? > (taken from http://gcc.gnu.org/ml/gcc-patches/2007-01/msg02605.html) > > get_inner_reference_aff (mem1, &off1, &size1); > get_inner_reference_aff (mem2, &off2, &size2); off1 will be affine combination 1 * &arr + 8 * b + 64, off2 will be 1 * &arr + 8 * b + 72 (assuming that size of the element of arr is 8), > aff_combination_expand (&off1, ttae_cache); > aff_combination_expand (&off2, ttae_cache); In these cases, off1 and off2 will be unchanged, since b is not defined in terms of an affine expression. > aff_combination_scale (&off1, double_int_minus_one); > aff_combination_add (&off2, &off1); This subtracts the two affine combinations, so off2 will be 8 in the end. > Also, why does diff->n is an indication for the fact that two memory > accesses can overlap in the following snippet taken from the above link? n is the number of terms of the affine combination, excluding the constant offset. For example, off1 above has n = 2, the terms of the affine combination (&arr, b) and their coefficients (1, 8) are kept in the elts array, the constant offset (64) is kept in the offset field. So, if there is any non-constant term (i.e., n > 0), we cannot be sure that the difference of the addresses is non-zero (actually, this could be improved -- we could consider the value of the offset modulo the greatest common divisor of the coefficients, and in some cases derive that there cannot be an overlap). Zdenek > static bool > cannot_overlap_p (aff_tree *diff, double_int size1, double_int size2) > { > double_int d, bound; > > /* Unless the difference is a constant, we fail. */ > if (diff->n != 0) > return false; > > Thanks in advance, > Revital
Re: machine learning for loop unrolling
Hello, > The number of floating point ops. in loop body. > The number of memory ops. in loop body. > The number of operands in loop body. > The number of implicit instructions in loop body. > The number of unique predicates in loop body. > The number of indirect references in loop body. > The number of uses in the loop. > The number of defs. in the loop. you have to scan insns in the loop body, and check what they do for this. > The number of parallel "computations" in loop. > The estimated latency of the critical path of loop. > The estimated cycle length of loop body. > The max. dependence height of computations. > The max. height of memory dependencies of computations. > The max. height of control dependencies of computations. > The average dependence height of computations. > The min. memory-to-memory loop-carried dependence. > The number of memory-to-memory dependencies. This is a bit more difficult; I guess you could persuade the scheduler to give you this information, but I have no idea how exactly. See modulo-sched.c, it considers similar kinds of information. > The language (C or Fortran). You may check the name in langhooks. > The tripcount of the loop (-1 if unknown). See find_simple_exit and related functions in loop-iv.c. > Here is how I'm thinking of conducting the experiment: > > - for each innermost loop: > - compile with the loop unrolled 1x, 2x, 4x, 8x, 16x, 32x and > measure the time the benchmark takes > - write down the loop features and the best unroll factor > - apply some machine learning technique to the above data to determine > the correlations between loop features and best unroll factor > - integrate the result into gcc and measure the benchmarks again > > Do you think it is ok to only consider inner-most loops? 
We do not unroll non-innermost loops at the moment, so if you want to test non-innermost loops, you would probably have to extend some loop manipulation functions (loop unrolling was written to work for non-innermost loops as well, but it was not well tested and thus it is very likely buggy). > What about > the unroll factors? Should I consider bigger unroll factors? I think unroll factors over 32 should not give you much more gain, but you will see what results you get. > Do you think the above setup is ok? I am somewhat skeptical; I would expect the results to be quite target-dependent, and also to differ from program to program. Thus, it may be somewhat hard to derive a useful general heuristic from them. But I would be very happy to be proven wrong :-) Zdenek
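The measurement step of the experiment described above can be sketched in C; recording the feature vector next to the winning factor is the part a learner would consume later. The struct fields and the helper are illustrative only, not an existing GCC interface:

```c
#include <stddef.h>

/* A few of the loop features listed above; purely illustrative.  */
struct loop_features
{
  int fp_ops, mem_ops, operands, indirect_refs;
  long tripcount;               /* -1 if unknown */
};

/* One training sample: the features of a loop plus the unroll factor
   that produced the fastest run for the enclosing benchmark.  */
struct sample
{
  struct loop_features features;
  int best_factor;
};

/* Given the run time measured for each candidate unroll factor, pick
   the winner; FACTORS and SECONDS are parallel arrays of length N.  */
static int
best_unroll_factor (const int *factors, const double *seconds, size_t n)
{
  size_t best = 0, i;

  for (i = 1; i < n; i++)
    if (seconds[i] < seconds[best])
      best = i;
  return factors[best];
}
```

With the candidate factors {1, 2, 4, 8, 16, 32} and one timing per build, this reduces each benchmark run to a single (features, best_factor) sample for the learning step.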
Re: [rfc] Moving bbs back to pools
Hello, > > The problem is, that it does not give any speedups (it is almost > > completely compile-time neutral for compilation of preprocessed > > gcc sources). I will check whether moving also edges to pools > > changes anything, but so far it does not seem very promising :-( > > Well, the benefits of these things are always very difficult to predict > :(. We later tweaked ggc-page to put basic blocks into > extra_order_size_table, which might've improved locality and got back some > part of the original 1% slowdown. > > I would expect edges to be more interesting since they are smaller and > there are more of them, but I might be mistaken. at least for the preprocessed gcc sources, this turns out not to be the case. I guess there are two other reasons why we do not see the expected gains: 1) previously we kept only one cfg at a time (so using a pool meant that the same memory was reused for each function), while now we keep cfgs for all functions alive at the same time. 2) garbage collection now needs to traverse basic blocks and edges (through the dynamic roots), while previously it did not see them at all. Zdenek
Re: [rfc] Moving bbs back to pools
Hello, > Ian Lance Taylor <[EMAIL PROTECTED]> writes: > > > Zdenek Dvorak <[EMAIL PROTECTED]> writes: > > > > > The problem is, that it does not give any speedups (it is almost > > > completely compile-time neutral for compilation of preprocessed > > > gcc sources). I will check whether moving also edges to pools > > > changes anything, but so far it does not seem very promising :-( > > > > Does it make any difference for gcc.c-torture/compile/20001226-1.c? > > And of course it won't make any difference if you have enough memory > > that you never need to garbage collect without your patch. > > I was wrong to say that. It could make a difference even if you don't > garbage collect, since the allocation will be cheaper. But that is > likely to be a relatively small effect. actually, the expected speedup was mostly supposed to come from better locality. The speed of garbage collection should be about unchanged. Regarding the speed of allocation, pool_alloc might be slightly faster, but probably not significantly. Zdenek
[rfc] Moving bbs back to pools
Hello, as discussed in http://gcc.gnu.org/ml/gcc-patches/2007-05/msg01133.html, it might be a good idea to try moving cfg to alloc pools. The patch below does that for basic blocks (each function has a separate pool from that its basic blocks are allocated). At the moment, the patch breaks precompiled headers, but otherwise bootstraps and passes regtesting. The problem is, that it does not give any speedups (it is almost completely compile-time neutral for compilation of preprocessed gcc sources). I will check whether moving also edges to pools changes anything, but so far it does not seem very promising :-( Zdenek Index: doc/gty.texi === *** doc/gty.texi(revision 125526) --- doc/gty.texi(working copy) *** reachable. This routine should not chang *** 292,297 --- 292,304 routine. Its only argument is a pointer to the just marked (const) structure or union. + @findex custom_mark + @item custom_mark ("@var{marking-routine-name}") + + If provided for a structure or union type, the given + @var{marking-routine-name} (between double-quotes) is the name of a + routine called to mark the object instead of ggc_set_mark. 
+ @findex maybe_undef @item maybe_undef Index: gengtype.c === *** gengtype.c (revision 125526) --- gengtype.c (working copy) *** walk_type (type_p t, struct walk_type_da *** 1916,1921 --- 1916,1923 desc = oo->info; else if (strcmp (oo->name, "mark_hook") == 0) ; + else if (strcmp (oo->name, "custom_mark") == 0) + ; else if (strcmp (oo->name, "nested_ptr") == 0) nested_ptr_d = (const struct nested_ptr_data *) oo->info; else if (strcmp (oo->name, "dot") == 0) *** write_func_for_structure (type_p orig_s, *** 2418,2423 --- 2420,2426 const char *chain_next = NULL; const char *chain_prev = NULL; const char *mark_hook_name = NULL; + const char *marker_routine = wtd->marker_routine; options_p opt; struct walk_type_data d; *** write_func_for_structure (type_p orig_s, *** 2437,2442 --- 2440,2449 chain_prev = opt->info; else if (strcmp (opt->name, "mark_hook") == 0) mark_hook_name = opt->info; + else if (strcmp (opt->name, "custom_mark") == 0 +/* FIXME -- this will break pch. */ +&& !wtd->param_prefix) + marker_routine = opt->info; if (chain_prev != NULL && chain_next == NULL) error_at_line (&s->u.s.line, "chain_prev without chain_next"); *** write_func_for_structure (type_p orig_s, *** 2473,2479 s->kind == TYPE_UNION ? "union" : "struct", s->u.s.tag); if (chain_next == NULL) { ! oprintf (d.of, " if (%s (x", wtd->marker_routine); if (wtd->param_prefix) { oprintf (d.of, ", x, gt_%s_", wtd->param_prefix); --- 2480,2486 s->kind == TYPE_UNION ? "union" : "struct", s->u.s.tag); if (chain_next == NULL) { ! oprintf (d.of, " if (%s (x", marker_routine); if (wtd->param_prefix) { oprintf (d.of, ", x, gt_%s_", wtd->param_prefix); *** write_func_for_structure (type_p orig_s, *** 2484,2490 } else { ! oprintf (d.of, " while (%s (xlimit", wtd->marker_routine); if (wtd->param_prefix) { oprintf (d.of, ", xlimit, gt_%s_", wtd->param_prefix); --- 2491,2497 } else { ! 
oprintf (d.of, " while (%s (xlimit", marker_routine); if (wtd->param_prefix) { oprintf (d.of, ", xlimit, gt_%s_", wtd->param_prefix); *** write_func_for_structure (type_p orig_s, *** 2516,2523 oprintf (d.of, ");\n"); oprintf (d.of, "if (xprev == NULL) break;\n"); oprintf (d.of, "x = xprev;\n"); ! oprintf (d.of, "(void) %s (xprev", ! wtd->marker_routine); if (wtd->param_prefix) { oprintf (d.of, ", xprev, gt_%s_", wtd->param_prefix); --- 2523,2529 oprintf (d.of, ");\n"); oprintf (d.of, "if (xprev == NULL) break;\n"); oprintf (d.of, "x = xprev;\n"); ! oprintf (d.of, "(void) %s (xprev", marker_routine); if (wtd->param_prefix) { oprintf (d.of, ", xprev, gt_%s_", wtd->param_prefix); Index: final.c === *** final.c (revision 125526) --- final.c (working copy) *** rest_of_handle_final (void) *** 3996,4004 if (! quiet_flag) fflush (asm_out_file); - /* Release all memory allocated by flow. */ - free_basic_block_vars (); - /* Write DBX symbols if requested. */ /* Note that for those i
Re: Predictive commoning miscompiles 482.sphinx3 in SPEC CPU 2006
Hello, > > Because the patch had other effects like adding a DCE after Copyprop > > in the loop optimizer section. > > > > Disable DCE after Copyprop in the loop optimizer section fixes my > problem. Any idea why? no, not really; it could be anything (it may even have nothing to do with dce, performing dce can enable/alter other loop transformations). Zdenek
Re: A problem with the loop structure
Hello, > ii) > In loop_version there are two calls to loop_split_edge_with > 1. loop_split_edge_with (loop_preheader_edge (loop), NULL); > 2. loop_split_edge_with (loop_preheader_edge (nloop), NULL); > nloop is the versioned loop, loop is the original. > > loop_split_edge_with has the following: > new_bb = split_edge (e); > add_bb_to_loop (new_bb, loop_c); > > 1) When we get to the first call, nloop->outer->num_nodes = 8 while dfs > returns 6. then the problem is before this call; you need to check which two blocks marked as belonging to nloop->outer in fact do not belong to this loop, and why. Zdenek
Re: A problem with the loop structure
Hello, > (based on gcc 4.1.1). now that is a problem; things have changed a lot since then, so I am not sure how much I will be able to help. > 1. The problem was unveiled by compiling a testcase with dump turned > on. The compilation failed while calling function get_loop_body from > flow_loop_dump on the following assert : > > else if (loop->latch != loop->header) >{ > tv = dfs_enumerate_from (loop->latch, 1, glb_enum_p, > tovisit + 1, loop->num_nodes - 1, > loop->header) + 1; > > > gcc_assert (tv == loop->num_nodes); > > The compilation exits successfully if compiled without enabling the dump. this means that there is some problem in some loop transformation, forgetting to record membership of some blocks to their loops or something like that. > 2. SMS pass contained a single call to loop_version on the loop to be > SMSed. This happened before any SMS related stuff was done. Trying to > call verify_loop_structure(loops) just after the call to loop_version > failed on the same assert in get_loop_body as in (1). The loop on > which we fail is neither the versioned loop nor the new loop. Probably it is their superloop? > Below > there are dumps to verify_loop_structure called from different places > in loop_version: These dumps are not very useful, loop structures do not have to be consistent in the middle of the transformation. > 3. At the very beginning of the SMS pass we build the loop structure > using build_loops_structure defined in modulo-sched.c. Just after the > > call I tried to print in gdb the loop on which we failed in > get_loop_body. This failed as well > > (gdb) p print_loop(dumpfile, 0xbabe20, 0) Do not use print_loop, that only works on gimple. Use flow_loops_dump to find which blocks belong to which loops, and possibly debug_bb_n if you need to see the contents of the blocks. Zdenek
Re: GCC 4.2.0 Status Report (2007-04-24)
Hello, > 4. PR 31360: Missed optimization > > I don't generally mark missed optimization bugs as P1, but not hoisting > loads of zero out of a 4-instruction loop is bad. Zdenek has fixed this > on mainline. Andrew says that patch has a bug. So, what's the story here? I found the problem, I will send a patch once it passes regtesting. Zdenek
Re: GCC mini-summit - compiling for a particular architecture
Hello, > On Sun, 2007-04-22 at 14:44 +0200, Richard Guenther wrote: > > On 4/22/07, Laurent GUERBY <[EMAIL PROTECTED]> wrote: > > > > > but also does not make anyone actually use the options. Nobody reads > > > > > the documention. Of course, this is a bit overstatement, but with a > > > > > few exceptions, people in general do not enable non-default flags. > > > > > > > > I don't think this is fair. > > > > Most people don't read the docs because they don't care about > > > > performance, but most people who develop code that spends a lot of CPU > > > > cycles actually read the docs at least up to loop unrolling. > > > > > > Exactly my experience. > > > > > > Unfortunately there's no useful information on this topic in the GCC > > > manual... > > > > Well, we have too many switches really. So the default is use -O2. If you > > want extra speed, try -O3, or even better use profile feedback. (Not many > > people realize that with profile feedback you get faster code than with > > -O3 and smaller code than with -Os - at least for C++ programs) > > At work we use -O3 since it gives 5% performance gain against -O2. > profile-feedback has many flags actually, only two that are really important -- -fprofile-generate and -fprofile-use. Zdenek
Re: GCC mini-summit - compiling for a particular architecture
> Look from what we're starting: > > << > @item -funroll-loops > @opindex funroll-loops > Unroll loops whose number of iterations can be determined at compile > time or upon entry to the loop. @option{-funroll-loops} implies > @option{-frerun-cse-after-loop}. This option makes code larger, > and may or may not make it run faster. > > @item -funroll-all-loops > @opindex funroll-all-loops > Unroll all loops, even if their number of iterations is uncertain when > the loop is entered. This usually makes programs run more slowly. > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}, > >> > > It could gain a few more paragraphs written by knowledgeable people. > And expanding documentation doesn't introduce regressions :). but it also does not make anyone actually use the options. Nobody reads the documentation. Of course, this is a bit of an overstatement, but with a few exceptions, people in general do not enable non-default flags. Zdenek
Re: GCC mini-summit - compiling for a particular architecture
Hello, > Steve Ellcey wrote: > > >This seems unfortunate. I was hoping I might be able to turn on loop > >unrolling for IA64 at -O2 to improve performance. I have only started > >looking into this idea but it seems to help performance quite a bit, > >though it is also increasing size quite a bit too so it may need some > >modification of the unrolling parameters to make it practical. > > To me it is obvious that optimizations are target dependent. For > instance loop unrolling is really a totally different optimization > on the ia64 as a result of the rotating registers. that we do not use. Nevertheless, there are still compelling reasons why unrolling is more useful on ia64 than on other architectures (importance of scheduling, insensitivity to code size growth). Another option would be to consider enabling (e.g.) -funroll-loops -fprefetch-loop-arrays by default at -O3. I think it is fairly rare for these flags to cause performance regressions (although of course more measurements to support this claim would be necessary). Zdenek
Re: adding dependence from prefetch to load
Hello, > Well, the target architecture is actually quite peculiar, it's a > parallel SPMD machine. The only similarity with MIPS is the ISA. The > latency I'm trying to hide is somewhere around 24 cycles, but because it > is a parallel machine, up to 1024 threads have to stall for 24 cycles in > the absence of prefetching, which affects overall performance. > My initial studies show that this latency can be hidden with a properly > inserted prefetch instruction, and I think that the scheduler can help > with that, if properly guided. > > So my initial question remains: is there any way to tell the scheduler > not to place the prefetch instruction after the actual read? > > The prefetch instruction takes an address_operand, and it seems all I > need to do is tell the scheduler prefetch will "write" to that address, > so it will see a true dependence between the prefetch and the read. this might also restrict the possibility to move the prefetch, since it would prevent it from being moved over any other memory operations that alias with the prefetched one. Unfortunately, I do not really understand the internals of the gcc scheduler, so I cannot give you more concrete help; but hopefully someone else will. Zdenek > But > I don't know how to do that, and changing the md file to say "+p" or > "+d" for the first operand of the prefetch didn't help.
Re: adding dependence from prefetch to load
Hello, > 2. Right now I am inserting a __builtin_prefetch(...) call immediately > before the actual read, getting something like: > D.1117_12 = &A[D.1101_14]; > __builtin_prefetch (D.1117_12, 0, 1); > D.1102_16 = A[D.1101_14]; > > However, if I enable the instruction scheduler pass, it doesn't realize > there's a dependency between the prefetch and the load, and it actually > moves the prefetch after the load, rendering it useless. How can I > instruct the scheduler of this dependence? > > My thinking is to also specify a latency for prefetch, so that the > scheduler will hopefully place the prefetch somewhere earlier in the > code to partially hide this latency. Do you see anything wrong with this > approach? well, it assumes that the scheduler works with a long enough lookahead to actually be able to move the prefetch far enough; i.e., if the architecture you work with is relatively slow in comparison with the memory access times, this might be a feasible approach. However, on modern machines, a miss in the L2 cache may take hundreds of cycles, and it is not clear to me that the scheduler will be able to move the prefetch so far, or indeed, that it would even be possible (I think often you do not know the address far enough in advance). Also, prefetching outside of loops in general appears not to be all that profitable, since usually most of the time is spent within loops. So I would recommend first doing some analysis and measurements (say by introducing the prefetches by hand) to check whether this project really has potential to lead to significant speedups. Zdenek
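A hand-inserted prefetch of the kind suggested at the end of the reply might look like the following. The prefetch distance PF_DIST is a made-up tuning knob, and whether it helps at all is exactly what the measurement would establish:

```c
#include <stddef.h>

/* PF_DIST is how many elements ahead to prefetch; the right value
   depends on the latency being hidden and must be tuned by experiment.  */
#define PF_DIST 16

double
sum_with_prefetch (const double *a, size_t n)
{
  double s = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (i + PF_DIST < n)
        /* rw = 0 (read), locality = 1 (low temporal locality).  */
        __builtin_prefetch (&a[i + PF_DIST], 0, 1);
      s += a[i];
    }
  return s;
}
```

Timing this against the plain loop, and varying PF_DIST, gives the kind of by-hand measurement recommended above before committing to a scheduler-based approach.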
Re: Proposal: changing representation of memory references
Hello, > >Remarks: > >-- it would be guaranteed that the indices of each memory reference are > > independent, i.e., that &ref[idx1][idx2] == &ref[idx1'][idx2'] only > > if idx1 == idx1' and idx2 = idx2'; this is important for dependency > > analysis (and for this reason we also need to remember the list of > > indices, rather than trying to reconstruct them from the address). > > I didn't completely think through this, > but how would this property help in the presence of array flattening ? > e.g. suppose two adjacent loops, both two-nest-deep, > and only one of them is flattened, then > one loop will have one dimensional access (hence will have only one index) > vs the other loop will have two dimensional. > In other words, &ref[idx1] == &ref[idx1'][idx2'] can happen. this cannot happen. Suppose you have a declaration int ref[100][100]; Then the first reference would be represented as tmp = (int (*)[1])) &ref; tmp[idx1] It cannot have ref as its base. Similarly, if one wanted to implement the reverse transformation, recovering multidimensional arrays from the flat representation, he would have to transform int a[1]; for (i = 0; i < 100; i++) for (j = 0; j < 100; j++) a[100 * i + j] = ... to tmp = (int (*)[100][100]) &a; for (i = 0; i < 100; i++) for (j = 0; j < 100; j++) ref with base: tmp, indices: (i,j) = ... > Another question is, how would the representation look > for more complicated address calculations > e.g. a closed hashtable access like: > > table[hash_value % hash_size] > > and would it help in such cases ? I am not quite sure what you ask for here; this would just be represented as idx = hash_value % hash_size; and reference with base: table, indices: (idx) Zdenek
Re: Proposal: changing representation of memory references
Hello, > >> >> That is, unless we could share most of the index struct (upper, > >> >> lower, step) among expressions that access them (IE make index be > >> >> immutable, and require unsharing and resharing if you want to modify > >> >> the expression). > >> > > >> >That appears a bit dangerous to me, I would rather avoid such tricks. > >> > >> Like Richard, I have doubts you are not going to increase memory if > >> you store *this* much in every single memory access. > > > >for structured accesses (ARRAY_REFs etc.) we now need much more memory -- > >4 pointers, plus overhead of the common tree fields. My proposal will > >save some memory here. > > I'm not sure how, since you have more than 4 pointers worth of info in: > > -- base of the reference -> pointer > -- constant offset -> HOST_WIDE_INT > -- vector of indices -> embedded vec > -- type of the accessed location -> Pointer > -- original tree of the memory reference (or another summary of the > structure of the access, for aliasing purposes) -> pointer > -- flags -> No idea > > for each index, we remember > -- lower and upper bound -> pointers > -- step -> pointer > -- value of the index -> pointer > > What have i misunderstood? consider a[i][j], on a 32-bit machine: Currently, this is two ARRAY_REFs, 40 bytes each, for a total of 80 bytes. With my proposal, this would be 24 bytes (the items contained in the base of tree_exp: tree_common, including type+flags; locus, and block), base, offset, summary, vector (16 bytes), in the vector size and bound (8 bytes), plus 2 indices (16 bytes each), for a total of 80 bytes. I.e., we get the same memory consumption for references with at least two levels. For each other index or component ref, the proposed representation saves 40 - 16 = 24 bytes. 
There are some possible improvements: -- if we used tail array for the indices instead of vector, we could save some 8 bytes -- assuming that the case that the bounds or step of the array are varying is infrequent, we can change the representation to require -- 8 bytes/index if both bounds and the step are constants, -- ~32 bytes otherwise by refering to the appropriate array type in the former case and TREE_VEC containing the three values in the latter case, or some similar representation Zdenek
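The byte counts in the message can be tallied mechanically; the named constants below simply restate the figures quoted for a 32-bit host, so the arithmetic can be checked:

```c
/* Sizes quoted in the message, for a 32-bit host (4-byte pointers).  */
enum
{
  ARRAY_REF_SIZE = 40,            /* one ARRAY_REF node today */
  CURRENT = 2 * ARRAY_REF_SIZE,   /* a[i][j]: two ARRAY_REFs */

  EXP_HEADER = 24,    /* tree_common (type + flags), locus, block */
  REF_FIELDS = 16,    /* base, offset, summary, vector pointers */
  VEC_HEADER = 8,     /* vector size and bound */
  INDEX_SIZE = 16,    /* one index record */

  PROPOSED = EXP_HEADER + REF_FIELDS + VEC_HEADER + 2 * INDEX_SIZE,

  /* Saving claimed per additional index or component ref.  */
  SAVING_PER_LEVEL = ARRAY_REF_SIZE - INDEX_SIZE
};
```

Both representations come to 80 bytes for the two-level reference, and each further level costs 16 bytes instead of 40, matching the 24-byte saving stated above.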
Re: Proposal: changing representation of memory references
Hello, > >at the moment, any pass that needs to process memory references is > >complicated (or restricted to handling just a limited set of cases) by > >the need to interpret the quite complex representation of memory > >references that we have in gimple. For example, there are about 1000 > >lines of quite convoluted code in tree-data-ref.c and about 500 lines in > >tree-ssa-loop-ivopts.c dealing with parsing and analysing memory > >references. > I have my doubts changing this is the correct step forward. Having a > high level representation is a good thing and having it represented as > *(base+offset) (plus other info) might just make the other passes that > like the high level representation get worse. I also don't see why > tree-data-ref.c and tree-ssa-loop-ivopts.c could not use > get_inner_reference which parses the memory references for you. for many reasons. Probably the most important is that you could not recover indices of multidimensional arrays this way, thus making dependency analysis impossible. Also, with pointer arithmetics, it is not sufficient just to call get_inner_reference, you also need to traverse SSA chains. Also, time and memory-wise, calling get_inner_reference is fairly expensive. Note that the representation I propose is not really low-level. It carefully preserves all the information you could get from the "high-level" representation (indices and their ranges, alias information). We already have a low-level representation of memory references (TARGET_MEM_REFs); I do not feel any need to make an alternative proposal for that at the moment. > Maybe > I don't see the benefit in always changing our IR without really > thinking about the problem and seeing if there are already tools > (functions) which do the same thing in a common place. Sorry, but you are completely off the mark here. I have spent a lot of time working with the code for dealing with memory references, trying many different approaches. 
Yes, I know about get_inner_reference, I use it where appropriate, and for many reasons (e.g., those mentioned above), it does not do the job in general. Also, I did not just pull the proposal out of thin air. I have read previous discussions regarding the topic, and I took the opportunity to ask how the memory references are represented in ICC (which turns out to be quite similar to my proposal). I assume that you are trying to help to get us towards some solution, however please note that some people could find remarks like this offensive. Zdenek
Re: Proposal: changing representation of memory references
Hello, > > -- base of the reference > > -- constant offset > > -- vector of indices > > -- type of the accessed location > > -- original tree of the memory reference (or another summary of the > > structure of the access, for aliasing purposes) > > -- flags > > What do you do with Ada COMPONENT_REFs, at a variable offset? for concreteness, let us consider a.x{offset: n} I have considered three possibilities: 1) making the offset part of the expression for base, i.e., having tmp = &a + n and reference with base: tmp 2) allowing the offset in the reference to be non-constant, i.e., having reference with base: &a, offset: n 3) making this offset into an index, i.e., having base: &a, indices: (step:1, value: n) Out of these, I like 3) the most, although it might be fairly expensive memory-wise (any idea how common the variable offsets are?) Zdenek
Re: Proposal: changing representation of memory references
Hello, > >> >-- flags > >> > > >> >for each index, we remember > >> >-- lower and upper bound > >> >-- step > >> >-- value of the index > >> > >> This seems a lot, however, since most of it can be derived from the > >> types, why are we also keeping it in the references. > > > >The lower bound and step may be SSA_NAMEs, see the current ARRAY_REF, > >and as such, they need to be stored somewhere in IR (not just in type). > > > >> That is, unless we could share most of the index struct (upper, > >> lower, step) among expressions that access them (IE make index be > >> immutable, and require unsharing and resharing if you want to modify > >> the expression). > > > >That appears a bit dangerous to me, I would rather avoid such tricks. > > Like Richard, I have doubts you are not going to increase memory if > you store *this* much in every single memory access. for structured accesses (ARRAY_REFs etc.) we now need much more memory -- 4 pointers, plus overhead of the common tree fields. My proposal will save some memory here. On unstructured accesses (INDIRECT_REFs), the proposal would indeed require three extra pointers. I now think keeping INDIRECT_REFs might turn out to be necessary, to represent unstructured memory accesses. Having to deal with INDIRECT_REFs should not complicate optimizers significantly, so it would not defeat the purpose of the patch. All in all, I believe it will need some experimentation, but we should not consume more memory than we do now, and possibly we could even save somewhat. Zdenek
Re: Proposal: changing representation of memory references
Hello, > >Proposal: > > > >For each memory reference, we remember the following information: > > > >-- base of the reference > >-- constant offset > >-- vector of indices > >-- type of the accessed location > >-- original tree of the memory reference (or another summary of the > > structure of the access, for aliasing purposes) > This i don't think we should keep, because i don't believe it's > necessary for aliasing. At worst, aliasing could come up with its > own summary if it really wanted to. long term, I totally agree with you. Short term, I think even implementing the proposal in the current form will be a lot of work; I somewhat fear having to include modifying alias analysis significantly in it (of course, if someone familiar with tree alias analysis -- i.e., you or Diego -- volunteered to help me with this part of the change, it would help a lot). > >-- flags > > > >for each index, we remember > >-- lower and upper bound > >-- step > >-- value of the index > > This seems a lot, however, since most of it can be derived from the > types, why are we also keeping it in the references. The lower bound and step may be SSA_NAMEs, see the current ARRAY_REF, and as such, they need to be stored somewhere in IR (not just in type). > That is, unless we could share most of the index struct (upper, > lower, step) among expressions that access them (IE make index be > immutable, and require unsharing and resharing if you want to modify > the expression). That appears a bit dangerous to me, I would rather avoid such tricks. Zdenek
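The fields listed in the proposal map naturally onto a record like the following. This is only an illustration in plain C, with tree_node left opaque; the field names are invented for the sketch, not taken from an actual patch:

```c
/* Opaque stand-in for GCC's tree nodes.  */
struct tree_node;

/* One index of the reference; the lower bound and step may be
   SSA_NAMEs, which is why they live here and not only in the type.  */
struct mem_ref_index
{
  struct tree_node *lower;
  struct tree_node *upper;
  struct tree_node *step;
  struct tree_node *value;
};

struct mem_ref
{
  struct tree_node *base;        /* base of the reference */
  long offset;                   /* constant offset */
  struct tree_node *type;        /* type of the accessed location */
  struct tree_node *summary;     /* original tree, for alias analysis */
  unsigned flags;
  unsigned n_indices;            /* 0 for an unstructured access */
  struct mem_ref_index *indices; /* could also be a tail array */
};
```

An unstructured access (the INDIRECT_REF case discussed above) is simply the record with n_indices == 0, which is where the few extra pointers of overhead relative to the current representation come from.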
Re: Proposal: changing representation of memory references
Hello, > >> This looks like a very complicated (though very generic) way of > >> specifying a memory > >> reference. Last time we discussed this I proposed to just have BASE, > >OFFSET > >> and accessed TYPE (and an alias tag of the memory reference). I realize > >> this > >> doesn't cover accesses to multi-dimensional arrays, but using the > >> full-blown scheme > >> in place of a simple INDIRECT_REF makes memory usage for the trees go up > >> noticeably. > > > >by the way, does your claim "using the full-blown scheme... makes memory > >usage go up noticeable" mean that you have experimented with some > >similar representation? If so, could you please post your > >patches/results? Or do you have at least some data regarding how great > >part of the tree memory consumption comes from memory references? > > No, I didn't do patches or measurements. I just extrapolated that if you > want > to handle things like a.x[i].y.z[j] you need to store not only two indices > but > either two strides or two types (to extract the stride). So you'd > have (assuming > a flat tree) for example two VECs in the memory reference tree. Together > with the (possibly non-constant) offset and the base these are four pointers > compared to one in a simple INDIRECT_REF. to make this comparison fair, you also need to take into account that the computations need to be done somewhere. So for the "simple INDIRECT_REF", you in fact have the following statements in the program (assuming that lower bound of each array is 0, otherwise you would also have the subtractions): off1 = step1 * i; off2 = step2 * j; off = off1 + off2; addr = base + off; *addr Given all the temporary variables, expression and assignment nodes needed for this, this more than makes up for any extra space that is used in the proposed representation. Of course, in some situations some of these computations may be shared. 
For simple INDIRECT_REF (without any structure), we store base, plus we have fields for offset (constant 0) and the vector of indices (NULL). So there are two extra pointers over the current state. Without experimentation, I cannot tell whether overall, my proposal would require more or less memory than we use now. In fact, if this turned out to be a problem, I can live with allowing also INDIRECT_REF in its current form, to represent references that do not have any structure. > Maybe we can incrementally create out of all the existing memory reference > two reference trees, one simple that does not handle multi-dimensional array > accesses well and one full-blown one? Would not that just mean that you would consume even more memory? If you mean incremental lowering, yes, I can imagine we might want to do that (possibly lowering the references later to the existing TARGET_MEM_REFs), but it does not really help with the memory consumption. Zdenek
Re: Proposal: changing representation of memory references
Hello,

> This looks like a very complicated (though very generic) way of
> specifying a memory reference.  Last time we discussed this I proposed
> to just have BASE, OFFSET and accessed TYPE (and an alias tag of the
> memory reference).  I realize this doesn't cover accesses to
> multi-dimensional arrays, but using the full-blown scheme in place of
> a simple INDIRECT_REF makes memory usage for the trees go up
> noticeably.

by the way, does your claim "using the full-blown scheme... makes
memory usage go up noticeably" mean that you have experimented with
some similar representation?  If so, could you please post your
patches/results?  Or do you have at least some data regarding how great
a part of the tree memory consumption comes from memory references?

Zdenek
Re: Proposal: changing representation of memory references
Hello,

> This looks like a very complicated (though very generic) way of
> specifying a memory reference.  Last time we discussed this I proposed
> to just have BASE, OFFSET and accessed TYPE (and an alias tag of the
> memory reference).  I realize this doesn't cover accesses to
> multi-dimensional arrays, but using the full-blown scheme in place of
> a simple INDIRECT_REF makes memory usage for the trees go up
> noticeably.

as was pointed out before, you cannot avoid remembering the indices of
the memory reference in some way if you want to be able to do
dependency analysis.  Also, many high-level optimizations work with
multidimensional arrays, so the representation should make this
possible.

Regarding the memory consumption, let us set aside the part of the
information used for alias analysis (there, for simplicity, the
proposal preserves the current representation, but that can be
changed).  Then the proposed representation is still cheaper than the
current way (COMPONENT_REFs are collapsed into a single offset field;
only the necessary information is recorded for indices, whereas now we
have redundant information stored in ARRAY_REFs).  For simple
INDIRECT_REFs, there are no indices, so the overhead of the proposal
over the base+offset one is basically one pointer.

Zdenek
Proposal: changing representation of memory references
Hello,

at the moment, any pass that needs to process memory references is
complicated (or restricted to handling just a limited set of cases) by
the need to interpret the quite complex representation of memory
references that we have in gimple.  For example, there are about 1000
lines of quite convoluted code in tree-data-ref.c and about 500 lines
in tree-ssa-loop-ivopts.c dealing with parsing and analysing memory
references.

I would like to propose (and possibly also work on implementing) a more
uniform representation, described in more detail below.  The proposal
is based on the previous discussions
(http://gcc.gnu.org/ml/gcc/2006-06/msg00295.html) and on what I learned
about the way memory references are represented in ICC.  It also
subsumes the patches of Daniel to make p[i] (where p is a pointer) use
ARRAY_REFs/MEM_REFs.  I am not sure whether someone does not already
work on something similar, e.g. with respect to LTO (where the
mentioned discussion started), or the gimple-tuples branch?

Proposal:

For each memory reference, we remember the following information:

-- base of the reference
-- constant offset
-- vector of indices
-- type of the accessed location
-- original tree of the memory reference (or another summary of the
   structure of the access, for aliasing purposes)
-- flags

For each index, we remember

-- lower and upper bound
-- step
-- value of the index

The address of the reference can be computed as

  base + offset + sum_{idx} offsetof(idx)

where offsetof(idx) = (value - lower) * step.

For example, a.x[i][j].y would be represented as

  base = &a
  offset = offsetof (x) + offsetof (y)
  indices:
    lower = 0, upper = ?, step = sizeof (a.x[i]), value = i
    lower = 0, upper = ?, step = sizeof (a.x[i][j]), value = j

p->x would be represented as

  base = p
  offset = offsetof (x)
  indices: empty

p[i] as

  base = p
  offset = 0
  indices:
    lower = 0, upper = ?
    step = sizeof (p[i]), value = i

Remarks:

-- it would be guaranteed that the indices of each memory reference are
   independent, i.e., that &ref[idx1][idx2] == &ref[idx1'][idx2'] only
   if idx1 == idx1' and idx2 == idx2'; this is important for dependency
   analysis (and for this reason we also need to remember the list of
   indices, rather than trying to reconstruct them from the address).

-- it would be guaranteed that the computations of the address do not
   overflow.

-- possibly there could be a canonicalization pass that, given

     for (p = q; p < q + 100; p++)
       p->x = ...  {represented the way described above}

   would transform it to

     for (p = q, i = 0; p < q + 100; p++, i++)
       {base = q, offset = offsetof(x),
        indices: lower = 0, upper = ?, step = sizeof (*p), value = i}

   so that dependence analysis and friends do not have to distinguish
   between accesses through pointers and arrays.

Zdenek
Re: Valid gimple for MEM_REF
Hello,

> > only gimple_vals (name or invariant).  However, the expressions are
> > matched in the final_cleanup dump (after out-of-ssa and ter), so
> > this no longer is the case.  I think just the regular expressions
> > need to be updated.
>
> Then IV-OPTs has an issue too, but the IV-OPTs dump gives:
>
>   D.1604_5 = MEM[base: (double *) &a, index: ivtmp.34_12];
>   MEM[base: (double *) &c, index: ivtmp.34_12] = D.1604_5;
>
> the expression matching in final_cleanup was just a symptom of the
> issue.

OK, I will have a look at that.

Zdenek