On Sun, May 7, 2017 at 11:22 AM, Fredrik Hederstierna <fredrik.hederstie...@verisure.com> wrote: > Hi, > > I have a question about loop induction variables, related to > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67213 > > Consider a simple loop like > > int ix; > for (ix = 0; ix < 6; ix++) { > data[ix] = ix; > } > > In this case variable 'ix' is used as counting variable for array index, > but also used as value for storage, so in some sense its used in two > different "ways". > > Several target architectures might have support to auto-increment pointer > registers when storing or reading data, > but if same variable is used as both array index to pointer and as data it > might be more complicated? > > In the example above, it seems like GCC loop analyzer (for completely > peeling) thinks 'ix' is an induction variable that can be folded away: > from 'tree-ssa-loop-ivcanon.c': > > Loop 1 iterates 5 times. > Loop 1 iterates at most 5 times. > Estimating sizes for loop 1 > BB: 3, after_exit: 0 > size: 0 _4 = (char) i_9; > Induction variable computation will be folded away. > size: 1 data[i_9] = _4; > size: 1 i_6 = i_9 + 1; <----- OK? > Induction variable computation will be folded away. > size: 1 ivtmp_7 = ivtmp_1 - 1; > Induction variable computation will be folded away. > size: 2 if (ivtmp_7 != 0) > Exit condition will be eliminated in peeled copies. > BB: 4, after_exit: 1 > size: 5-4, last_iteration: 5-2 > Loop size: 5 > Estimated size after unrolling: 5 > > Then index 'ix' is considered to be a induction variable, but possibly it > cannot be simply folded since its used in other ways as well?
Well, but all the listed stmts _can_ be folded away because the starting value is zero and thus they "fold away" to constants. > Though completely-peeling loop resulted in longer code: > > > int i; > char _4; > unsigned int ivtmp_7; > char _12; > unsigned int ivtmp_15; > char _19; > unsigned int ivtmp_22; > char _26; > unsigned int ivtmp_29; > char _33; > unsigned int ivtmp_36; > char _40; > unsigned int ivtmp_43; > > <bb 2>: > _12 = 0; > data[0] = _12; > i_14 = 1; > ivtmp_15 = 5; > _19 = (char) i_14; > data[i_14] = _19; > i_21 = i_14 + 1; > ivtmp_22 = ivtmp_15 + 4294967295; > _26 = (char) i_21; > data[i_21] = _26; > i_28 = i_21 + 1; > ivtmp_29 = ivtmp_22 + 4294967295; > _33 = (char) i_28; > data[i_28] = _33; > i_35 = i_28 + 1; > ivtmp_36 = ivtmp_29 + 4294967295; > _40 = (char) i_35; > data[i_35] = _40; > i_42 = i_35 + 1; > ivtmp_43 = ivtmp_36 + 4294967295; > _4 = (char) i_42; > data[i_42] = _4; > i_6 = i_42 + 1; > ivtmp_7 = ivtmp_43 + 4294967295; > return; > > > instead of original and shorter > > int i; > unsigned int ivtmp_1; > char _4; > unsigned int ivtmp_7; > > <bb 2>: > > <bb 3>: > # i_9 = PHI <i_6(4), 0(2)> > # ivtmp_1 = PHI <ivtmp_7(4), 6(2)> > _4 = (char) i_9; > data[i_9] = _4; > i_6 = i_9 + 1; > ivtmp_7 = ivtmp_1 - 1; > if (ivtmp_7 != 0) > goto <bb 4>; > else > goto <bb 5>; > > <bb 4>: > goto <bb 3>; > > <bb 5>: > return; The cost metrics assume constant propagation happened which in the above code didn't yet: > i_14 = 1; > _19 = (char) i_14; ... So you are comparing (visually) apples and oranges. > > Example for ARM target machine code it became longer: > > 0000001c <test_iter_6>: > 1c: e59f3030 ldr r3, [pc, #48] ; 54 <test_iter_6+0x38> > 20: e3a02000 mov r2, #0 > 24: e5c32000 strb r2, [r3] > 28: e3a02001 mov r2, #1 > 2c: e5c32001 strb r2, [r3, #1] > 30: e3a02002 mov r2, #2 > 34: e5c32002 strb r2, [r3, #2] > 38: e3a02003 mov r2, #3 > 3c: e5c32003 strb r2, [r3, #3] > 40: e3a02004 mov r2, #4 > 44: e5c32004 strb r2, [r3, #4] > 48: e3a02005 mov r2, #5 > 4c: e5c32005 strb r2, [r3, #5] > 50: e12fff1e bx lr > 54: 00000000 .word 0x00000000 > > compared to if complete-loop-peeling was not done > > 0000001c <test_iter_6>: > 1c: e59f2014 ldr r2, [pc, #20] ; 38 <test_iter_6+0x1c> > 20: e3a03000 mov r3, #0 > 24: e7c33002 strb r3, [r3, r2] > 28: e2833001 add r3, r3, #1 > 2c: e3530006 cmp r3, #6 > 30: 1afffffb bne 24 <test_iter_6+0x8> > 34: e12fff1e bx lr > 38: 00000000 .word 0x00000000 > > Producing 15 instead of 8 words, giving ~100% larger code size. > > And same for x86 arch: > > f: c6 05 00 00 00 00 00 movb $0x0,0x0(%rip) # 16 > <test_iter_6+0x7> > 16: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 1d > <test_iter_6+0xe> > 1d: c6 05 00 00 00 00 02 movb $0x2,0x0(%rip) # 24 > <test_iter_6+0x15> > 24: c6 05 00 00 00 00 03 movb $0x3,0x0(%rip) # 2b > <test_iter_6+0x1c> > 2b: c6 05 00 00 00 00 04 movb $0x4,0x0(%rip) # 32 > <test_iter_6+0x23> > 32: c6 05 00 00 00 00 05 movb $0x5,0x0(%rip) # 39 > <test_iter_6+0x2a> > 39: c3 retq > > > Without complete-loop-peeling done: > > f: 31 c0 xor %eax,%eax > 11: 88 80 00 00 00 00 mov %al,0x0(%rax) > 17: 48 ff c0 inc %rax > 1a: 48 83 f8 06 cmp $0x6,%rax > 1e: 75 f1 jne 11 <test_iter_6+0x2> > 20: c3 retq > > > Producing 43 instead of 18 bytes, ~140% bigger code size. > > Should a variable be considered to be induction variable also when its used > both as LHS array index and RHS data value in same loop? > Since its cross-target for both ARM and x86, does it origin in some cost > estimation on how 'ix' or other induction variables will be folded or not in > the final target code? Yes, the "heuristic" is off in some cases (it adds some optimistic factor of followup optimization opportunities IIRC). But not for the reason you think. Richard. > Thanks, Kind Regards, > Fredrik >