On Tue, Dec 22, 2009 at 3:46 PM, Laurent Desnogues
<laurent.desnog...@gmail.com> wrote:
> On Tue, Dec 22, 2009 at 1:02 AM, Richard Henderson <r...@twiddle.net> wrote:
>> On 12/21/2009 03:08 PM, Laurent Desnogues wrote:
>>>
>>> If you wanted to use movcond, you'd have to make
>>> cond + move a special case...
>>
>> You'd certainly want the ARM front-end to use movcond more often than that.
>> For instance:
>>
>>   addeq r1,r2,r3
>> -->
>>   add_i32 tmp,r2,r3
>>   movcond_i32 r1,ZF,0,tmp,r1,eq
>>
>> You'd want to continue to use a branch around if the instruction has side
>> effects like cpu fault (e.g. load, store) or updating flags.
>>
>> It ought not be very hard to arrange for something like
>>
>>   if (cond != 0xe) {
>>       if (may_use_movcond(insn)) {
>>           s->condlabel = -1;
>>           /* Save the true destination register.  */
>>           s->conddest = cpu_R[dest];
>>           /* Implement the instruction into a temporary.  */
>>           cpu_R[dest] = tcg_temp_new();
>>       } else {
>>           s->condlabel = gen_new_label();
>>           ArmConditional cmp = gen_test_cc(cond ^ 1);
>>           tcg_gen_brcondi_i32(cmp.cond, cmp.reg, 0, s->condlabel);
>>       }
>>       s->condjmp = 1;
>>   }
>>
>>   // ... implement the instruction as we currently do.
>>
>>   if (s->condjmp) {
>>       if (s->condlabel == -1) {
>>           /* Conditionally move the temporary result into the
>>              true destination register.  */
>>           ArmConditional cmp = gen_test_cc(cond);
>>           tcg_gen_movcond_i32(cmp.cond, s->conddest, cmp.reg, 0,
>>                               cpu_R[dest], s->conddest);
>>           tcg_temp_free(cpu_R[dest]);
>>           /* Restore the true destination register.  */
>>           cpu_R[dest] = s->conddest;
>>       } else {
>>           tcg_set_label(d->condlabel);
>>       }
>>   }
>
> I agree, that looks nice.  But I'll let you dig into ARM instruction
> encoding and see how to implement may_use_movcond and
> getting the correct dest to save is not that cheap (and before
> you get back to me, yes, you could only consider a small
> subset of the instructions for which you want to do that :-).
>
> There's a point I have kept on insisting on that you keep on
> not answering :-)  How does all of that perform in practice?
> We can discuss forever, as long as it isn't measured, we are
> just guessing.

So I did measure it.  Your code isn't correct:  you can't replace
the dest reg with a new temp, since that would break instructions
that also read the destination register, such as:

   addeq r0,r0,#1

All conditional data processing instructions are using movcond.

Note my version of QEMU is ARM specific and contains several
things that aren't in mainline:
   - Aurelien's TCG optimizations (constant propagation and
     copy analysis)
   - lazy block context flags update
   - no temp flush on ld/st
   - most helpers for non-SIMD/VFP instructions are replaced
     with TCG code (using setcond for flag setting)
   - no signal handling.
This version of qemu-arm is about 2x faster than mainline.

Env:
   - HW:    E6400
   - OS:    CentOS 5.4 64-bit
   - gcc:   4.1.2
   - bench: SPEC2k gcc with expr.i input set

With movcond:

Translation buffer state:
gen code size       4262752/33449984
TB count            35084/524288
TB avg target size  18 max=592 bytes
TB avg host size    121 bytes (expansion ratio: 6.7)
cross page TB count 0 (0%)
direct jump count   20388 (58%) (2 jumps=17211 49%)

Statistics:
TB invalidate count 0
JIT cycles          628700508 (0.262 s at 2.4 GHz)
translated TBs      35084
avg ops/TB          28.2 max=554
deleted ops/TB      5.09 (178672)
avg temps/TB        28.88 max=54
total in TB size    639924 avg 18.2
total out TB size   4009329 avg 114.3
cycles/op           635.1
cycles/in byte      982.5
cycles/out byte     156.8
  gen_interm time   13.2%
  gen_code time     86.8%
const/code time     9.9%
liveness/code time  13.9%

real    0m15.944s
user    0m15.512s
sys     0m0.070s

Without movcond:

Translation buffer state:
gen code size       4308640/33449984
TB count            35093/524288
TB avg target size  18 max=592 bytes
TB avg host size    122 bytes (expansion ratio: 6.7)
cross page TB count 0 (0%)
direct jump count   20388 (58%) (2 jumps=17211 49%)

Statistics:
TB invalidate count 0
JIT cycles          673085430 (0.280 s at 2.4 GHz)
translated TBs      35093
avg ops/TB          27.9 max=556
deleted ops/TB      4.79 (168080)
avg temps/TB        28.77 max=34
total in TB size    640804 avg 18.3
total out TB size   4056125 avg 115.6
cycles/op           686.8
cycles/in byte      1050.4
cycles/out byte     165.9
  gen_interm time   12.8%
  gen_code time     87.2%
const/code time     9.3%
liveness/code time  11.5%

real    0m15.974s
user    0m15.586s
sys     0m0.060s

Notes:
   - the change in the number of TBs is expected for gcc since it
     outputs timing stats at the end
   - the cycles spent in TCG are very inaccurate (I never found
     them very useful...)
   - I slightly changed the movcond x86_64 generation not to
     generate a useless mov when dest = vtrue = vfalse, which
     happened at least once (due to copy analysis).

All in all not much gain.

For the sake of completeness some stats about movcond usage:

number of TB with movcond        : 2919
max number of movcond in a TB    : 24
total number of movcond generated: 5738
total number of movcond executed : 163892368

Of course that's a single data point, but one that is spending a
rather big percentage of its time in generated code; oprofile
output:

21403    60.4570  anon (tgid:17348 range:0x601be000-0x621bf000)  qemu-arm-32  anon (tgid:17348 range:0x601be000-0x621bf000)
6751     19.0695  qemu-arm-32        qemu-arm-32  cpu_arm_exec
2230      6.2991  anon (tgid:17348 range:0x6224a000-0x6224b000)  qemu-arm-32  anon (tgid:17348 range:0x6224a000-0x6224b000)
303       0.8559  qemu-arm-32        qemu-arm-32  tcg_gen_code
65        0.1836  qemu-arm-32        qemu-arm-32  cpu_loop
61        0.1723  qemu-arm-32        qemu-arm-32  page_check_range
53        0.1497  libpthread-2.5.so  qemu-arm-32  __pthread_cleanup_upto
49        0.1384  qemu-arm-32        qemu-arm-32  temp_save
36        0.1017  qemu-arm-32        qemu-arm-32  gen_intermediate_code

Doesn't look encouraging, but I like the reduction in generated
code size.

Of course before drawing any conclusion, we need more
measurements, especially one with some QEMU system emulation.


Laurent
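
P.S. To make the addeq r0,r0,#1 objection concrete, here is a minimal sketch in plain C.  It is a toy model and not QEMU code: ToyCpu, movcond() and the three addeq_imm_* functions are hypothetical names made up for illustration, and the fresh temporary is modelled as holding zero (a real temp would simply not hold the old register value).  It contrasts the movcond lowering, the branch-around lowering, and the variant where the destination slot is swapped for a temp before the instruction body runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not QEMU code: 16 guest registers and a zero flag. */
enum { NUM_REGS = 16 };

typedef struct {
    uint32_t r[NUM_REGS];
    bool zf;                /* ARM Z flag: "eq" holds when zf is set */
} ToyCpu;

/* movcond semantics: the result is vtrue if cond holds, else vfalse. */
static uint32_t movcond(bool cond, uint32_t vtrue, uint32_t vfalse)
{
    return cond ? vtrue : vfalse;
}

/* addeq rd, rn, #imm lowered as "compute into a temp, then movcond".
   The source is read from the real register, so rd == rn is fine. */
static void addeq_imm_movcond(ToyCpu *cpu, int rd, int rn, uint32_t imm)
{
    assert(rd >= 0 && rd < NUM_REGS && rn >= 0 && rn < NUM_REGS);
    uint32_t tmp = cpu->r[rn] + imm;
    cpu->r[rd] = movcond(cpu->zf, tmp, cpu->r[rd]);
}

/* The same instruction lowered as a branch around the add. */
static void addeq_imm_branch(ToyCpu *cpu, int rd, int rn, uint32_t imm)
{
    assert(rd >= 0 && rd < NUM_REGS && rn >= 0 && rn < NUM_REGS);
    if (!cpu->zf) {
        return;             /* condition false: skip the instruction */
    }
    cpu->r[rd] = cpu->r[rn] + imm;
}

/* Broken variant: the rd slot is replaced by a fresh temp before the
   instruction body runs, so a source operand naming rd reads the temp
   (assumed zero here) instead of the old register value. */
static void addeq_imm_aliased_dest(ToyCpu *cpu, int rd, int rn, uint32_t imm)
{
    assert(rd >= 0 && rd < NUM_REGS && rn >= 0 && rn < NUM_REGS);
    uint32_t saved = cpu->r[rd];        /* plays the role of conddest */
    cpu->r[rd] = 0;                     /* fresh temp stands in for rd */
    uint32_t tmp = cpu->r[rn] + imm;    /* wrong whenever rn == rd */
    cpu->r[rd] = movcond(cpu->zf, tmp, saved);
}
```

With r0 = 41 and the Z flag set, the first two functions leave r0 at 42 for rd = rn = 0, while the aliased-dest variant produces 1 under the zero-temp assumption.  When rd and rn differ, all three agree.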