On Sun, Jan 09, 2011 at 03:24:44PM -0800, Richard Henderson wrote:
> Something I noticed while working on that deposit patch for ia64
> was how annoying it is to bundle instructions by hand. And the
> fact that it's been done incorrectly at least once. E.g.
>
> static inline void tcg_out_bswap64(TCGContext *s, TCGArg ret, TCGArg arg)
> {
>     tcg_out_bundle(s, mII,
>                    tcg_opc_m48(TCG_REG_P0, OPC_NOP_M48, 0),
>                    tcg_opc_i18(TCG_REG_P0, OPC_NOP_I18, 0),
>                    tcg_opc_i3 (TCG_REG_P0, OPC_MUX1_I3, ret, arg, 0xb));
>
> Notice that there's an unnecessary stop bit in there.
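To make the terms concrete: "bundling by hand" means that every call site,
like the one quoted above, picks an IA-64 template itself and fills all three
slots, nops included. "Bundling automatically" means something along the lines
of the sketch below -- every name in it is made up for illustration, nothing
comes from Richard's patch or from tcg/ia64/tcg-target.c. Slot instructions
are queued together with their execution-unit class, and a template is only
chosen when a whole bundle is emitted:

/* Minimal sketch of an automatic bundler.  All names are hypothetical;
 * only two unit classes and two templates exist in this toy. */

#include <stdint.h>
#include <stdio.h>

typedef enum { UNIT_M, UNIT_I } unit_t;

typedef struct {
    unit_t   unit[3];   /* execution-unit class of each queued slot   */
    uint64_t insn[3];   /* 41-bit slot encodings in the real thing    */
    int      nslots;
} bundle_state;

/* Toy template choice: real IA-64 templates all start with an M slot,
 * so the caller is expected to queue an M-unit insn (or an M nop) first. */
static const char *pick_template(const bundle_state *b)
{
    return b->unit[1] == UNIT_M ? "MMI" : "MII";
}

static void bundle_flush(bundle_state *b)
{
    if (b->nslots == 0) {
        return;
    }
    /* Pad the missing slots with nops; a real bundler would also try to
     * pull slots from the next op instead of padding. */
    while (b->nslots < 3) {
        b->unit[b->nslots] = UNIT_I;
        b->insn[b->nslots] = 0;              /* stands in for nop.i */
        b->nslots++;
    }
    printf("emit %s bundle: %#llx %#llx %#llx\n", pick_template(b),
           (unsigned long long)b->insn[0],
           (unsigned long long)b->insn[1],
           (unsigned long long)b->insn[2]);
    b->nslots = 0;
}

/* Queue one slot instruction; a bundle is emitted once three are pending. */
static void bundle_push(bundle_state *b, unit_t unit, uint64_t insn)
{
    b->unit[b->nslots] = unit;
    b->insn[b->nslots] = insn;
    if (++b->nslots == 3) {
        bundle_flush(b);
    }
}

int main(void)
{
    bundle_state b = { .nslots = 0 };

    bundle_push(&b, UNIT_M, 0x111);   /* e.g. a load emitted by one op  */
    bundle_push(&b, UNIT_I, 0x222);   /* e.g. the mux1 of a bswap       */
    bundle_push(&b, UNIT_I, 0x333);   /* e.g. a shift from the next op  */
    bundle_flush(&b);                 /* flush a partial bundle, if any */
    return 0;
}

With such an interface, the mux1 of the bswap64 above could in principle
share its bundle with whatever the neighbouring ops emit, instead of being
padded with two nops.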
Agreed, this is prone to errors.

> This patch does too much all at once, I'll admit that. But
> before I bother going back to split it into smaller pieces, I
> wanted to get some feedback -- including whether it would be
> considered at all.
>
> Some statistics that I gleaned from grepping -d out_asm. Both before
> and after the rewrite, the only two linux-user targets that work are
> i386 and alpha. Clearly there are still bugs to fix, but it's not
> necessarily a regression.

The problem with linux-user is that ia64 has 16kiB pages, and we don't
really support this when emulating guests with 4kiB pages.

>               Code size            Stop bits
>             old      new  change    old    new  change
> i386/ls   2309712  1602736  -31%   215520  98698  -54%
> alpha/ls  1088352   817888  -25%   106215  48440  -54%
>
> On the system side, I am able to boot the arm-test kernel. But
> it's a bit harder to get repeatable statistics there. I assume
> that's all timing issues, and what code gets run when, and how
> that affects the TB cache.

When writing the IA64 backend, I also started trying to bundle the
instructions automatically, but stopped because it was getting nowhere.
It's an ambitious project, but if you can get it working, I am sure it
will really improve QEMU on IA64.

The problems that need to be taken into account when bundling the code
automatically are mostly due to partial code retranslation (look at
tcg_gen_code_search_pc()):
- retranslating only a part of the ops (always starting at the beginning
  of a TB, but not up to the end) should not change the already generated
  code, even temporarily.
- it should be possible to identify the op that called a helper or the
  softmmu code with __builtin_return_address(), and thus which part of
  the bundle the call came from. s->code_ptr should therefore be
  incremented in step with the addresses __builtin_return_address() will
  return.
A sketch of an op loop that satisfies both constraints is appended at
the end of this mail.

Other than that, it's always a trade-off between the complexity of the
code generator and the compactness of the generated code. It isn't worth
generating 20% more optimized code if it takes 10 times longer to
generate it. The problem is that you can't really know before trying, at
least partially.

I have tried to measure the speed improvement on the i386 target, but it
triggers a tcg_abort() while booting the kernel, so it is not yet
possible to draw conclusions on the speed side.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurel...@aurel32.net                 http://www.aurel32.net
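The sketch referred to above (all names made up, none of this is QEMU
code): an op loop that satisfies both constraints by flushing the pending
bundle state at every op boundary, so that a partial retranslation
reproduces the already generated code byte for byte, and every op owns a
contiguous range of s->code_ptr that a return address can be matched
against.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t *code_ptr; } Ctx;   /* stand-in for TCGContext  */
typedef struct { int nslots; } Bundler;      /* pending slots, cf. above */
typedef struct { int opc; } Op;              /* stand-in for a TCG op    */

static void bundler_flush(Ctx *s, Bundler *b)
{
    if (b->nslots) {
        s->code_ptr += 16;        /* one nop-padded 16-byte bundle written */
        b->nslots = 0;
    }
}

static void emit_op(Ctx *s, Bundler *b, const Op *op)
{
    (void)s; (void)op;
    b->nslots += 1;               /* pretend each op queues one slot insn */
}

/* Returns the index of the op that covers search_ptr, or nops if it was
 * not reached.  search_ptr == NULL means a normal, full translation;
 * otherwise it is the host PC obtained via __builtin_return_address()
 * from the helper, and this call is the partial retranslation. */
size_t gen_code(Ctx *s, Bundler *b, const Op *ops, size_t nops,
                const uint8_t *search_ptr)
{
    for (size_t i = 0; i < nops; i++) {
        emit_op(s, b, &ops[i]);
        /* Flush at the op boundary, never in the middle of an op: the
         * bytes emitted for ops 0..i then no longer depend on the ops
         * that follow, so stopping the retranslation here cannot change
         * the already generated code. */
        bundler_flush(s, b);
        if (search_ptr && search_ptr < s->code_ptr) {
            return i;
        }
    }
    return nops;
}

Flushing at every op boundary is the simplest way to get both properties,
but it gives up exactly the cross-op packing that makes the exercise
worthwhile; letting slots stay pending across ops is possible, but then
that pending state itself has to be rebuilt deterministically by the
partial retranslation, which is where the complexity/compactness trade-off
starts to bite.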