On 2016-06-23 20:48, Richard Henderson wrote:
> Rather than rely on recursion during the middle of register allocation,
> lower indirect registers to loads and stores off the indirect base into
> plain temps.
>
> For an x86_64 host, with sufficient registers, this results in identical
> code, modulo the actual register assignments.
>
> For an i686 host, with insufficient registers, this means that temps can
> be (temporarily) spilled to the stack in order to satisfy an allocation.
> This as opposed to the possibility of not being able to spill, to allocate
> a register for the indirect base, in order to perform a spill.
>
> Signed-off-by: Richard Henderson <r...@twiddle.net>
> ---
>  include/qemu/log.h |   1 +
>  tcg/optimize.c     |  31 +-----
>  tcg/tcg.c          | 306 +++++++++++++++++++++++++++++++++++++++++++----------
>  tcg/tcg.h          |   4 +
>  util/log.c         |   5 +-
>  5 files changed, 263 insertions(+), 84 deletions(-)
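
As a rough illustration of the lowering described in the commit message, here is a toy model in Python. It is not the tcg.c code: the op tuples, the `INDIRECT` table and `lower_indirect` are all made up for this sketch, and it ignores basic-block boundaries, calls and liveness information entirely. It only shows the idea of replacing each indirect register with a plain temp plus explicit loads and stores off the base.

```python
# Toy model only: an op is (opcode, output, inputs). These are not the real
# TCG data structures.

# Hypothetical table: indirect register -> (base register, offset).
INDIRECT = {"g1": ("base", 8), "g2": ("base", 16)}


def lower_indirect(ops):
    """Forward pass: rewrite indirect registers into plain temps + ld/st."""
    out = []
    loaded = {}  # indirect register -> plain temp currently holding its value
    for opc, dst, srcs in ops:
        new_srcs = []
        for s in srcs:
            if s in INDIRECT:
                if s not in loaded:
                    # First use: load the value off the indirect base.
                    base, off = INDIRECT[s]
                    tmp = "t_" + s
                    out.append(("ld", tmp, (base, off)))
                    loaded[s] = tmp
                s = loaded[s]
            new_srcs.append(s)
        if dst in INDIRECT:
            # Write to a plain temp, then store it back off the base.
            base, off = INDIRECT[dst]
            tmp = "t_" + dst
            out.append((opc, tmp, tuple(new_srcs)))
            out.append(("st", None, (tmp, base, off)))
            loaded[dst] = tmp
        else:
            out.append((opc, dst, tuple(new_srcs)))
    return out


ops = [
    ("add", "g1", ("g1", "g2")),
    ("mov", "g2", ("g1",)),
]
lowered = lower_indirect(ops)
for op in lowered:
    print(op)
```

After this pass the register allocator only sees plain temps, which it can spill like any other temp. The naive version stores after every write; deciding which loads and stores are actually needed is where liveness information would come in.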

This patch is a difficult one to review...

On the purely technical side it does what it is supposed to do and I
haven't found any issue, though it's probably very easy to miss one in
this kind of code. I have done tests with various sparc images and I
haven't found any obvious regression on an x86_64 host.

Now on the less technical side, I really like the idea of being able to
transform the TCG instruction stream more or less in place. Your recent
patches in that direction are great.

That said, I am a bit worried that we loop many times over the various
ops. We used to have one forward pass (optimizer) and one backward pass
(liveness analysis). Your patch adds up to two additional passes (one
forward and one backward), and this clearly has a cost. Given that
indirect registers bring a lot of performance, I think it is worth it.

Now I wonder if there is any way to do the lowering of registers
earlier, I mean before the liveness analysis. This would probably
generate plenty of useless ops, but they would later be removed by the
liveness analysis. Maybe you have already tried that?

I think it also depends on which direction we want to go with TCG:
either plenty of small independent optimization passes, or a limited
number of passes, which means more complex code. Unlike a compiler, we
have to make a much more difficult trade-off between the optimization
time and the level of optimization.

Nevertheless I think it's the correct way to go forward for now, and
this patch fixes real issues on hosts with limited registers. Maybe just
add a note saying there *might* be better ways to do that.

Reviewed-by: Aurelien Jarno <aurel...@aurel32.net>

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net