On 10 October 2012 15:45, Aurelien Jarno wrote:
On Wed, Oct 10, 2012 at 01:17:36PM +0900, Yeongkyoon Lee wrote:
On 10 October 2012 02:09, Aurelien Jarno wrote:
On Tue, Oct 09, 2012 at 06:55:58PM +0200, Paolo Bonzini wrote:
On 09/10/2012 18:19, Aurelien Jarno wrote:
Instead of calling the MMU helper with an additional argument (7) and
then jumping back (8) to the next code (4), what about pushing the address
of the next code (4) on the stack and using a jmp instead of the call? In
that case you don't need the extra argument to the helpers.
Maybe it wasn't very clear. This is based on the fact that call is
basically push %rip + jmp. Therefore we can fake the return address by
putting the value we want, here the address of the next code. This means
that we don't need to pass the extra argument to the helper for the
return address, as GETPC() would work correctly (it basically reads the
return address on the stack).
For other architectures, it might not be a push, but rather a move to
link register, basically put the return address where the calling
convention asks for.
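
To illustrate the "call is basically push + jmp" point, here is a toy model in C. It is only a sketch: struct toy_cpu and the toy_* functions are made up for this example and are not QEMU code. It shows that a helper reading the top of the stack sees the same return address whether it was reached by a real call or by a push of a chosen address followed by a jmp.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_STACK_SLOTS = 8 };

/* Hypothetical miniature CPU: a program counter and a small stack. */
struct toy_cpu {
    size_t stack[TOY_STACK_SLOTS];
    int sp;      /* number of used stack slots */
    size_t pc;   /* address of the current instruction */
};

static void toy_push(struct toy_cpu *c, size_t value)
{
    assert(c->sp < TOY_STACK_SLOTS);
    c->stack[c->sp++] = value;
}

/* "call target": push the address of the next instruction, then jump. */
static void toy_call(struct toy_cpu *c, size_t insn_len, size_t target)
{
    toy_push(c, c->pc + insn_len);
    c->pc = target;
}

/* "push addr; jmp target": same effect, but the return address is chosen. */
static void toy_push_jmp(struct toy_cpu *c, size_t fake_ret, size_t target)
{
    toy_push(c, fake_ret);
    c->pc = target;
}

/* What a GETPC()-like helper would observe: the return address on the stack. */
static size_t toy_getpc(const struct toy_cpu *c)
{
    assert(c->sp > 0);
    return c->stack[c->sp - 1];
}

/* "ret": pop the return address into the program counter. */
static void toy_ret(struct toy_cpu *c)
{
    assert(c->sp > 0);
    c->pc = c->stack[--c->sp];
}
```

In this model a slow path located elsewhere can call toy_push_jmp() with the address of the code following the fast path, and the helper both reports and returns to that address, so no extra argument is needed.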
OTOH I just realized it only works if the end of the slow path (moving
the value from the return address to the correct register) is shared
with the fast path. It might be something doable.
Branch predictors will not like old-school tricks like this one. :)
Given it is only used in the slow path (i.e. the exception more than the
rule), branch prediction isn't that important there.
I had already considered the approach of using jmp and removing the
extra argument for the helper call.
However, the problem is that the helper needs the generated code address
used by tb_find_pc() and cpu_restore_state().
That means the code address the helper needs is actually the address
corresponding to the QEMU_ld/st IR op rather than the return address.
In my LDST optimization, the helper call site is not in the code
generated for the IR op but at the end of the TB.
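
The concern can be seen in a small C sketch, assuming a GETPC()-like mechanism built on the GCC/Clang builtin __builtin_return_address(). The names toy_helper_getpc and toy_helper_calls are invented for this illustration. A helper that identifies its caller this way sees wherever the call instruction physically is, not where the operation logically belongs.

```c
#include <assert.h>
#include <stddef.h>

/* Volatile counter: a side effect so the compiler cannot merge two calls. */
static volatile unsigned toy_helper_calls;

/*
 * Hypothetical helper: returns the address it will return to, which is
 * the instruction right after the call that reached it.
 * noinline keeps a real call/return sequence in place.
 */
__attribute__((noinline))
void *toy_helper_getpc(void)
{
    toy_helper_calls++;
    void *ret = __builtin_return_address(0);
    assert(ret != NULL);
    return ret;
}
```

Two separate call sites get two different values back, so a call emitted at the end of the TB would report an address there unless the return address is set up by hand.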
GETPC() uses the return address to determine the call site, and as long
as the code at the end of the TB sets a return address corresponding to
the one of the fast path instructions, tb_find_pc() will be able to find
the correct instruction.
That implies that at least one instruction at the end of the generated
code is shared between the slow path and the fast path, but on the other
hand it avoids having two different kinds of MMU helpers.
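
A toy lookup shows why the return address has to point into the fast-path code. This is a simplified sketch and not how tb_find_pc() or cpu_restore_state() are actually written: struct toy_tb, its fields and toy_find_insn() are assumptions made for the example. It maps a host-code offset to the guest instruction whose generated code contains it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-TB table: host_start[i] is the host-code offset where
 * the code for guest instruction i begins (sorted ascending). host_size
 * is the total size of the generated code, including any slow-path
 * stubs placed at the end.
 */
struct toy_tb {
    const uint32_t *host_start;
    size_t n_insns;
    uint32_t host_size;
};

/* Return the index of the guest instruction covering ret_off, or -1. */
static int toy_find_insn(const struct toy_tb *tb, uint32_t ret_off)
{
    assert(tb != NULL);
    if (tb->n_insns == 0 || ret_off < tb->host_start[0] ||
        ret_off >= tb->host_size) {
        return -1;
    }
    /* Binary search; invariant: host_start[lo] <= ret_off. */
    size_t lo = 0, hi = tb->n_insns;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (tb->host_start[mid] <= ret_off) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return (int)lo;
}
```

With instructions starting at offsets 0, 40 and 90 and stubs occupying the tail of a 160-byte block, an offset of 60 resolves to instruction 1, while an offset of 150 inside a stub resolves to the last instruction regardless of which load or store the stub belongs to.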
How about a nop instruction at the end of the fast path as the return
address of the helper?
That means changing "call helper" to "push addr of nop" and "jmp
helper".
Although I need to check the feasibility, it is expected to avoid helper
fragmentation and to keep performance degradation to a minimum.