Leopold Toetsch wrote:

Nicholas Clark wrote:

Inside a cgoto core have 1 extra op - enter JITted section.

Or go the other way round: Run from JIT. If there is a sequence of non JITable ops, convert these to a CGP section, which returns to JIT when finished. This would save a lot of function calls to jit_normal_op.

I have thought about this, with a little help from ddd:

1)
opcode_t *
cgp_core(opcode_t *cur_op, struct Parrot_Interp *interpreter)
{
#ifdef __GNUC__
register opcode_t *cur_opcode asm ("esi") = cur_op;
#else
opcode_t *cur_opcode = cur_op;
#endif

Even when compiled without optimization, this produces almost the same code quality and speed as -O3.

The cur_opcode is in %esi, and all operand access is done as in the posted -O3 example:
$ parrot -P mops.pbc
82.019517 M op/s

2) There is one new opcode:

B<jmp_to_eip inconst INT>

The argument is the native_ptr in the JIT code to which execution returns after a section of non-JITed code.

goto **(cur_opcode + 1);

The address is filled in by the JIT emit functions.
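To make the jump concrete, here is a toy sketch of such an op in a computed-goto loop. It is only an illustration under assumptions: run_toy_cgp, op_inc and back_in_jit are made-up names, the "native_ptr" is a label in the same function rather than real JIT code, and it relies on the GCC labels-as-values extension.

```c
/* Toy prederefed stream: every slot holds a label address.
 * Hypothetical sketch, not Parrot code. Needs GCC (or clang)
 * for the &&label / goto * extension. */
int run_toy_cgp(void)
{
    void *code[4];
    void **cur_opcode = code;
    int counter = 0;

    code[0] = &&op_inc;
    code[1] = &&op_inc;
    code[2] = &&op_jmp_to_eip;
    code[3] = &&back_in_jit;   /* stands in for the native_ptr argument */

    goto **cur_opcode;         /* dispatch the first op */

op_inc:
    counter++;
    cur_opcode++;
    goto **cur_opcode;         /* dispatch the next op */

op_jmp_to_eip:
    goto **(cur_opcode + 1);   /* leave the CGP section via the stored address */

back_in_jit:
    return counter;            /* "JIT code" continues here */
}
```

In the real scheme the slot after jmp_to_eip would hold an address inside the emitted machine code, not a C label.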

3) Parrot_jit_begin() emits code to call cgp_core (thus setting up the same stack frame as cgp_core) and to B<jmp_to_eip> back to the address after the function call.

4) When there is a sequence of more than one non-JITed op, Parrot_jit_normal_op emits code to calculate %esi (the *cur_opcode) in the prederefed jump table and *jumps* there. The section ends with the jmp_to_eip instruction described above. This implies that, after a non-JITed section, the following JITed section must be at least two opcodes long, to leave room to fill in this jump.
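The decision of which ops end up in a CGP section could be sketched roughly as follows. This is a hypothetical helper, not existing Parrot code: classify_sections, the section_kind enum and the jitable[] flag array are assumptions made for illustration, and the two-opcode room requirement for the jump is not modelled.

```c
#include <stddef.h>

/* How each op would be executed (hypothetical classification). */
enum section_kind {
    SECTION_JIT,   /* op has JIT code */
    SECTION_CALL,  /* lone non-JITed op: plain function call */
    SECTION_CGP    /* run of more than one non-JITed op: jump into cgp_core */
};

/* jitable[i] is nonzero if op i can be JITted.
 * Fills out[i] for every op and returns the number of CGP sections. */
size_t classify_sections(const int *jitable, size_t n, enum section_kind *out)
{
    size_t i = 0, k, start, cgp_sections = 0;
    enum section_kind kind;

    while (i < n) {
        if (jitable[i]) {
            out[i++] = SECTION_JIT;
            continue;
        }
        /* find the end of this run of non-JITed ops */
        start = i;
        while (i < n && !jitable[i])
            i++;
        kind = (i - start > 1) ? SECTION_CGP : SECTION_CALL;
        if (kind == SECTION_CGP)
            cgp_sections++;
        for (k = start; k < i; k++)
            out[k] = kind;
    }
    return cgp_sections;
}
```

For the flags {1, 0, 1, 0, 0, 0, 1} this marks op 1 as a plain call and ops 3 to 5 as one CGP section.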

5) non JITed branches do not fit very nicely in this scheme, but there are several possible ways to handle these:
- make all branches JITted
- generate another core (cgp_jit_core), which does the right thing
- always emit code by Parrot_jit_cpcf_op() for the last opcode in the section (cgp_core would only be used if the nonJITed section is >= 3 instructions)
- if both ends of the branch are non JITed sections do nothing, just stay in the cgp_core and patch the ends of these branches to return to JIT (brrr)

6) and finally the prederefed B<end> opcode gets jumped to by code emitted from Parrot_end_jit, to clean up the cgp_core stack frame and return from JIT.

So we would not have any function call overhead and would get the best performance by combining the two fastest run cores.

This approach would of course need some architecture/compiler specific hacks, but JIT is such a hack anyway. OTOH it is almost totally encapsulated in the architecture jit file, so it *can* be implemented but there is no need to do so.

Comments welcome,
leo


