Hi,

The second part is true for ARM as well. Cortex-A8 and A9 have an 8-entry
buffer for predicting return addresses. When a prediction fails, the whole
buffer is flushed (frequently), and this becomes a big performance loss,
since even those return addresses which were already on the stack can no
longer be predicted (better recovery logic is ineffective to implement in
hardware).

An example:

1) A calls B, B calls C, C calls D
2) D puts a different address into the link register and returns
3) this flushes the return predictor -> the return addresses of C, B, and A
also cannot be predicted anymore.

I agree with the second part, about moving big chunks of common code into
custom functions. This would basically be Macro Assembler based inline
assembly, and in practice would behave like a static function. Would it be
possible to generate the assembly for such functions at compile time and
link it into the code? That would reduce the initial startup time of jsc,
and the resulting code would probably be faster as well, although it goes
against the FMW (favourite magic word = complexity), since the macro
assembler would also have to generate assembly at build time.

Zoltan

> Hey Nathan,
>
> Absolutely - this is effectively what we do for calls, where the bulk of the
> work of the slow case is performed by a shared routine.  We've
> experimented with using this technique more broadly in the early stages of
> developing the JIT, and back then there was a measurable performance
> degradation from the call overhead – but the code had changed a lot since
> then, and the tradeoffs and requirements may be different now
> (particularly across the varying hardware platforms the JIT has now been
> ported to).
>
> However returning to an offset to a return address is probably not a good
> plan.  Upon executing a call, processors commonly cache the return address
> of the call instruction in a circular buffer used to predict return
> destinations.  When it reaches the return instruction it pops a value from
> the return address stack to predict the destination of the return.  If you
> change the address you're going to get a mispredict and probably a pipe
> flush.  (We modify the return address in our exception handling path, but
> we don't expect exceptions to be high performance).  However bear in mind
> that there is no conditional call instruction on x86, so to eliminate the
> slow path altogether you'd have to litter the hot path with inverted
> branches over the calls out to the trampolines.  I'd suggest you'd be more
> likely to find success in keeping the hot path branching out to slow
> cases, and experiment with moving the bulk of the work of larger slow
> cases out into shared routines (which would be being called from the slow
> case).
>
> This is certainly an interesting area to investigate.
>
> cheers,
> G.
>
>
> On Jun 25, 2010, at 4:20 PM, Nathan Lawrence wrote:
>
>> The size of our JIT-generated code is a known memory issue.  According
>> to Oliver the slow cases for some of our operations are on the order of
>> 128 bytes.  It occurred to me that we could reduce the JITed code by
>> only compiling the slow case once and having all of the subsequent
>> generated code jump to that specific slow case.  The issue with this is
>> our slow cases jump back to specific locations in the hot path, with
>> potentially different values on the stack, as opposed to a normal
>> function which returns back to a very specific state.  We can circumvent
>> this issue by hand writing the assembly to return to an offset of the
>> return address with the required state.
>>
>> What do people think?
>> -- Nathan
>> _______________________________________________
>> squirrelfish-dev mailing list
>> [email protected]
>> http://lists.webkit.org/mailman/listinfo.cgi/squirrelfish-dev
>

