On Jun 10, 2009, at 1:15 PM, Toshiyasu Morita wrote:

--- On Wed, 6/10/09, Geoffrey Garen <[email protected]> wrote:

> I'm having a hard time understanding from your comment what optimization changes you think are appropriate, but if you can produce a patch that implements your idea, and shows a benefit on a benchmark, I'd be happy to review it.

Consider something like op_call.

This expands out to 95 inline instructions on the MIPS for just the slow case alone, of which 3 are function calls to other functions. So this probably requires thousands of clock cycles to execute.

IMHO it doesn't make sense to inline op_call because:

[ I'm sorry, I've been away from a net connection, I may be replicating a couple of things ggaren & olliej have already said. ]

Okay! First up, have you tried turning off ENABLE_JIT_OPTIMIZE_CALL? If you do so, it should address the majority of your concerns, below (specifically, reducing code size, and removing the need for op_call to patch generated code).

Of course, we added the call optimizations because we measured them as a significant performance improvement, but feel free to test whether this is true on your platform, and once the MIPS JIT is in the tree we'd be happy to consider changes to the optimized mode that aid MIPS performance.

1. It's a huge amount of JIT code just to save three or four instructions at runtime (call, return, and maybe some register shuffling).

2. The code which is executed is thousands of instructions and saving three or four instructions is a microscopic net win.

4. It makes the generated machine code MUCH larger because instead of having one copy of this function that is written in C/C++ and statically compiled, there are multiple copies of this code for every instance of op_call, which makes the instruction cache much less effective.

I think it's worth making sure you understand the optimization here. The majority of calls can be optimized and, once optimized, run only the sequence of instructions planted in the main generation pass. This code path is only a handful of instructions long, and introducing an extra call and return onto this path would almost certainly degrade performance (feel free to try doing so, and please do submit any patches that provide a memory saving without significantly degrading performance). For such a short and performance-critical fragment of code it clearly could make sense to tweak the code for specific platforms, and it may well provide a significant performance benefit to do so. We should certainly consider such patches.
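
As a rough illustration of the call linking described above (a sketch only, not JavaScriptCore's code): a call site remembers the last callee it resolved, so repeat calls take a short check-and-jump path and only the first call pays for the slow case. The names `Function`, `CallSite` and `slowCalls` are invented for this example, and a real JIT would patch this state into the generated machine code rather than store it in fields.

```cpp
#include <cassert>

// Hypothetical stand-in for a callable with compiled code attached.
struct Function {
    int (*code)(int);
};

// Two sample callees used only to exercise the sketch.
int addOne(int x) { return x + 1; }
int twice(int x) { return x * 2; }

// One record per call site. In generated code the cached callee and the
// jump target would be patched into the instruction stream.
struct CallSite {
    const Function* cachedCallee = nullptr;
    int (*cachedCode)(int) = nullptr;
    int slowCalls = 0;  // counts how often the slow case ran

    int call(const Function& callee, int arg) {
        // Fast path: one comparison, then a direct call to the linked code.
        if (&callee == cachedCallee)
            return cachedCode(arg);

        // Slow case: resolve the callee, link the site, then call.
        assert(callee.code != nullptr);
        ++slowCalls;
        cachedCallee = &callee;
        cachedCode = callee.code;
        return cachedCode(arg);
    }
};
```

Calling the same `Function` through one `CallSite` repeatedly runs the slow case once; calling a different `Function` through that site relinks it.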

The slow case JIT code is much longer, and less frequently executed. Introducing a call and return here to share code between calls definitely makes sense. You can tell we think so: the JIT already works this way! The slow cases call out to a set of shared trampolines generated in privateCompileCTIMachineTrampolines. This is, however, a work in progress, and we are currently still clearly generating far more code than we should be in the slow cases. More work should be done to unify the pre-linked and post-link slow case states, and to move work into the trampolines (this is something I may be looking at again fairly soon).
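
To put rough numbers on the code-size argument, here is a back-of-the-envelope model. Only the figure of 95 slow-case instructions comes from this thread; the fast-path and stub sizes used below are assumptions picked for illustration, and `CodeSizeModel`, `inlinedTotal` and `sharedTotal` are made-up names.

```cpp
#include <cassert>
#include <cstddef>

// Instruction counts for one op_call site (all but slowPath are assumed).
struct CodeSizeModel {
    std::size_t fastPath;  // hot path planted at every call site
    std::size_t slowPath;  // slow-case body (95 on MIPS, per the thread)
    std::size_t stub;      // per-site code that calls a shared trampoline
};

// Every call site carries its own copy of the slow case.
inline std::size_t inlinedTotal(const CodeSizeModel& m, std::size_t sites) {
    return sites * (m.fastPath + m.slowPath);
}

// Every call site carries a small stub; the slow case exists once.
inline std::size_t sharedTotal(const CodeSizeModel& m, std::size_t sites) {
    return sites * (m.fastPath + m.stub) + m.slowPath;
}
```

With an assumed 10-instruction fast path and 4-instruction stub, 100 call sites cost 100 * (10 + 95) = 10500 instructions inlined versus 100 * (10 + 4) + 95 = 1495 shared, while the fast path itself is unchanged.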

It is certainly valid to question whether the work performed by the machine trampolines is better in JIT generated code, or in C++ code that the compiler can optimize. In the early stages of its development the JIT was more a context threaded interpreter, calling out to C++ to perform almost all optimizations. We have migrated work into JIT generated code only where it has been a performance benefit to do so. Of course, that doesn't mean that we always got it right, or that the trade-offs haven't changed, or that the policy might not need to be tweaked on different platforms. Please feel free to experiment, and if you can produce patches that reduce the amount of work done in these JIT generated trampolines while improving performance then we'll be hugely appreciative (in fact, it needn't even be a performance win here – anything that doesn't degrade performance could be a nice simplification).

5. The generated machine code is weakly optimized, so instead of having calling code which is well-optimized by the C/C++ compiler for MIPS, it is executing weakly optimized dynamically generated code. Since the code is weakly optimized, it is also much larger than it should be, which also makes the instruction cache much less effective.

6. The JIT-generated code resides in the data cache, and must be flushed to main memory, then the instruction cache must be invalidated so the new code will load into the instruction cache. Because the WebKit JIT seems to do lazy compilation of functions at call time (instead of compiling all the functions in one pass), this requires the data cache to be flushed and the instruction cache to be invalidated every time a new function is generated, which further degrades performance. This type of code generation strategy is OK for processors with unified caches (or pseudo-unified on x86) but for RISC machines with separate instruction and data caches, it's really awful.

Naturally on ARMv7 we face the same issue, and the costs associated with cache flushing are significantly outweighed by the performance improvements provided by the associated optimizations. There is, however, a cost here, and one that we are certainly interested in reducing. There is potential to coalesce cache flush operations to reduce the overhead. For some of the values that are patched it may make sense to replace the instruction patching with constant pool loads, to make the values cheaper to update (of course, having a constant pool available to the code may be beneficial on all platforms, and is something we would be interested in introducing in a cross-platform fashion).
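
One way the flush coalescing mentioned above could look, as a sketch under stated assumptions rather than anything in the tree: patches are written immediately, but the touched range is only recorded, and a single cache flush covers all of them at commit time. `PatchBatch` is a hypothetical class; `__builtin___clear_cache` is the GCC/Clang builtin, so this assumes one of those compilers, and it assumes all patched slots live in the same code buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: batches instruction patches and flushes once.
class PatchBatch {
public:
    // Write the new value now, but only record the dirty range.
    void patch(std::uint32_t* slot, std::uint32_t value) {
        assert(slot != nullptr);
        *slot = value;
        char* begin = reinterpret_cast<char*>(slot);
        char* end = begin + sizeof(std::uint32_t);
        if (!dirty_) {
            begin_ = begin;
            end_ = end;
            dirty_ = true;
        } else {
            // Assumes every slot is inside the same buffer.
            begin_ = std::min(begin_, begin);
            end_ = std::max(end_, end);
        }
    }

    // One flush covering every patch made since the last commit.
    void commit() {
        if (!dirty_)
            return;
        __builtin___clear_cache(begin_, end_);
        ++flushes_;
        dirty_ = false;
    }

    int flushes() const { return flushes_; }

private:
    char* begin_ = nullptr;
    char* end_ = nullptr;
    bool dirty_ = false;
    int flushes_ = 0;
};
```

The trade-off is that the flushed range spans everything between the lowest and highest patched address, so this only pays off when patches are clustered; a list of separate ranges would suit scattered patches better.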

Of course, it may not prove possible to make the optimizations that are currently implemented through code patching make sense on all platforms. For this reason (and to assist in bringing up new platforms) there are #defines in Platform.h to allow the patching optimizations to be disabled. We will be happy to accept performance improvements to the non-patching code paths.
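
For anyone bringing up a port, the shape of such a switch is roughly as follows. This is an illustrative sketch, not an excerpt from Platform.h: only the macro name ENABLE_JIT_OPTIMIZE_CALL appears in this thread, and `usesCallLinking` is a hypothetical helper showing how code can branch on the setting.

```cpp
#include <cassert>

// Assumption: the port turns the call optimization off unless a value
// has already been chosen elsewhere (e.g. on the compiler command line).
#ifndef ENABLE_JIT_OPTIMIZE_CALL
#define ENABLE_JIT_OPTIMIZE_CALL 0
#endif

// Hypothetical helper: reports which call path this build would use.
inline bool usesCallLinking() {
#if ENABLE_JIT_OPTIMIZE_CALL
    return true;   // patching, linked call path
#else
    return false;  // generic, non-patching call path
#endif
}
```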

cheers,
G.



_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
