> On Fri, Sep 21, 2001 at 02:24:43PM -0400, Dan Sugalski wrote:
> > Doing this by hand with -O3, you can see a speedup of around a factor of 45
> > over an unoptimised runops loop, so it's definitely worth doing in some
> > cases...
>
> Cool! Parrot JIT!
>
> But that tells us *just* how time-consuming our runops loop is.
> :(
>
I think the slowest part is the function call. But this 45x performance
boost was probably measured on x86. SPARC has a very nice function-call
optimization (register windows, so no main-memory access is needed). Given
that the inner loop probably maxes out at 5 function calls deep
(do->op->api_func->c_func..), SPARC is nearly optimal for parrot.
Simply going to a switch architecture gets rid of the "op-code"
function call. The only remaining slowdown is the code branching (which
thwarts pipelining / instruction-level parallelism).
Either way, branching can be reduced by clustering a sequence of
regularly-used ops together into macro-ops. I'm not of the opinion that
concatenating everything into one monolithic function is of use
(converting jsr/branch to gotos), since it doesn't extend well to dynamic
operation (redefining functions / dynamically loading code). A C function
call isn't that much more destructive than a non-local goto, so I'd
recommend these "macro-ops" be comprised of 2 or more op-codes grouped
into a function (probably using a hash-coded function name: F124x3ba for
example). Using hash tables at compile time, it would be possible to
reuse macro-ops throughout the code. The more code reuse (through
variable-complexity statistical code analysis; e.g. -O2, -O3, etc.), the
better the locality and cache coherence, and the smaller the executable
size. What's more, it might even be possible to reuse these macro-ops in
eval / "use" statements if the compile info is maintained until run-time.
The macro-ops can be used either as a free-standing collection of functions
(compiled and linked in dynamically at run-time; portability issues?), or
as a monolithic switch statement (of ops + macro-ops). Note, there is
little need to group the macro-ops together into a monolithic function,
since the entirety of the main code could become a single macro-op (which
optionally inlines the other macro-ops). Ideally the ultimate JIT would
generate assembly and splice it into an existing executable memory
segment (self-modifying code). (yeah right)
As a plan of attack, I can see the first stage producing a monolithic
function (as with the original request), then later breaking out
function blocks (anything in brackets, including the eventual subroutine)
into macro-ops, then, just for fun (or when there's time), performing
statistical analysis to maximize code reuse.
As a further note, macro-op code reuse would be most efficient if we had a
manner of generalizing parameters. This isn't a problem in stack-based
code (such as Java's), but we have to deal with it due to mappable
registers. Two code segments may differ only in the specific registers
utilized (based on compiler allocation), thereby thwarting consolidation.
As a possible solution, the macro-op is provided the complete set of
register addresses and ignores those specified by the byte-code generator.
Namely:
Original byte-code segment:

    set I1, {constantA}
    set I2, {constantB}
    add I3, I1, I2

The macro-op _could_ be:

    F12345(code, interp):
        REG_INT(1) = {constantA};
        REG_INT(2) = {constantB};
        REG_INT(3) = REG_INT(1) + REG_INT(2);

Note how we've hard-coded the reg-addrs. This function takes no
parameters.
My proposal is:

    F12345(code, interp):
        REG_INT(P1) = P4;
        REG_INT(P2) = P5;
        REG_INT(P3) = REG_INT(P1) + REG_INT(P2);
The above can be interpreted either as a generic function with extended
args, which will be called as F12345(code, interp, P1, P2, P3, P4, P5)
inside another macro-op, or via a custom-built p-code which acts in all
ways identically to our original p-code except for the dramatic expansion
of op-codes. What's interesting about this approach is that it can use the
same interpreter for optimized code. The main functional difference is
that we've reduced the size of the byte-code by several orders of
magnitude. Inner loops can possibly fly (without any function calls or
non-local branches).
The calling sequence could be as follows:

    edit foo.pl
    perl6 foo.pl
        compile foo.pl, generate pasm (output if -S option)
        assemble pasm into pbc (output if -o option)
        if (-Ox):
            generate macro-ops from pbc
            generate non-portable pbc from macro-ops
            compile macro-op code to executable code (DLL/.so where available)
                output if -b option, else write to temp file
            if can-dynamically-load: load library; else: exec new code
        run interpreter on pbc

Note: you can invoke perl6 with .pl, .pasm, .pbc, or .npbc:.nplib
(non-portable byte-code and its associated library).
Food for thought...