I used to think of the technique I came up with as continuation passing style, but over the years, I’ve come to think of it as something different. It’s closer to the style of machine code, where we get to decide how to use the instructions to implement function calls.
Level one was designed to be both trivial to work with and (when combined with
primitives) computationally complete. Since control flow constructs like
subroutine calls always seem to fall short when attempting to add features like
exception handling, coroutines, and backtracking, I decided to build them all
from the same collection of parts. The key feature was an immutable
representation of continuations. I could very easily build up conditional flow
from polymorphic dispatch, and build loops from continuation invocation. Let’s
explore loops for a moment. Here’s the implementation of “Repeat_”, which
simply executes a function repeatedly, forever.
Public method "Repeat_" is
[
    action : []→⊤
|
    $loop;
    action();
    Restart loop
] : ⊥;
The “$loop;” declares a label, meant to be similar in style to a label in
assembly language. What it actually does in Avail is more powerful: it creates
a “copy” of the continuation stack and assigns it to a constant named “loop”.
After the action is executed, the “Restart_” primitive is invoked, passing that
continuation as an argument. The very next level one instruction executed
after the restart is the one that captures a new “loop” continuation – because
the continuation that was constructed is positioned at the very start of the
body of the “Repeat_” method.
So that loops forever, or until something else causes the fiber’s current
continuation to be hijacked. Because the method never actually returns, it has
type bottom (⊥). Similarly, the call “Restart loop” has type bottom, which is
why there is no semicolon after it. Only statements are followed by semicolons
(a statement being an expression of type top (⊤), such as the invocation of the
top-valued function “action”).
If we wanted to be able to conditionally exit from that loop, we could simply
perform “Exit loop” at some point. Then we would have to change the return
type of “Repeat_” to top. Alternatively, if we want to exit with some value,
we would strengthen “Repeat_”’s return type to any (or something stronger), and
invoke “Exit_with_” instead, supplying the value to return from the call to
“Repeat_”.
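As a rough sketch of the mechanism, here is a Python stand-in (all names invented; exceptions stand in for first-class continuations, which makes it strictly less general — a real Avail continuation can outlive its frame and be restarted any number of times):

```python
class ExitWith(Exception):
    """Models exiting a label with a value: unwind to the label site."""
    def __init__(self, value):
        self.value = value

def repeat(action):
    """Models "Repeat_": run `action` forever until it exits with a value."""
    while True:               # "$loop;" — each pass re-enters the label
        try:
            action()          # "action();"
        except ExitWith as e:
            return e.value    # the "Exit_with_" case
        # falling off the end of the loop body models "Restart loop"

# Usage: count upward, then exit the loop with the final count.
state = {"n": 0}
def step():
    state["n"] += 1
    if state["n"] == 3:
        raise ExitWith(state["n"])

result = repeat(step)         # 3
```

The sketch only goes one way, of course: an exception can unwind to the label, but it can't be used to re-enter it later the way a captured continuation can.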
This is sort of a roundabout way of building a looping control structure, but I
wanted to make sure the basic mechanism had maximum generality. In languages
like Eiffel, there is no way to exit multiple loops and conditionals in a
single step. Java allows named control structures, and Avail essentially does
the same by giving user-created code full access to labels.
Conditional dispatch used to be done inside an “If_then_else_” primitive, but
now that we’re inlining the dispatch tree, we simply use a method with two
primitive implementations: One whose first argument is always true (typed as
true’s type) and evaluates the second argument, and one whose first argument is
always false and evaluates the third argument. The level two code generates a
test-and-branch to the false case, an invocation of the second argument, a jump
past the false case, and then the false case: an invocation of the third
argument, falling through to the same place the true case jumped to. However,
there’s almost nothing special about booleans and the conditional primitives.
We even use a similar technique for the short-circuiting “and” and “or”.
In theory, an Avail user could define an extended boolean type that includes a
new atom “maybe”, and have it run both alternatives. The VM does produce the
true and false atoms when, say, comparing two numbers, but in theory the use of
“maybe” would still be perfectly type-safe.
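That dispatch-based conditional can be sketched loosely in Python — a dict keyed by the condition’s exact value plays the role of the dispatch tree over true’s type and false’s type (all names here are invented):

```python
# One "implementation" per exact condition value, mirroring the two
# primitive implementations typed as true's type and false's type.
# Selecting an implementation replaces any built-in branch construct.
_if_impls = {
    True:  lambda then_f, else_f: then_f(),
    False: lambda then_f, else_f: else_f(),
}

def if_then_else(condition, then_f, else_f):
    return _if_impls[condition](then_f, else_f)

# A user-added "maybe" atom (hypothetical) that runs both alternatives:
MAYBE = object()
_if_impls[MAYBE] = lambda then_f, else_f: (then_f(), else_f())

answer = if_then_else(2 + 2 == 4, lambda: "yes", lambda: "no")   # "yes"
```

The point is that nothing in `if_then_else` inspects booleans specially; adding `MAYBE` is just adding one more implementation to the dispatch.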
So because we use primitives and polymorphic dispatch to implement loops and
conditionals, we don’t actually need to do any branches or jumps in the level
one code. Therefore level one contains no control flow nybblecodes, other than
L1_doCall. Since every function that doesn’t exit early simply returns the
last value produced, we don’t even bother with an L1 return instruction. We
used to, but since it only ever occurred as the last instruction of a function,
it was pointless. Now it’s simply implied at the end of every function. I’m
currently in the process of adding L1_doSuperCall (technically
L1Ext_doSuperCall, since it requires the extension nybblecode), which also
required me to create L1Ext_doGetType and L1Ext_doMakeTupleAndType. A little
while ago I added L1_doPermute for argument reordering. A few years back we
added L1Ext_doDuplicate to allow inline assignments like “Print: someTuple[(x
:= 2+2)]”. We don’t support it in the built-in grammar currently, but our
macro tests are able to generate it. It’s there to make it easier to support
other languages. The complete list of level one instructions is found in the
enumeration L1Operation (and listed as methods in L1OperationDispatcher).
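To give a feel for how little L1 needs, here is a toy Python interpreter for an L1-flavored instruction set (heavily simplified and entirely invented — the real L1_doCall references a message bundle and an expected return type, and dispatch selects the function):

```python
def run(fn, args):
    """Execute a tiny raw function: no branches, no explicit return —
    the only control-flow-ish instruction is a call."""
    local_values = list(args)
    stack = []
    for ins in fn["code"]:
        if ins[0] == "pushLiteral":
            stack.append(fn["literals"][ins[1]])
        elif ins[0] == "pushLocal":
            stack.append(local_values[ins[1]])
        elif ins[0] == "doCall":              # the lone control-flow op
            callee, argc = ins[1], ins[2]
            call_args = [stack.pop() for _ in range(argc)][::-1]
            stack.append(callee(*call_args))
    return stack[-1]    # implicit return of the last value produced

# A "raw function" that doubles its sole argument via one call.
double = {
    "literals": [2],
    "code": [
        ("pushLocal", 0),
        ("pushLiteral", 0),
        ("doCall", lambda a, b: a * b, 2),
    ],
}
value = run(double, [21])    # 42
```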
I’m sure you see how level one provides a super-clean layer for Avail code. In
theory, someone can write a compliant Avail interpreter that only ever
interprets these nybblecodes. In fact, all the Avail tools (e.g., a debugger)
that we plan to implement will only need to operate on the level one
instruction set.
However, in our current implementation there’s a layer below the visible
surface called level two (L2). Where the L1 instructions operate on a stack,
L2 is designed as a stackless register-transfer language with an infinite
number of non-windowed registers. Optimizing L1 into alternative L1 would be
pointless, since it would be unable to do much more than fold the occasional
primitive, and couldn’t even *represent* code in which a method dispatch was
inlined. But L2 is designed to be easy to optimize and inline. Since even
basic control flow is done as method calls in Avail, we need to inline to get
back some of the performance that we lost by keeping L1 so clean. So L2 has
jumping, testing-and-branching, continuation creation, continuation “exploding”
(extracting each field of the continuation into registers), and the idea of
off-ramps and on-ramps.
Say we’re running some L2 code for function F (called the L2 chunk for F), and
another fiber wants to add an implementation of some method that F calls. Say
we have a dispatch tree inlined in F’s L2 representation. How do we coordinate
the change so that the new method implementation is available in the current
fiber? Off-ramps. We make it the responsibility of the L2Translator (that
converts L1 -> L2) to explicitly poll for interrupt conditions at various
points in the L2 code. If the polling code sees the need to get into an
L1-consistent state (say because someone wants to add a method implementation,
or just to allow another fiber a chance to run for a while), it branches to
some L2 code (an “off-ramp”) that is able to construct a suitable continuation
(because that’s the code that the translator created), stuff it back into the
running fiber, and process an interrupt. While there are tasks queued that
need to run while all fibers are in an L1-consistent state (e.g., adding a
method implementation), the thread pool prevents fiber-execution tasks from
running, and signals the existing fibers that someone wants L1-consistency.
When the last running fiber becomes L1-consistent, all tasks requiring L1
consistency are executed. When the last one completes, the fibers are made
runnable again, at which point they may diverge into a cloud of L2
instructions, safe in the assumption that methods are not changing.
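Here’s a deliberately tiny cooperative sketch of that protocol in Python — generators stand in for fibers, each `yield` is an off-ramp poll point, and every name is made up for illustration:

```python
class SafePointScheduler:
    """Toy model: tasks that need L1 consistency run only when every
    fiber is parked at an off-ramp (here, a generator `yield`)."""
    def __init__(self):
        self.fibers = []      # cooperative fibers (generators)
        self.pending = []     # tasks queued for an L1-consistent moment

    def request(self, task):
        self.pending.append(task)

    def run(self):
        while self.fibers:
            if self.pending:
                # Between steps, every fiber sits at an off-ramp with a
                # reified continuation, so the tasks may run safely.
                for task in self.pending:
                    task()
                self.pending.clear()
            fiber = self.fibers.pop(0)
            try:
                next(fiber)   # run the fiber to its next off-ramp
                self.fibers.append(fiber)
            except StopIteration:
                pass          # fiber finished

trace = []
def fiber(name, steps):
    for i in range(steps):
        trace.append((name, i))
        yield                 # off-ramp: poll for interrupt conditions

sched = SafePointScheduler()
sched.fibers = [fiber("A", 2), fiber("B", 2)]
sched.request(lambda: trace.append("add-method-definition"))
sched.run()
```

The real thing is preemptive (a thread pool withholding fiber-execution tasks and signaling running fibers), but the invariant modeled is the same: the method-changing task never interleaves with a fiber that is mid-chunk.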
When a method implementation gets added (during an L1-consistent time), it
iterates through its dependent L2 chunks and marks them as invalid. When the
fibers continue running again, they check if they’re in an invalid chunk, and
if so, they switch to the default unoptimized L2 chunk (which is always valid
as it doesn’t have a dependency on any methods). The other side of the
equation is that when a dispatch tree for a call to method M is written into
chunk C, chunk C gets added to the set of dependencies of method M. Then we
have nice, happy safety guarantees for running L2 code without much fuss,
including deoptimization for dynamic changes.
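A minimal sketch of that dependency bookkeeping, with invented names:

```python
class Chunk:
    """An L2 chunk; the default chunk depends on no methods, so it
    can never be invalidated."""
    def __init__(self):
        self.valid = True

DEFAULT_CHUNK = Chunk()           # the unoptimized, always-valid chunk

class Method:
    def __init__(self):
        self.dependents = set()   # chunks with this method's dispatch inlined

    def inline_into(self, chunk):
        # Writing a dispatch tree for this method into `chunk` records
        # the dependency, so later changes can deoptimize the chunk.
        self.dependents.add(chunk)

    def add_definition(self):
        # In the real VM this happens only at an L1-consistent time.
        for chunk in self.dependents:
            chunk.valid = False
        self.dependents.clear()

def chunk_to_resume(chunk):
    """What a fiber runs after a safe point: the chunk if still valid,
    otherwise the always-valid default chunk."""
    return chunk if chunk.valid else DEFAULT_CHUNK

opt = Chunk()
m = Method()
m.inline_into(opt)
still_valid = chunk_to_resume(opt)    # still the optimized chunk
m.add_definition()                    # invalidates every dependent chunk
fallback = chunk_to_resume(opt)       # falls back to the default chunk
```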
We make it the translator’s job to ensure L1 consistency at certain times.
Right now that only happens right after a non-primitive call (i.e., when
entering a non-primitive function), which should be more than adequate
granularity, especially since we don’t have any long-running blocking
primitives. Since the L2 instruction that checks for an interrupt and the L2
instructions that construct the reified L1 continuation(s) are ordinary L2
code, they’re subject to the same type of optimizations as the rest of the L2
chunk. Since interrupts are supposed to be the infrequent case, and since
continuations are conceptually immutable, once we support inlining of
non-primitives we’ll be able to postpone creation of a whole series of stack
frames (continuations) representing inlined functions. Technically it’s not
postponing; it’s pushing the responsibility for creating the frames into the
off-ramp code, so that if an interrupt doesn’t happen (the very likely case),
the stack frames never need to be created at all.
Now here’s the most fun part: Since the continuations are just data structures,
we can use a general object escape analysis to optimize them, including eliding
their creation in many cases. And we can detect when a “Restart_” invocation
is of a continuation that we created in the same chunk, converting it into a
jump (and maybe a few register writes). Similarly, function closures are just
data structures that bind values with raw functions, so anywhere that such a
closure “travels” inside an L2 chunk via inlining, we should be able to simply
use the variables/values directly from the registers that provided the values
that went into construction of the closure, rather than extract them from the
closure. If we end up eliminating all value-extractions from the closure and
it doesn’t escape from the L2 chunk, well, the closure construction operation
becomes a dead write to the register holding the closure. Since it had no
side-effects beyond creating and writing that closure into a register, we
automatically remove that L2 instruction, saving the cost of constructing the
closure. As with continuations, the closure is just a data structure, so we
get this benefit automatically by implementing general object escape analysis.
We don’t have most of these optimizations in place yet, but they’re “ripe for
the picking” once we do non-primitive inlining.
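As a flavor of that dead-write elimination, here’s a minimal single-pass sketch (a real pass would iterate to a fixpoint — removing one dead write can make another dead — and would track escapes properly; the instruction encoding is invented):

```python
def eliminate_dead_closures(instructions):
    """Drop side-effect-free closure constructions whose destination
    register is never read (so the closure cannot escape the chunk).
    Instructions are (op, dest_register, source_registers) triples."""
    read = set()
    for _op, _dest, srcs in instructions:
        read.update(srcs)
    return [
        (op, dest, srcs)
        for (op, dest, srcs) in instructions
        if not (op == "make_closure" and dest not in read)
    ]

code_before = [
    ("make_closure", "r1", ("r0",)),   # r1 is never read: a dead write
    ("add", "r3", ("r0", "r2")),
    ("return", None, ("r3",)),
]
code_after = eliminate_dead_closures(code_before)
# the make_closure into r1 is gone; everything else is untouched
```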
Now, consider what will be below L2. The L2 code will already have had “high
level optimizations” performed on it, such as eliminating the creation of
objects that don’t escape, canceling primitive sequences (e.g., converting
<a,b>[2] into b), and performing other semantically valid rewriting (e.g.,
replacing s∪s with s). If we “transliterate” the L2 instructions through a
simple translator into JVM or LLVM code, we can leverage most of the low-level
optimizations of those substrates. And we’ll eliminate the L2 interpretation
costs (fetch an L2 instruction, dispatch it megamorphically, repeat). For the
JVM I’m thinking we’ll collect the registers used by an L2 chunk and create
corresponding instance variables for them. The chunk can be structured as a
big switch statement with fall-throughs, one case per L2 instruction, starting
with the main entry point for case 1. If we return into the L2 structure,
we will usually do so through the Java stack. If the Java stack gets too deep
(say more than 20 stack levels), or truly reified L1 continuations have to be
constructed for some other reason, we can throw a special Java exception. The
generated code can be wrapped in try/catch handlers that cause reification of
the L1 continuations on the way out. In order to do an Avail “return” into the
code after such a reification, you simply invoke it again, passing an argument
containing the L2 program counter. The switch then takes you to the correct
spot, just as if you had never returned in the first place. That gives the
performance benefit of using the Java stack the vast majority of the time,
without its usual limitations (when the Avail code wants to use deep
recursion). It even allows backtracking behavior, like returning into/from the
same continuation multiple times. I’m guessing that even before we have
non-primitive inlining we should see a several-fold improvement in performance
by generating JVM code from L2.
One last thing. When we inline a non-primitive function, we won’t look at its
L1 nybblecodes and recompute all the same optimizations that we figured out the
last time we inlined that function. Instead, we can grab the function’s L2
implementation and directly embed it. There will often be new optimization
opportunities apparent when we recompute the type information for the
*specific* types of the arguments supplied at the call site, but the *general*
optimizations will already have been performed. If the function we’re inlining
had other functions inlined inside it, that’s even more work that we don’t have
to do again. We'll still get an opportunity to further optimize everything
with the potentially stronger types of the arguments (such as eliminating
now-unreachable cases from dispatch logic). And later, if someone wants to
inline a call to the function that we’re inlining into, they'll save having to
redo all that hard work.
The L1 and L2 ideas date back to the early 90s. I started by writing L1 in
Smalltalk (Visualworks 2.0), then got an L2 running in Smalltalk as well. Then
I built a mechanical translator that was able to translate almost all of the
Smalltalk code into C++ (including comments). To get this to work, I provided
some hints and alternative translation directives in the Smalltalk code via an
#equivalentC: method that did nothing in Smalltalk. I then got the L1
interpreter running in C++. It was approximately the same speed as the L2 code
running in Smalltalk. Technically, the C++ code was missing a few crucial
pieces that Smalltalk supplied, like an incremental garbage collector. The
AvailObject representation in Smalltalk was basically a smart pointer into the
C++ heap, where all Avail objects were allocated. I was trying for several
years to get the L2 code to run in C++, but due to a combination of buggy
compilers, ZERO debugging support for C++, and, well, C++ being just so damned
awful, I wasted most of a decade struggling pointlessly with it. I think it
was 2009 or 2010 when Todd convinced me that Java was up to the task, at which
point I reworked the translator to produce Java code instead of C++. At some
point almost all of the code was in Java and looked mostly right (but with
*thousands* of compilation errors), and Todd convinced me to pull the plug on
Smalltalk. We probably would have saved a bit of effort by throwing another
month or two at tweaking the translator, but a combination of Eclipse,
thousands of passes with regex, eyeballs, and sore fingers got things running
in just a few months. I have archives of the Smalltalk and C++ code, but
they’re just a historical curiosity at this point. The moment Todd convinced
me the Java translation was good enough, the Java code became the master copy
which we began editing by hand.
The mechanical translator was only ever written in Smalltalk, and I have no
intention of reviving it for any purpose within Avail. An L2 -> JVM or LLVM
translator will be a very different beast (note: no source code). However, at
some point we will write another translator to translate our Java code that
implements the Avail VM into a pidgin Avail called Avalanche. Avalanche code
is not intended to be directly executed; we'll scan through all the compiled
code (and parse trees) to produce LLVM code that builds a new Avail VM. When
that VM runs well enough, we’ll archive the Java source, just like we did for
the Smalltalk source. Then we’ll edit Avalanche code when we want to update
the VM. Java has been an ENORMOUS leap above Smalltalk and C++ (especially
C++), but we’re expecting much better performance from LLVM. Static
profile-guided optimization of much of the VM will certainly help the startup
time. We’ll also be able to get a custom garbage collector running again in
LLVM to deal with the unique opportunities that Avail provides.
If you’d like, I could walk you through the L2 translation of the “Repeat_”
method. I’m sure it would be interesting.
> On Jan 17, 2015, at 1:09 PM, Robbert van Dalen <[email protected]> wrote:
>
> Mark,
>
> I don’t know what to say. All this is mind blowing.
> Some parts I don’t yet understand, but hopefully I will learn them on the way
> (I might have to study some old papers on continuation passing style
> compilers).
> I’m *definitely* going to try out the “-DavailDeveloper=true” flag, and peek
> around.
>
> Question: how much of the L1 and L2 code is inherited (machine translated?)
> from C++ and Smalltalk?
> And how many machine translators are currently still in use? I reckon that
> moving to LLVM will involve some machine translation.
>
> cheers,
> Robbert.
>
>> On 17 Jan 2015, at 04:28, Mark van Gulik <[email protected]> wrote:
>>
>> When a method is running as level one code (also called nybblecodes), the
>> call instruction (see L1Operation.L1_doCall) has operands that refer by
>> index to two literals in the compiled code (also referred to as a "raw
>> function" in Avail code). The first of these literals is the actual method
>> object to be invoked, and the second is the expected return type from the
>> call.
>>
>> I lied. The first of those literals is really a message bundle. A message
>> bundle knows not only the method, but also the name being used to invoke it.
>> Remember, we can rename methods when they're imported. We could refer
>> directly to the method instead (and indeed we used to) and everything would
>> work just fine, but in that case we would have to make a guess about what
>> name to use when decompiling a function (from level one nybblecodes and
>> literals into a block phrase, and then perhaps into a string).
>>
>> So we can get from a call site in the raw function to a message bundle, and
>> from there to its method. Besides keeping track of a collection of
>> definitions of the method (abstract, forward, macro, and method definition)
>> and its semantic restrictions, a method also maintains a lazy decision tree.
>> Yup, you guessed it pretty much correctly.
>>
>> We use a fairly simple algorithm to build the tree at the moment – at each
>> decision point we examine all signatures of potentially applicable
>> definitions and figure out which one is guaranteed to eliminate the most
>> definitions, whether the test for the signature succeeds or fails. That is,
>> for each signature S we figure out how many signatures are disqualified if
>> the arguments conform to S, and how many signatures are disqualified if the
>> arguments do not conform to S. We take the minimum of those two values and
>> choose the S that has the largest minimum. We break ties by using the
>> maximum maximum. That produces as close to a balanced structure as
>> possible, even though your dynamic balance mechanism would be statistically
>> superior – we’re just not collecting call statistics yet. The best tree
>> possible would maximize the entropy at each branch, so we should use the
>> signature S that maximizes the minimum (fail versus succeed) total number of
>> historic counts excluded by S.
>>
>> When running level one code, we simply use (and expand branches as needed)
>> the decision tree to find which method definition (and thereby which
>> function) to invoke. If there are multiple most-specific definitions then a
>> runtime exception is raised. Similarly, an exception is raised if there are
>> no most-specific definitions, or if the most specific definition is an
>> abstract or forward definition.
>>
>> When we optimize level one code into level two, we first determine how many
>> definitions are potentially applicable. If it’s larger than some threshold
>> (L2Translator.maxPolymorphismToInlineDispatch = 10), we generate the
>> general, compact level two lookup code. If it’s smaller, we construct a
>> decision tree specific to the definitions reachable at that call site and
>> generate level two decision-tree logic as conditional jumps. We check a
>> single argument at a time, and the code generator is very careful to keep
>> track of what type information is known at each instruction in the generated
>> decision tree. We even special-case small enumeration types by using a
>> small number of equality checks rather than a type test. So at each leaf
>> region of the decision-tree logic we know exactly which function is being
>> invoked – or we know information about a failed lookup.
>>
>> In theory, we should be able to inline functions quite nicely at such
>> locations. We currently only attempt to inline and fold primitive
>> functions, but level two was designed to be inlined easily. We just aren’t
>> there yet. Primitives are annotated with information about whether it’s
>> safe to fold or inline them, and they also provide hooks that say what types
>> they’ll produce given specific argument types. These are stronger
>> guarantees than what’s exposed to Avail, but they’re guarantees that we’re
>> comfortable making directly in the VM. We retain the right to strengthen or
>> weaken them for performance/correctness reasons, so the Avail programmer
>> must not see them. They’re only used by the VM to optimize level two code.
>> Semantic restrictions also strengthen expected return types, but those have
>> to be checked explicitly at runtime since they might be wrong. The
>> VM-guaranteed type strengthening is supposed to always be right, otherwise
>> it’s a VM bug.
>>
>> Primitives provide hooks for other details, like determining whether it’s
>> impossible/possible/mandatory that they could fail given particular argument
>> types (e.g., 2+2, x-2, x+∞, ∞-∞, 3÷x, 3÷0). They can also directly control
>> the level two instructions that they generate, which is a great way to
>> decentralize the optimizer. Ultimately we’ll have other behavior like
>> determining how they compose (e.g., <a,b>[2]+2+1), whether their execution
>> can commute, be hoisted out of loops, etc.
>>
>> If you want to get a glimpse into the L2 instruction set and its
>> effectiveness, run the AvailWorkbench with the additional option
>> “-DavailDeveloper=true”. After running some code you can use the menu
>> "Developer -> Generate VM Report” to see what megamorphic disatches weren’t
>> expanded in place, how long is spent running which L2 instructions and
>> primitives, and how expensive it is to check the return types from various
>> primitives and non-primitives. Be warned that the non-primitive return
>> check statistics are fairly new and can be a bit verbose.
>>
>> If you’re running in Eclipse, you can set a breakpoint for throws of
>> AvailBreakpointException. Or just set a code breakpoint on line 65 of
>> P_257_BreakPoint. Then you can see the interpreter state whenever Avail
>> executes a “Breakpoint;” statement. From there you should be able to poke
>> around and see what the level two code looks like. At the moment, the 11th
>> execution of a (raw) function is what triggers translation to level two, so
>> you could use an Avail “Breakpoint;” to get to the right area, then set a
>> Java breakpoint in L2Translator.translate() and catch it translating the
>> code you’re interested in.
>>
>> The level two instruction set is still interpreted, so you could also
>> breakpoint certain L2 operations to see them in action – and see what the
>> surrounding level two code looks like. Eventually, each L2Operation will
>> know how to generate JVM or LLVM instructions for itself. We’re not there
>> yet.
>>
>> I think the trickiest thing about following the level two code is that there
>> are no subroutine calls. The L2_Invoke operation does a jump, although it’s
>> passed a continuation representing how to get back to what it was doing.
>> Since any L2 code can be invalidated at any safe point, it would be wrong to
>> pretend that it always returns to the point it was invoked from. Instead,
>> when it returns into a continuation, its level two information is examined;
>> if the L2 chunk it’s returning into has been invalidated, it uses the stored
>> L1 information instead.
>>
>> Ok, that should be enough to get you started. Oh, one more thing: Follow
>> the instructions in the Javadoc class comment for AvailObjectFieldHelper to
>> tell the Eclipse debugger how to display Avail objects more appropriately.
>>
>> Oh, and make sure to tell the mailing list the cool (or gross) stuff you
>> find in there. Some of that code is older than Java!
>>
>>
>>
>>> On Jan 15, 2015, at 2:02 PM, Robbert van Dalen <[email protected]>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I want to know more about Avail’s method dispatching (at the call site).
>>> In the future, what would be the most efficient way to compile a method to
>>> low level (machine) code?
>>>
>>> My guess would be a simple if/then/else decision tree produced by a tracing
>>> jitter (great for speculative branching), with statistically the most
>>> common dispatch on top of the decision tree.
>>> But I wonder how a megamorphic dispatch on multiple precise and nested
>>> types can be compiled into a simple if/then/else decision tree?
>>>
>>> cheers,
>>> Robbert.
>>
>