Mark,

Thank you for taking the time to write such an informative post.
That said, Avail's layered compiler technology is currently beyond my powers to fully comprehend. I want to learn, but there is so much other stuff to take in. From what I understand so far, Avail’s dynamic compiler appears to be a strongly typed version of Self’s. That’s a big achievement, because Self is still considered to embody some of the most advanced compiler technology to date (next to the JVM -server compiler, which originated from Self). I also like Avail’s ‘everything immutable’ as a default. It is very interesting to see that pervasive use of immutability leads to such a clean VM design (whilst still enabling efficient compilation). Remember, ‘all things immutable’ is my pet topic.

cheers,
Robbert.

> On 18 Jan 2015, at 00:06, Mark van Gulik <[email protected]> wrote:
>
> I used to think of the technique I came up with as continuation-passing style, but over the years I’ve come to think of it as something different. It’s closer to the style of machine code, where we get to decide how to use the instructions to implement function calls.
>
> Level one was designed to be both trivial to work with and (when combined with primitives) computationally complete. Since control flow constructs like subroutine calls always seem to fall short when attempting to add features like exception handling, coroutines, and backtracking, I decided to build them all from the same collection of parts. The key feature was an immutable representation of continuations. I could very easily build up conditional flow from polymorphic dispatch, and build loops from continuation invocation. Let’s explore loops for a moment. Here’s the implementation of “Repeat_”, which simply executes a function repeatedly, forever.
>
> Public method "Repeat_" is
> [
> action : []→⊤
> |
> $loop;
> action();
> Restart loop
> ] : ⊥;
>
> The “$loop;” declares a label, meant to be similar in style to a label in assembly language.
> What it actually does in Avail is more powerful: it creates a “copy” of the continuation stack and assigns it to a variable (constant) named “loop”. After the action is executed, the “Restart_” primitive is invoked, passing that continuation as an argument. The very next level one instruction that will be executed after the restart is capturing a new “loop” continuation – because the continuation that got constructed was actually positioned at the very start of the “Repeat_” method’s body.
>
> So that loops forever, or until something else causes the fiber’s current continuation to be hijacked. Because the method never actually returns, it has type bottom (⊥). Similarly, the call “Restart loop” has type bottom, which is why there is no semicolon after it. Only statements are followed by semicolons (a statement being an expression of type top (⊤), such as the invocation of the top-valued function “action”).
>
> If we wanted to be able to conditionally exit from that loop, we could simply perform “Exit loop” at some point. Then we would have to change the return type of “Repeat_” to top. Alternatively, if we want to exit with some value, we would strengthen “Repeat_”’s return type to any (or something stronger), and invoke “Exit_with_” instead, supplying the value to return from the call to “Repeat_”.
>
> This is a somewhat roundabout way of building a looping control structure, but I wanted to make sure the basic mechanism had maximum generality. In languages like Eiffel, there is no way to exit multiple loops and conditionals in a single step. Java allows named control structures, and Avail essentially does the same, allowing user-created code full access to labels.
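The label/Restart/Exit mechanism described above might be sketched roughly as follows. This is an illustrative Python model only (the names `with_label`, `RestartSignal`, and `ExitSignal` are invented here, and Python exceptions stand in for Avail's immutable continuations):

```python
class RestartSignal(Exception):
    """Models Restart_: re-enter the labeled frame from its start."""

class ExitSignal(Exception):
    """Models Exit_with_: return from the labeled frame with a value."""
    def __init__(self, value=None):
        self.value = value

def with_label(body):
    """Run `body`, passing it `restart` and `exit_` operations on the
    continuation captured at the 'label' (the top of this loop)."""
    def restart():
        raise RestartSignal()
    def exit_(value=None):
        raise ExitSignal(value)
    while True:                      # each iteration re-enters at the "label"
        try:
            return body(restart, exit_)
        except RestartSignal:
            continue                 # resume from the captured label
        except ExitSignal as e:
            return e.value           # unwind to the label's caller

# A conditional loop built from the same parts as Repeat_ plus Exit_with_:
counter = [0]
def body(restart, exit_):
    counter[0] += 1
    if counter[0] == 5:
        exit_(counter[0])
    restart()

result = with_label(body)   # counts to 5, then exits with that value
```

Unlike this sketch, Avail's labels capture a genuine copy of the continuation stack, so "restart" and "exit" can cross any number of intervening frames in one step, which is what makes the mechanism more general than named loops.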
> Conditional dispatch used to be done inside an “If_then_else_” primitive, but now that we’re inlining the dispatch tree, we simply use a method with two primitive implementations: one whose first argument is always true (typed as true’s type) and evaluates the second argument, and one whose first argument is always false and evaluates the third argument. The level two code generates a test-and-branch to the false case, an invocation of the second argument, a jump past the false case, and then the false case: an invocation of the third argument, falling through to the same place the true case jumped. However, there’s almost nothing special about booleans and the conditional primitives. We even use a similar technique for short-circuited “and” and “or”. In theory, an Avail user could define an extended boolean type that includes a new atom “maybe”, and have it run both alternatives. The VM does produce the true and false atoms when, say, comparing two numbers, but in theory the use of “maybe” would still be perfectly type-safe.
>
> So because we use primitives and polymorphic dispatch to implement loops and conditionals, we don’t actually need to do any branches or jumps in the level one code. Therefore level one contains no control flow nybblecodes, other than L1_doCall. Since every function that doesn’t exit early simply returns the last value produced, we don’t even bother with an L1 return instruction. We used to, but since it only ever occurred as the last instruction of a function, it was pointless. Now it’s simply implied at the end of every function. I’m currently in the process of adding L1_doSuperCall (technically L1Ext_doSuperCall, since it requires the extension nybblecode), which also required me to create L1Ext_doGetType and L1Ext_doMakeTupleAndType. A little while ago I added L1_doPermute for argument reordering.
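The two-implementation conditional described above can be modeled as a tiny multimethod. In this hedged Python sketch, dispatch is keyed on the condition value itself rather than on "true's type" vs. "false's type" as in Avail, and all names here are illustrative:

```python
# Implementations keyed by the condition, mimicking a method with one
# implementation for true's type and one for false's type.
_if_then_else_impls = {}

def implement(key, fn):
    _if_then_else_impls[key] = fn

def if_then_else(condition, then_thunk, else_thunk):
    # Pure dispatch: no branch instruction here, just a table lookup.
    return _if_then_else_impls[condition](then_thunk, else_thunk)

# Implementation whose first argument is always true: run the 2nd argument.
implement(True, lambda then_thunk, else_thunk: then_thunk())
# Implementation whose first argument is always false: run the 3rd argument.
implement(False, lambda then_thunk, else_thunk: else_thunk())

# A user-level extension of the "boolean": a `maybe` atom that runs both
# alternatives, without the dispatcher treating booleans specially.
MAYBE = object()
implement(MAYBE, lambda then_thunk, else_thunk: (then_thunk(), else_thunk()))

x = if_then_else(3 < 4, lambda: "then", lambda: "else")
```

The point of the design is visible even in the toy: the conditional is an ordinary open method, so adding "maybe" requires no VM support, and an optimizer that inlines the dispatch table recovers an ordinary test-and-branch.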
> A few years back we added L1Ext_doDuplicate to allow inline assignments like “Print: someTuple[(x := 2+2)]”. We don’t support it in the built-in grammar currently, but our macro tests are able to generate it. It’s there to make it easier to support other languages. The complete list of level one instructions is found in the enumeration L1Operation (and listed as methods in L1OperationDispatcher).
>
> I’m sure you see how level one provides a super-clean layer for Avail code. In theory, someone could write a compliant Avail interpreter that only ever interprets these nybblecodes. In fact, all the Avail tools (e.g., a debugger) that we plan to implement will only need to operate on the level one instruction set.
>
> However, in our current implementation there’s a layer below the visible surface called level two (L2). Where the L1 instructions operate on a stack, L2 is designed as a stackless register-transfer language with an infinite number of non-windowed registers. Optimizing L1 into alternative L1 would be pointless, since it would be unable to do much more than fold the occasional primitive, and couldn’t even *represent* code in which a method dispatch was inlined. But L2 is designed to be easy to optimize and inline. Since even basic control flow is done as method calls in Avail, we need to inline to get back some of the performance that we lost by keeping L1 so clean. So L2 has jumping, testing-and-branching, continuation creation, continuation “exploding” (extracting each field of the continuation into registers), and the idea of off-ramps and on-ramps.
>
> Say we’re running some L2 code for function F (called the L2 chunk for F), and another fiber wants to add an implementation of some method that F calls. Say we have a dispatch tree inlined in F’s L2 representation. How do we coordinate the change so that the new method implementation is available in the current fiber? Off-ramps.
> We make it the responsibility of the L2Translator (which converts L1 -> L2) to explicitly poll for interrupt conditions at various points in the L2 code. If the polling code sees the need to get into an L1-consistent state (say, because someone wants to add a method implementation, or just to allow another fiber a chance to run for a while), it branches to some L2 code (an “off-ramp”) that is able to construct a suitable continuation (because that’s the code that the translator created), stuff it back into the running fiber, and process an interrupt. While there are tasks queued that need to run while all fibers are in an L1-consistent state (e.g., adding a method implementation), the thread pool prevents fiber-execution tasks from running, and signals the existing fibers that someone wants L1-consistency. When the last running fiber becomes L1-consistent, all tasks requiring L1 consistency are executed. When the last one completes, the fibers are made runnable again, at which point they may diverge into a cloud of L2 instructions, safe in the assumption that methods are not changing.
>
> When a method implementation gets added (during an L1-consistent time), it iterates through its dependent L2 chunks and marks them as invalid. When the fibers continue running again, they check whether they’re in an invalid chunk, and if so, they switch to the default unoptimized L2 chunk (which is always valid, as it doesn’t have a dependency on any methods). The other side of the equation is that when a dispatch tree for a call to method M is written into chunk C, chunk C gets added to the set of dependencies of method M. Then we have nice, happy safety guarantees for running L2 code without much fuss, including deoptimization for dynamic changes.
>
> We make it the translator’s job to ensure L1 consistency at certain times.
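The dependency/invalidation bookkeeping described above might look like the following sketch. Class and field names here are invented for illustration, not the VM's actual ones, and the concurrency machinery (thread pool, L1-consistency barrier) is omitted:

```python
class Chunk:
    """An optimized L2 chunk; `valid` is cleared when a method it
    inlined a dispatch tree for gains a new implementation."""
    def __init__(self, name, optimized=True):
        self.name = name
        self.valid = True
        self.optimized = optimized

# The default unoptimized chunk depends on no methods, so it is always valid.
DEFAULT_CHUNK = Chunk("default-unoptimized", optimized=False)

class Method:
    def __init__(self):
        self.implementations = []
        self.dependent_chunks = set()  # chunks with this method's dispatch tree inlined

    def record_dependency(self, chunk):
        # Called when a dispatch tree for this method is written into `chunk`.
        self.dependent_chunks.add(chunk)

    def add_implementation(self, impl):
        # Performed only at an L1-consistent time; invalidate dependents.
        self.implementations.append(impl)
        for chunk in self.dependent_chunks:
            chunk.valid = False
        self.dependent_chunks.clear()

def resume(fiber_chunk):
    # When a fiber resumes, deoptimize to the default chunk if invalidated.
    return fiber_chunk if fiber_chunk.valid else DEFAULT_CHUNK

m = Method()
c = Chunk("F's chunk")
m.record_dependency(c)      # chunk C inlined a dispatch tree for method M
m.add_implementation("new") # invalidates C; fibers in C fall back on resume
```

The invariant being modeled: a chunk is only ever executed while every method whose dispatch it inlined is unchanged since translation.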
> Right now that only happens right after a non-primitive call (i.e., when entering a non-primitive function), which should be more than adequate granularity, especially since we don’t have any long-running blocking primitives. Since the L2 instruction that checks for an interrupt and the L2 instructions that construct the reified L1 continuation(s) are ordinary L2 code, they’re subject to the same optimizations as the rest of the L2 chunk. Since interrupts are supposed to be the infrequent case, and since continuations are conceptually immutable, once we support inlining of non-primitives we’ll be able to postpone creation of a whole series of stack frames (continuations) representing inlined functions. Technically it’s not postponing; it’s pushing the responsibility for creating the frames at all into the off-ramp code, so that if an interrupt doesn’t happen (the very likely case), the frames never need to be created.
>
> Now here’s the most fun part: since the continuations are just data structures, we can use a general object escape analysis to optimize them, including eliding their creation in many cases. And we can detect when a “Restart_” invocation is of a continuation that we created in the same chunk, converting it into a jump (and maybe a few register writes). Similarly, function closures are just data structures that bind values with raw functions, so anywhere that such a closure “travels” inside an L2 chunk via inlining, we should be able to simply use the variables/values directly from the registers that provided the values that went into the construction of the closure, rather than extract them from the closure. If we end up eliminating all value-extractions from the closure and it doesn’t escape from the L2 chunk, well, the closure construction operation becomes a dead write to the register holding the closure.
> Since it had no side-effects beyond creating and writing that closure into a register, we automatically remove that L2 instruction, saving the cost of constructing the closure. As with continuations, the closure is just a data structure, so we get this benefit automatically by implementing general object escape analysis.
>
> We don’t have most of these optimizations in place yet, but they’re “ripe for the picking” once we do non-primitive inlining.
>
> Now, consider what will be below L2. The L2 code will already have had “high-level optimizations” performed on it, such as eliminating the creation of objects that don’t escape, canceling primitive sequences (e.g., converting <a,b>[2] into b), and performing other semantically valid rewriting (e.g., replacing s∪s with s). If we “transliterate” the L2 instructions through a simple translator into JVM or LLVM code, we can leverage most of the low-level optimizations of those substrates. Better yet, we’ll eliminate the L2 interpretation costs (fetch an L2 instruction, dispatch it megamorphically, repeat). For the JVM I’m thinking we’ll collect the registers used by an L2 chunk and create corresponding instance variables for them. The chunk can be structured as a big switch statement with fall-throughs, one case per L2 instruction, starting with the main entry point at case 1. If we return back into the L2 structure, we will usually do so through the Java stack. If the Java stack gets too deep (say, more than 20 stack levels), or truly reified L1 continuations have to be constructed for some other reason, we can throw a special Java exception. The generated code can be wrapped in try/catch handlers that cause reification of the L1 continuations on the way out. In order to do an Avail “return” into the code after such a reification, you simply invoke it again, passing an argument containing the L2 program counter.
> The switch then takes you to the correct spot, as though you had never returned in the first place. That gives the performance benefit of using the Java stack the vast majority of the time, without its usual limitations (when the Avail code wants to use deep recursion). It even allows backtracking behavior, like returning into/from the same continuation multiple times. I’m guessing that even before we have non-primitive inlining we should see a several-fold improvement in performance by generating JVM code from L2.
>
> One last thing. When we inline a non-primitive function, we won’t look at its L1 nybblecodes and recompute all the same optimizations that we figured out the last time we inlined that function. Instead, we can grab the function’s L2 implementation and directly embed it. There will often be new optimization opportunities apparent when we recompute the type information for the *specific* types of the arguments supplied at the call site, but the *general* optimizations will already have been performed. If the function we’re inlining had other functions inlined inside it, that’s even more work that we don’t have to do again. We’ll still get an opportunity to further optimize everything with the potentially stronger types of the arguments (such as eliminating now-unreachable cases from dispatch logic). And later, if someone wants to inline a call to the function that we’re inlining into, they’ll save having to redo all that hard work.
>
> The L1 and L2 ideas date back to the early ’90s. I started by writing L1 in Smalltalk (VisualWorks 2.0), then got an L2 running in Smalltalk as well. Then I built a mechanical translator that was able to translate almost all of the Smalltalk code into C++ (including comments). To get this to work, I provided some hints and alternative translation directives in the Smalltalk code via an #equivalentC: method that did nothing in Smalltalk.
> I then got the L1 interpreter running in C++. It was approximately the same speed as the L2 code running in Smalltalk. Technically, the C++ code was missing a few crucial pieces that Smalltalk supplied, like an incremental garbage collector. The AvailObject representation in Smalltalk was basically a smart pointer into the C++ heap, where all Avail objects were allocated. I tried for several years to get the L2 code to run in C++, but due to a combination of buggy compilers, ZERO debugging support for C++, and, well, C++ being just so damned awful, I wasted most of a decade struggling pointlessly with it. I think it was 2009 or 2010 when Todd convinced me that Java was up to the task, at which point I reworked the translator to produce Java code instead of C++. At some point almost all of the code was in Java and looked mostly right (but with *thousands* of compilation errors), and Todd convinced me to pull the plug on Smalltalk. We probably would have saved a bit of effort by throwing another month or two at tweaking the translator, but a combination of Eclipse, thousands of passes with regexes, eyeballs, and sore fingers got things running in just a few months. I have archives of the Smalltalk and C++ code, but they’re just a historical curiosity at this point. The moment Todd convinced me the Java translation was good enough, the Java code became the master copy, which we began editing by hand.
>
> The mechanical translator was only ever written in Smalltalk, and I have no intention of reviving it for any purpose within Avail. An L2 -> JVM or LLVM translator will be a very different beast (note: no source code). However, at some point we will write another translator to translate the Java code that implements the Avail VM into a pidgin Avail called Avalanche.
> Avalanche code is not intended to be directly executed; we’ll scan through all the compiled code (and parse trees) to produce LLVM code for a new Avail VM. When that VM runs well enough, we’ll archive the Java source, just like we did for the Smalltalk source. Then we’ll edit Avalanche code when we want to update the VM. Java has been an ENORMOUS leap above Smalltalk and C++ (especially C++), but we’re expecting much better performance from LLVM. Static profile-guided optimization of much of the VM will certainly help the startup time. We’ll also be able to get a custom garbage collector running again in LLVM to deal with the unique opportunities that Avail provides.
>
> If you’d like, I could walk you through the L2 translation of the “Repeat_” method. I’m sure it would be interesting.
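The "big switch on the L2 program counter" re-entry scheme Mark describes might be sketched like this. This is purely illustrative Python (the real scheme targets a JVM switch with instance variables for registers; `Reify` and the pc values are invented for the sketch):

```python
class Reify(Exception):
    """Stands in for the special Java exception that forces reification
    of L1 continuations on the way out of the generated chunk."""
    def __init__(self, pc, state):
        self.pc, self.state = pc, state

def chunk(pc, state):
    # One "case" per group of L2 instructions; invoking with a saved pc
    # re-enters exactly where execution left off.
    while True:
        if pc == 0:                     # main entry point
            state["acc"] = 0
            pc = 1
        elif pc == 1:
            state["acc"] += 1
            if state["acc"] < 3:
                continue                # loop within the same case
            pc = 2
        elif pc == 2:
            # Pretend something (deep Java stack, interrupt) forces
            # reification the first time through:
            if not state.get("reified"):
                state["reified"] = True
                raise Reify(pc=3, state=state)
            pc = 3
        elif pc == 3:
            return state["acc"]         # the "return" lands here on re-entry

try:
    result = chunk(0, {})
except Reify as e:
    # To do an Avail "return" into the code after reification, simply
    # invoke the chunk again, passing the saved L2 program counter.
    result = chunk(e.pc, e.state)
```

Because the saved state plus pc fully determine where to resume, the same trick supports re-entering a continuation more than once, which is what enables the backtracking behavior mentioned above.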