By the way, I went ahead and added an L2_RESTART_CONTINUATION instruction, and got primitive P_058_RestartContinuation (the “Restart_” primitive) to generate it. It avoids having to create the calling continuation, since the primitive was just going to throw that away anyhow. It all works, and seemed to speed up the system by a decent amount, but I can’t be certain. I’ll check it in with my implementation of super calls when they’re fully working.
> On Jan 18, 2015, at 12:58 PM, Robbert van Dalen <[email protected]> wrote:
>
> Mark,
>
> Thank you for taking the time to write such an informative post.
>
> That said, Avail's layered compiler technology is currently beyond my powers to fully comprehend. I want to learn, but there is so much other stuff to take in.
>
> From what I understand so far, Avail’s dynamic compiler appears to be a strongly typed version of Self’s. That’s a big achievement, because Self is still considered to embody some of the most advanced compiler technology to date (next to the JVM’s -server compiler, which originated from Self).
>
> I also like Avail’s ‘everything immutable’ as a default. It is very interesting to see that pervasive use of immutability leads to such a clean VM design (whilst still enabling efficient compilation). Remember, 'all things immutable' is my pet peeve.
>
> cheers,
> Robbert.
>
>> On 18 Jan 2015, at 00:06, Mark van Gulik <[email protected]> wrote:
>>
>> I used to think of the technique I came up with as continuation-passing style, but over the years I’ve come to think of it as something different. It’s closer to the style of machine code, where we get to decide how to use the instructions to implement function calls.
>>
>> Level one was designed to be both trivial to work with and (when combined with primitives) computationally complete. Since control-flow constructs like subroutine calls always seem to fall short when attempting to add features like exception handling, coroutines, and backtracking, I decided to build them all from the same collection of parts. The key feature was an immutable representation of continuations. I could very easily build up conditional flow from polymorphic dispatch, and build loops from continuation invocation. Let’s explore loops for a moment. Here’s the implementation of “Repeat_”, which simply executes a function repeatedly, forever.
>>
>> Public method "Repeat_" is
>> [
>>     action : []→⊤
>> |
>>     $loop;
>>     action();
>>     Restart loop
>> ] : ⊥;
>>
>> The “$loop;” declares a label, meant to be similar in style to a label in assembly language. What it actually does in Avail is more powerful: it creates a “copy” of the continuation stack and assigns it to a variable (constant) named “loop”. After the action is executed, the “Restart_” primitive is invoked, passing that continuation as an argument. The very next level one instruction executed after the restart captures a new “loop” continuation, because the continuation that got constructed was positioned at the very start of “Repeat_”’s method body.
>>
>> So that loops forever, or until something else causes the fiber’s current continuation to be hijacked. Because the method never actually returns, it has type bottom (⊥). Similarly, the call “Restart loop” has type bottom, which is why there is no semicolon after it. Only statements are followed by semicolons (a statement being an expression of type top (⊤), such as the invocation of the top-valued function “action”).
>>
>> If we wanted to be able to conditionally exit from that loop, we could simply perform “Exit loop” at some point. Then we would have to change the return type of “Repeat_” to top. Alternatively, if we want to exit with some value, we would strengthen “Repeat_”’s return type to any (or something stronger), and invoke “Exit_with_” instead, supplying the value to return from the call to “Repeat_”.
>>
>> This is a somewhat roundabout way of building a looping control structure, but I wanted to make sure the basic mechanism had maximum generality. In languages like Eiffel, there is no way to exit multiple loops and conditionals in a single step. Java allows named control structures, and Avail essentially does the same by allowing user-created code full access to labels.
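As a loose sketch (Python, not Avail; the exception classes are hypothetical stand-ins, since real Avail labels copy the whole continuation stack rather than unwinding), the label-and-restart loop above, together with the “Exit_with_” escape, behaves roughly like this:

```python
class RestartLoop(Exception):
    """Transfers control back to the captured label (like "Restart loop")."""
    pass

class ExitLoop(Exception):
    """Leaves the loop, optionally with a value (like "Exit_with_")."""
    def __init__(self, value=None):
        self.value = value

def repeat(action):
    """Run `action` over and over; only an ExitLoop escapes.

    Real "Repeat_" has return type bottom and never returns normally;
    returning a value here mirrors strengthening the return type and
    invoking "Exit_with_".
    """
    while True:               # "$loop;": each iteration re-enters here
        try:
            action()          # action(); then "Restart loop" -> next pass
        except RestartLoop:
            continue          # restart: jump back to the label
        except ExitLoop as e:
            return e.value    # exit the loop with a value

# Usage: run the action three times, then exit with a value.
counter = []
def step():
    counter.append(1)
    if len(counter) == 3:
        raise ExitLoop("done")

result = repeat(step)
```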
>>
>> Conditional dispatch used to be done inside an “If_then_else_” primitive, but now that we’re inlining the dispatch tree, we simply use a method with two primitive implementations: one whose first argument is always true (typed as true’s type) and evaluates the second argument, and one whose first argument is always false and evaluates the third argument. The level two code generates a test-and-branch to the false case, an invocation of the second argument, a jump past the false case, and then the false case: an invocation of the third argument, falling through to the same place the true case jumped. However, there’s almost nothing special about booleans and the conditional primitives. We even use a similar technique for short-circuited “and” and “or”. In theory, an Avail user could define an extended boolean type that includes a new atom “maybe”, and have it run both alternatives. The VM does produce the true and false atoms when, say, comparing two numbers, but in theory the use of “maybe” would still be perfectly type-safe.
>>
>> So because we use primitives and polymorphic dispatch to implement loops and conditionals, we don’t actually need any branches or jumps in the level one code. Therefore level one contains no control-flow nybblecodes other than L1_doCall. Since every function that doesn’t exit early simply returns the last value produced, we don’t even bother with an L1 return instruction. We used to, but since it only ever occurred as the last instruction of a function, it was pointless. Now it’s simply implied at the end of every function. I’m currently in the process of adding L1_doSuperCall (technically L1Ext_doSuperCall, since it requires the extension nybblecode), which also required me to create L1Ext_doGetType and L1Ext_doMakeTupleAndType. A little while ago I added L1_doPermute for argument reordering.
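The two-implementation conditional described above can be sketched in Python (a hedged illustration, not Avail: here the boolean values True/False stand in for Avail’s instance types “true’s type” and “false’s type”, and the registry is hypothetical). Note that a user could register a third case without touching the dispatcher itself:

```python
# Dispatch table keyed by the exact condition value, standing in for
# Avail's polymorphic dispatch on the condition's instance type.
implementations = {}

def implement(condition_type):
    """Register one primitive implementation for one condition type."""
    def register(fn):
        implementations[condition_type] = fn
        return fn
    return register

@implement(True)
def _true_case(_condition, then_block, _else_block):
    return then_block()       # first argument statically known true

@implement(False)
def _false_case(_condition, _then_block, else_block):
    return else_block()       # first argument statically known false

def if_then_else(condition, then_block, else_block):
    # One ordinary method call; the "branch" is just dispatch.
    return implementations[condition](condition, then_block, else_block)

chosen = if_then_else(3 < 4, lambda: "yes", lambda: "no")
```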
>> A few years back we added L1Ext_doDuplicate to allow inline assignments like “Print: someTuple[(x := 2+2)]”. We don’t support it in the built-in grammar currently, but our macro tests are able to generate it. It’s there to make it easier to support other languages. The complete list of level one instructions is found in the enumeration L1Operation (and listed as methods in L1OperationDispatcher).
>>
>> I’m sure you see how level one provides a super-clean layer for Avail code. In theory, someone could write a compliant Avail interpreter that only ever interprets these nybblecodes. In fact, all the Avail tools (e.g., a debugger) that we plan to implement will only need to operate on the level one instruction set.
>>
>> However, in our current implementation there’s a layer below the visible surface called level two (L2). Where the L1 instructions operate on a stack, L2 is designed as a stackless register-transfer language with an infinite number of non-windowed registers. Optimizing L1 into alternative L1 would be pointless, since it would be unable to do much more than fold the occasional primitive, and couldn’t even *represent* code in which a method dispatch was inlined. But L2 is designed to be easy to optimize and inline. Since even basic control flow is done as method calls in Avail, we need to inline to get back some of the performance that we lost by keeping L1 so clean. So L2 has jumping, testing-and-branching, continuation creation, continuation “exploding” (extracting each field of the continuation into registers), and the idea of off-ramps and on-ramps.
>>
>> Say we’re running some L2 code for function F (called the L2 chunk for F), and another fiber wants to add an implementation of some method that F calls. Say we have a dispatch tree inlined in F’s L2 representation. How do we coordinate the change so that the new method implementation is available in the current fiber? Off-ramps.
>> We make it the responsibility of the L2Translator (which converts L1 to L2) to explicitly poll for interrupt conditions at various points in the L2 code. If the polling code sees the need to get into an L1-consistent state (say because someone wants to add a method implementation, or just to allow another fiber a chance to run for a while), it branches to some L2 code (an “off-ramp”) that is able to construct a suitable continuation (because that’s the code that the translator created), stuff it back into the running fiber, and process an interrupt. While there are tasks queued that need to run while all fibers are in an L1-consistent state (e.g., adding a method implementation), the thread pool prevents fiber-execution tasks from running, and signals the existing fibers that someone wants L1 consistency. When the last running fiber becomes L1-consistent, all tasks requiring L1 consistency are executed. When the last one completes, the fibers are made runnable again, at which point they may diverge into a cloud of L2 instructions, safe in the assumption that methods are not changing.
>>
>> When a method implementation gets added (during an L1-consistent time), it iterates through its dependent L2 chunks and marks them as invalid. When the fibers continue running again, they check whether they’re in an invalid chunk, and if so, they switch to the default unoptimized L2 chunk (which is always valid, as it doesn’t depend on any methods). The other side of the equation is that when a dispatch tree for a call to method M is written into chunk C, chunk C gets added to the set of dependencies of method M. Then we have nice, happy safety guarantees for running L2 code without much fuss, including deoptimization for dynamic changes.
>>
>> We make it the translator’s job to ensure L1 consistency at certain times.
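The dependency bookkeeping just described can be sketched in a few lines of Python (all class and method names here are hypothetical stand-ins, not the VM’s actual API): a chunk that inlines a dispatch tree for method M registers itself as a dependent of M; adding an implementation invalidates every dependent chunk, which then falls back to the always-valid default chunk.

```python
DEFAULT_CHUNK = "default-unoptimized"   # never depends on any method

class Chunk:
    def __init__(self, name):
        self.name = name
        self.valid = True

class Method:
    def __init__(self):
        self.implementations = []
        self.dependent_chunks = set()

    def inline_into(self, chunk):
        # The translator wrote a dispatch tree for this method into `chunk`,
        # so `chunk` must be invalidated if the method ever changes.
        self.dependent_chunks.add(chunk)

    def add_implementation(self, impl):
        # In the real VM this happens only at an L1-consistent safe point.
        self.implementations.append(impl)
        for chunk in self.dependent_chunks:
            chunk.valid = False         # deoptimize dependents
        self.dependent_chunks.clear()

def chunk_to_run(chunk):
    # A resuming fiber checks validity and falls back if needed.
    return chunk.name if chunk.valid else DEFAULT_CHUNK

m = Method()
f_chunk = Chunk("F-optimized")
m.inline_into(f_chunk)
assert chunk_to_run(f_chunk) == "F-optimized"
m.add_implementation(lambda: None)      # invalidates F's chunk
```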
>> Right now that only happens right after a non-primitive call (i.e., when entering a non-primitive function), which should be more than adequate granularity, especially since we don’t have any long-running blocking primitives. Since the L2 instruction that checks for an interrupt and the L2 instructions that construct the reified L1 continuation(s) are ordinary L2 code, they’re subject to the same kinds of optimizations as the rest of the L2 chunk. Since interrupts are supposed to be the infrequent case, and since continuations are conceptually immutable, once we support inlining of non-primitives we’ll be able to postpone creation of a whole series of stack frames (continuations) representing inlined functions. Technically it’s not postponing; it’s pushing the responsibility for creating the frames at all into the off-ramp code, so that if an interrupt doesn’t happen (the very likely case), the stack frames never need to be created.
>>
>> Now here’s the most fun part: since the continuations are just data structures, we can use a general object escape analysis to optimize them, including eliding their creation in many cases. And we can detect when a “Restart_” invocation is of a continuation that we created in the same chunk, converting it into a jump (and maybe a few register writes). Similarly, function closures are just data structures that bind values to raw functions, so anywhere such a closure “travels” inside an L2 chunk via inlining, we should be able to use the variables/values directly from the registers that supplied the values that went into constructing the closure, rather than extracting them from the closure. If we end up eliminating all value extractions from the closure and it doesn’t escape from the L2 chunk, the closure construction operation becomes a dead write to the register holding the closure.
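The dead-write payoff can be sketched over a toy IR (hypothetical instruction names, not real L2; and “never read anywhere” is a crude stand-in for a genuine escape analysis):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    op: str                        # e.g. "make_closure", "move", "return"
    dest: Optional[str] = None     # register written, if any
    sources: Tuple[str, ...] = ()  # registers read
    side_effect: bool = False      # True for calls, stores, etc.

def eliminate_dead_closures(code):
    """Drop closure constructions whose result register is never read.

    A real pass would prove the closure doesn't escape the chunk; here
    "no instruction reads the destination register" approximates that.
    """
    read = {r for ins in code for r in ins.sources}
    return [
        ins for ins in code
        if ins.side_effect
        or ins.op != "make_closure"
        or ins.dest in read        # closure still consumed: keep it
    ]

chunk = [
    # Inlining let every use take its values straight from other
    # registers, so nothing ever reads r1: a dead register write.
    Instr("make_closure", dest="r1", sources=("r0",)),
    Instr("move", dest="r3", sources=("r2",)),
    Instr("return", sources=("r3",)),
]
optimized = eliminate_dead_closures(chunk)
```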
>> Since it had no side effects beyond creating and writing that closure into a register, we automatically remove that L2 instruction, saving the cost of constructing the closure. As with continuations, the closure is just a data structure, so we get this benefit automatically by implementing general object escape analysis.
>>
>> We don’t have most of these optimizations in place yet, but they’re “ripe for the picking” once we do non-primitive inlining.
>>
>> Now, consider what will be below L2. The L2 code will already have had “high-level optimizations” performed on it, such as eliminating the creation of objects that don’t escape, canceling primitive sequences (e.g., converting <a,b>[2] into b), and performing other semantically valid rewrites (e.g., replacing s∪s with s). If we “transliterate” the L2 instructions through a simple translator into JVM or LLVM code, we can leverage most of the low-level optimizations of those substrates, while eliminating the L2 interpretation costs (fetch an L2 instruction, dispatch it megamorphically, repeat). For the JVM, I’m thinking we’ll collect the registers used by an L2 chunk and create corresponding instance variables for them. The chunk can be structured as a big switch statement with fall-throughs, one case per L2 instruction, starting with the main entry point at case 1. If we return back into the L2 structure, we will usually do so through the Java stack. If the Java stack gets too deep (say more than 20 stack levels), or truly reified L1 continuations have to be constructed for some other reason, we can throw a special Java exception. The generated code can be wrapped in try/catch handlers that cause reification of the L1 continuations on the way out. In order to do an Avail “return” into the code after such a reification, you simply invoke it again, passing an argument containing the L2 program counter.
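A minimal sketch of that invoke-again-with-a-program-counter scheme (Python standing in for generated JVM code; the three “instructions” are made up for illustration):

```python
def run_chunk(pc, x):
    """Execute a toy three-instruction chunk starting at L2 pc `pc`.

    Hypothetical instruction bodies, not real L2:
      0: x = 10      1: x = x + 5      2: return x
    Re-entering after reification just means calling again with the
    saved pc, as if we had never left.
    """
    while True:                # stands in for the big switch statement
        if pc == 0:
            x = 10
            pc = 1             # fall through to the next case
        elif pc == 1:
            x += 5
            pc = 2
        elif pc == 2:
            return x

fresh = run_chunk(0, None)     # normal entry at the chunk's start
resumed = run_chunk(1, 100)    # "return" into the middle of the chunk
```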
>> The switch then takes you to the correct spot, just as if you had never returned in the first place. That gives the performance benefit of using the Java stack the vast majority of the time, without its usual limitations (when the Avail code wants to use deep recursion). It even allows backtracking behavior, like returning into/from the same continuation multiple times. I’m guessing that even before we have non-primitive inlining, we should see a several-fold improvement in performance by generating JVM code from L2.
>>
>> One last thing. When we inline a non-primitive function, we won’t look at its L1 nybblecodes and recompute all the same optimizations that we figured out the last time we inlined that function. Instead, we can grab the function’s L2 implementation and directly embed it. There will often be new optimization opportunities apparent when we recompute the type information for the *specific* types of the arguments supplied at the call site, but the *general* optimizations will already have been performed. If the function we’re inlining had other functions inlined inside it, that’s even more work we don’t have to do again. We’ll still get an opportunity to further optimize everything with the potentially stronger types of the arguments (such as eliminating now-unreachable cases from dispatch logic). And later, if someone wants to inline a call to the function that we’re inlining into, they’ll save having to redo all that hard work.
>>
>> The L1 and L2 ideas date back to the early ’90s. I started by writing L1 in Smalltalk (VisualWorks 2.0), then got an L2 running in Smalltalk as well. Then I built a mechanical translator that was able to translate almost all of the Smalltalk code into C++ (including comments). To get this to work, I provided some hints and alternative translation directives in the Smalltalk code via an #equivalentC: method that did nothing in Smalltalk.
>> I then got the L1 interpreter running in C++. It was approximately the same speed as the L2 code running in Smalltalk. Technically, the C++ code was missing a few crucial pieces that Smalltalk supplied, like an incremental garbage collector. The AvailObject representation in Smalltalk was basically a smart pointer into the C++ heap, where all Avail objects were allocated. I tried for several years to get the L2 code running in C++, but due to a combination of buggy compilers, ZERO debugging support for C++, and, well, C++ being just so damned awful, I wasted most of a decade struggling pointlessly with it. I think it was 2009 or 2010 when Todd convinced me that Java was up to the task, at which point I reworked the translator to produce Java code instead of C++. At some point almost all of the code was in Java and looked mostly right (but with *thousands* of compilation errors), and Todd convinced me to pull the plug on Smalltalk. We probably would have saved a bit of effort by throwing another month or two at tweaking the translator, but a combination of Eclipse, thousands of passes with regexes, eyeballs, and sore fingers got things running in just a few months. I have archives of the Smalltalk and C++ code, but they’re just a historical curiosity at this point. The moment Todd convinced me the Java translation was good enough, the Java code became the master copy, which we began editing by hand.
>>
>> The mechanical translator was only ever written in Smalltalk, and I have no intention of reviving it for any purpose within Avail. An L2 -> JVM or LLVM translator will be a very different beast (note: no source code). However, at some point we will write another translator to translate our Java code that implements the Avail VM into a pidgin Avail called Avalanche.
>> Avalanche code is not intended to be directly executed; we’ll scan through all the compiled code (and parse trees) to produce LLVM code, yielding a new Avail VM. When that VM runs well enough, we’ll archive the Java source, just like we did for the Smalltalk source. Then we’ll edit Avalanche code when we want to update the VM. Java has been an ENORMOUS leap above Smalltalk and C++ (especially C++), but we’re expecting much better performance from LLVM. Static profile-guided optimization of much of the VM will certainly help the startup time. We’ll also be able to get a custom garbage collector running again in LLVM to deal with the unique opportunities that Avail provides.
>>
>> If you’d like, I could walk you through the L2 translation of the “Repeat_” method. I’m sure it would be interesting.
>