Mark,

Thank you for taking the time to write such an informative post.
That said, Avail's layered compiler technology is currently beyond my powers to fully comprehend. I want to learn, but there is so much other stuff to take in. From what I understand so far, Avail’s dynamic compiler appears to be a strongly typed version of Self’s. That’s a big achievement, because Self is still considered to embody some of the most advanced compiler technology to date (next to the JVM -server compiler, which originated from Self). I also like Avail’s ‘everything immutable’ as a default. It is very interesting to see that pervasive use of immutability leads to such a clean VM design (whilst still enabling efficient compilation). Remember, ‘all things immutable’ is my pet topic.

cheers,
Robbert.

> On 18 Jan 2015, at 00:06, Mark van Gulik <[email protected]> wrote:
>
> I used to think of the technique I came up with as continuation-passing style, but over the years I’ve come to think of it as something different. It’s closer to the style of machine code, where we get to decide how to use the instructions to implement function calls.
>
> Level one was designed to be both trivial to work with and (when combined with primitives) computationally complete. Since control flow constructs like subroutine calls always seem to fall short when attempting to add features like exception handling, coroutines, and backtracking, I decided to build them all from the same collection of parts. The key feature was an immutable representation of continuations. I could very easily build up conditional flow from polymorphic dispatch, and build loops from continuation invocation. Let’s explore loops for a moment. Here’s the implementation of “Repeat_”, which simply executes a function repeatedly, forever.
>
> Public method "Repeat_" is
> [
> action : []→⊤
> |
> $loop;
> action();
> Restart loop
> ] : ⊥;
>
> The “$loop;” declares a label, meant to be similar in style to a label in assembly language.
> What it actually does in Avail is more powerful: it creates a “copy” of the continuation stack and assigns it to a variable (constant) named “loop”. After the action is executed, the “Restart_” primitive is invoked, passing that continuation as an argument. The very next level one instruction that will be executed after the restart is capturing a new “loop” continuation – because the continuation that got constructed was actually positioned at the very start of the “Repeat_” method’s body.
>
> So that loops forever, or until something else causes the fiber’s current continuation to be hijacked. Because the method never actually returns, it has type bottom (⊥). Similarly, the call “Restart loop” has type bottom, which is why there is no semicolon after it. Only statements are followed by semicolons (a statement being an expression of type top (⊤), such as the invocation of the top-valued function “action”).
>
> If we wanted to be able to conditionally exit from that loop, we could simply perform “Exit loop” at some point. Then we would have to change the return type of “Repeat_” to top. Alternatively, if we want to exit with some value, we would strengthen “Repeat_”’s return type to any (or something stronger), and invoke “Exit_with_” instead, supplying the value to return from the call to “Repeat_”.
>
> This is a somewhat roundabout way of building a looping control structure, but I wanted to make sure the basic mechanism had maximum generality. In languages like Eiffel, there is no way to exit multiple loops and conditionals in a single step. Java allows named control structures, and Avail essentially does the same, allowing user-created code full access to labels.
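The label/Restart/Exit mechanism described above might be sketched roughly as follows. This is an illustrative Python model only (the names `with_label`, `RestartSignal`, and `ExitSignal` are invented here, and Python exceptions stand in for Avail's immutable continuations):

```python
class RestartSignal(Exception):
    """Models Restart_: re-enter the labeled frame from its start."""

class ExitSignal(Exception):
    """Models Exit_with_: return from the labeled frame with a value."""
    def __init__(self, value=None):
        self.value = value

def with_label(body):
    """Run `body`, passing it `restart` and `exit_` operations on the
    continuation captured at the 'label' (the top of this loop)."""
    def restart():
        raise RestartSignal()
    def exit_(value=None):
        raise ExitSignal(value)
    while True:                      # each iteration re-enters at the "label"
        try:
            return body(restart, exit_)
        except RestartSignal:
            continue                 # resume from the captured label
        except ExitSignal as e:
            return e.value           # unwind to the label's caller

# A conditional loop built from the same parts as Repeat_ plus Exit_with_:
counter = [0]
def body(restart, exit_):
    counter[0] += 1
    if counter[0] == 5:
        exit_(counter[0])
    restart()

result = with_label(body)   # counts to 5, then exits with that value
```

Unlike this sketch, Avail's labels capture a genuine copy of the continuation stack, so "restart" and "exit" can cross any number of intervening frames in one step, which is what makes the mechanism more general than named loops.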
> Conditional dispatch used to be done inside an “If_then_else_” primitive, but now that we’re inlining the dispatch tree, we simply use a method with two primitive implementations: one whose first argument is always true (typed as true’s type) and evaluates the second argument, and one whose first argument is always false and evaluates the third argument. The level two code generates a test-and-branch to the false case, an invocation of the second argument, a jump past the false case, and then the false case: an invocation of the third argument, falling through to the same place the true case jumped. However, there’s almost nothing special about booleans and the conditional primitives. We even use a similar technique for short-circuited “and” and “or”. In theory, an Avail user could define an extended boolean type that includes a new atom “maybe”, and have it run both alternatives. The VM does produce the true and false atoms when, say, comparing two numbers, but in theory the use of “maybe” would still be perfectly type-safe.
>
> So because we use primitives and polymorphic dispatch to implement loops and conditionals, we don’t actually need to do any branches or jumps in the level one code. Therefore level one contains no control flow nybblecodes, other than L1_doCall. Since every function that doesn’t exit early simply returns the last value produced, we don’t even bother with an L1 return instruction. We used to, but since it only ever occurred as the last instruction of a function, it was pointless. Now it’s simply implied at the end of every function. I’m currently in the process of adding L1_doSuperCall (technically L1Ext_doSuperCall, since it requires the extension nybblecode), which also required me to create L1Ext_doGetType and L1Ext_doMakeTupleAndType. A little while ago I added L1_doPermute for argument reordering.
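The two-implementation conditional described above can be modeled as a tiny multimethod. In this hedged Python sketch, dispatch is keyed on the condition value itself rather than on "true's type" vs. "false's type" as in Avail, and all names here are illustrative:

```python
# Implementations keyed by the condition, mimicking a method with one
# implementation for true's type and one for false's type.
_if_then_else_impls = {}

def implement(key, fn):
    _if_then_else_impls[key] = fn

def if_then_else(condition, then_thunk, else_thunk):
    # Pure dispatch: no branch instruction here, just a table lookup.
    return _if_then_else_impls[condition](then_thunk, else_thunk)

# Implementation whose first argument is always true: run the 2nd argument.
implement(True, lambda then_thunk, else_thunk: then_thunk())
# Implementation whose first argument is always false: run the 3rd argument.
implement(False, lambda then_thunk, else_thunk: else_thunk())

# A user-level extension of the "boolean": a `maybe` atom that runs both
# alternatives, without the dispatcher treating booleans specially.
MAYBE = object()
implement(MAYBE, lambda then_thunk, else_thunk: (then_thunk(), else_thunk()))

x = if_then_else(3 < 4, lambda: "then", lambda: "else")
```

The point of the design is visible even in the toy: the conditional is an ordinary open method, so adding "maybe" requires no VM support, and an optimizer that inlines the dispatch table recovers an ordinary test-and-branch.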
> A few years back we added L1Ext_doDuplicate to allow inline assignments like “Print: someTuple[(x := 2+2)]”. We don’t support it in the built-in grammar currently, but our macro tests are able to generate it. It’s there to make it easier to support other languages. The complete list of level one instructions is found in the enumeration L1Operation (and listed as methods in L1OperationDispatcher).
>
> I’m sure you see how level one provides a super-clean layer for Avail code. In theory, someone could write a compliant Avail interpreter that only ever interprets these nybblecodes. In fact, all the Avail tools (e.g., a debugger) that we plan to implement will only need to operate on the level one instruction set.
>
> However, in our current implementation there’s a layer below the visible surface called level two (L2). Where the L1 instructions operate on a stack, L2 is designed as a stackless register-transfer language with an infinite number of non-windowed registers. Optimizing L1 into alternative L1 would be pointless, since it would be unable to do much more than fold the occasional primitive, and couldn’t even *represent* code in which a method dispatch was inlined. But L2 is designed to be easy to optimize and inline. Since even basic control flow is done as method calls in Avail, we need to inline to get back some of the performance that we lost by keeping L1 so clean. So L2 has jumping, testing-and-branching, continuation creation, continuation “exploding” (extracting each field of the continuation into registers), and the idea of off-ramps and on-ramps.
>
> Say we’re running some L2 code for function F (called the L2 chunk for F), and another fiber wants to add an implementation of some method that F calls. Say we have a dispatch tree inlined in F’s L2 representation. How do we coordinate the change so that the new method implementation is available in the current fiber? Off-ramps.
> We make it the responsibility of the L2Translator (which converts L1 -> L2) to explicitly poll for interrupt conditions at various points in the L2 code. If the polling code sees the need to get into an L1-consistent state (say, because someone wants to add a method implementation, or just to allow another fiber a chance to run for a while), it branches to some L2 code (an “off-ramp”) that is able to construct a suitable continuation (because that’s the code that the translator created), stuff it back into the running fiber, and process an interrupt. While there are tasks queued that need to run while all fibers are in an L1-consistent state (e.g., adding a method implementation), the thread pool prevents fiber-execution tasks from running, and signals the existing fibers that someone wants L1-consistency. When the last running fiber becomes L1-consistent, all tasks requiring L1 consistency are executed. When the last one completes, the fibers are made runnable again, at which point they may diverge into a cloud of L2 instructions, safe in the assumption that methods are not changing.
>
> When a method implementation gets added (during an L1-consistent time), it iterates through its dependent L2 chunks and marks them as invalid. When the fibers continue running again, they check whether they’re in an invalid chunk, and if so, they switch to the default unoptimized L2 chunk (which is always valid, as it doesn’t have a dependency on any methods). The other side of the equation is that when a dispatch tree for a call to method M is written into chunk C, chunk C gets added to the set of dependencies of method M. Then we have nice, happy safety guarantees for running L2 code without much fuss, including deoptimization for dynamic changes.
>
> We make it the translator’s job to ensure L1 consistency at certain times.
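The dependency/invalidation bookkeeping described above might look like the following sketch. Class and field names here are invented for illustration, not the VM's actual ones, and the concurrency machinery (thread pool, L1-consistency barrier) is omitted:

```python
class Chunk:
    """An optimized L2 chunk; `valid` is cleared when a method it
    inlined a dispatch tree for gains a new implementation."""
    def __init__(self, name, optimized=True):
        self.name = name
        self.valid = True
        self.optimized = optimized

# The default unoptimized chunk depends on no methods, so it is always valid.
DEFAULT_CHUNK = Chunk("default-unoptimized", optimized=False)

class Method:
    def __init__(self):
        self.implementations = []
        self.dependent_chunks = set()  # chunks with this method's dispatch tree inlined

    def record_dependency(self, chunk):
        # Called when a dispatch tree for this method is written into `chunk`.
        self.dependent_chunks.add(chunk)

    def add_implementation(self, impl):
        # Performed only at an L1-consistent time; invalidate dependents.
        self.implementations.append(impl)
        for chunk in self.dependent_chunks:
            chunk.valid = False
        self.dependent_chunks.clear()

def resume(fiber_chunk):
    # When a fiber resumes, deoptimize to the default chunk if invalidated.
    return fiber_chunk if fiber_chunk.valid else DEFAULT_CHUNK

m = Method()
c = Chunk("F's chunk")
m.record_dependency(c)      # chunk C inlined a dispatch tree for method M
m.add_implementation("new") # invalidates C; fibers in C fall back on resume
```

The invariant being modeled: a chunk is only ever executed while every method whose dispatch it inlined is unchanged since translation.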
> Right now that only happens right after a non-primitive call (i.e., when entering a non-primitive function), which should be more than adequate granularity, especially since we don’t have any long-running blocking primitives. Since the L2 instruction that checks for an interrupt and the L2 instructions that construct the reified L1 continuation(s) are ordinary L2 code, they’re subject to the same optimizations as the rest of the L2 chunk. Since interrupts are supposed to be the infrequent case, and since continuations are conceptually immutable, once we support inlining of non-primitives we’ll be able to postpone creation of a whole series of stack frames (continuations) representing inlined functions. Technically it’s not postponing; it’s pushing the responsibility for creating the frames at all into the off-ramp code, so that if an interrupt doesn’t happen (the very likely case), the frames never need to be created.
>
> Now here’s the most fun part: since the continuations are just data structures, we can use a general object escape analysis to optimize them, including eliding their creation in many cases. And we can detect when a “Restart_” invocation is of a continuation that we created in the same chunk, converting it into a jump (and maybe a few register writes). Similarly, function closures are just data structures that bind values with raw functions, so anywhere that such a closure “travels” inside an L2 chunk via inlining, we should be able to simply use the variables/values directly from the registers that provided the values that went into the construction of the closure, rather than extract them from the closure. If we end up eliminating all value-extractions from the closure and it doesn’t escape from the L2 chunk, well, the closure construction operation becomes a dead write to the register holding the closure.
> Since it had no side-effects beyond creating and writing that closure into a register, we automatically remove that L2 instruction, saving the cost of constructing the closure. As with continuations, the closure is just a data structure, so we get this benefit automatically by implementing general object escape analysis.
>
> We don’t have most of these optimizations in place yet, but they’re “ripe for the picking” once we do non-primitive inlining.
>
> Now, consider what will be below L2. The L2 code will already have had “high-level optimizations” performed on it, such as eliminating the creation of objects that don’t escape, canceling primitive sequences (e.g., converting <a,b>[2] into b), and performing other semantically valid rewriting (e.g., replacing s∪s with s). If we “transliterate” the L2 instructions through a simple translator into JVM or LLVM code, we can leverage most of the low-level optimizations of those substrates. Better yet, we’ll eliminate the L2 interpretation costs (fetch an L2 instruction, dispatch it megamorphically, repeat). For the JVM I’m thinking we’ll collect the registers used by an L2 chunk and create corresponding instance variables for them. The chunk can be structured as a big switch statement with fall-throughs, one case per L2 instruction, starting with the main entry point at case 1. If we return back into the L2 structure, we will usually do so through the Java stack. If the Java stack gets too deep (say, more than 20 stack levels), or truly reified L1 continuations have to be constructed for some other reason, we can throw a special Java exception. The generated code can be wrapped in try/catch handlers that cause reification of the L1 continuations on the way out. In order to do an Avail “return” into the code after such a reification, you simply invoke it again, passing an argument containing the L2 program counter.
> The switch then takes you to the correct spot, as though you had never returned in the first place. That gives the performance benefit of using the Java stack the vast majority of the time, without its usual limitations (when the Avail code wants to use deep recursion). It even allows backtracking behavior, like returning into/from the same continuation multiple times. I’m guessing that even before we have non-primitive inlining we should see a several-fold improvement in performance by generating JVM code from L2.
>
> One last thing. When we inline a non-primitive function, we won’t look at its L1 nybblecodes and recompute all the same optimizations that we figured out the last time we inlined that function. Instead, we can grab the function’s L2 implementation and directly embed it. There will often be new optimization opportunities apparent when we recompute the type information for the *specific* types of the arguments supplied at the call site, but the *general* optimizations will already have been performed. If the function we’re inlining had other functions inlined inside it, that’s even more work that we don’t have to do again. We’ll still get an opportunity to further optimize everything with the potentially stronger types of the arguments (such as eliminating now-unreachable cases from dispatch logic). And later, if someone wants to inline a call to the function that we’re inlining into, they’ll save having to redo all that hard work.
>
> The L1 and L2 ideas date back to the early ’90s. I started by writing L1 in Smalltalk (VisualWorks 2.0), then got an L2 running in Smalltalk as well. Then I built a mechanical translator that was able to translate almost all of the Smalltalk code into C++ (including comments). To get this to work, I provided some hints and alternative translation directives in the Smalltalk code via an #equivalentC: method that did nothing in Smalltalk.
> I then got the L1 interpreter running in C++. It was approximately the same speed as the L2 code running in Smalltalk. Technically, the C++ code was missing a few crucial pieces that Smalltalk supplied, like an incremental garbage collector. The AvailObject representation in Smalltalk was basically a smart pointer into the C++ heap, where all Avail objects were allocated. I tried for several years to get the L2 code to run in C++, but due to a combination of buggy compilers, ZERO debugging support for C++, and, well, C++ being just so damned awful, I wasted most of a decade struggling pointlessly with it. I think it was 2009 or 2010 when Todd convinced me that Java was up to the task, at which point I reworked the translator to produce Java code instead of C++. At some point almost all of the code was in Java and looked mostly right (but with *thousands* of compilation errors), and Todd convinced me to pull the plug on Smalltalk. We probably would have saved a bit of effort by throwing another month or two at tweaking the translator, but a combination of Eclipse, thousands of passes with regexes, eyeballs, and sore fingers got things running in just a few months. I have archives of the Smalltalk and C++ code, but they’re just a historical curiosity at this point. The moment Todd convinced me the Java translation was good enough, the Java code became the master copy, which we began editing by hand.
>
> The mechanical translator was only ever written in Smalltalk, and I have no intention of reviving it for any purpose within Avail. An L2 -> JVM or LLVM translator will be a very different beast (note: no source code). However, at some point we will write another translator to translate the Java code that implements the Avail VM into a pidgin Avail called Avalanche.
> Avalanche code is not intended to be directly executed; we’ll scan through all the compiled code (and parse trees) to produce LLVM code for a new Avail VM. When that VM runs well enough, we’ll archive the Java source, just like we did for the Smalltalk source. Then we’ll edit Avalanche code when we want to update the VM. Java has been an ENORMOUS leap above Smalltalk and C++ (especially C++), but we’re expecting much better performance from LLVM. Static profile-guided optimization of much of the VM will certainly help the startup time. We’ll also be able to get a custom garbage collector running again in LLVM to deal with the unique opportunities that Avail provides.
>
> If you’d like, I could walk you through the L2 translation of the “Repeat_” method. I’m sure it would be interesting.
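The "big switch on the L2 program counter" re-entry scheme Mark describes might be sketched like this. This is purely illustrative Python (the real scheme targets a JVM switch with instance variables for registers; `Reify` and the pc values are invented for the sketch):

```python
class Reify(Exception):
    """Stands in for the special Java exception that forces reification
    of L1 continuations on the way out of the generated chunk."""
    def __init__(self, pc, state):
        self.pc, self.state = pc, state

def chunk(pc, state):
    # One "case" per group of L2 instructions; invoking with a saved pc
    # re-enters exactly where execution left off.
    while True:
        if pc == 0:                     # main entry point
            state["acc"] = 0
            pc = 1
        elif pc == 1:
            state["acc"] += 1
            if state["acc"] < 3:
                continue                # loop within the same case
            pc = 2
        elif pc == 2:
            # Pretend something (deep Java stack, interrupt) forces
            # reification the first time through:
            if not state.get("reified"):
                state["reified"] = True
                raise Reify(pc=3, state=state)
            pc = 3
        elif pc == 3:
            return state["acc"]         # the "return" lands here on re-entry

try:
    result = chunk(0, {})
except Reify as e:
    # To do an Avail "return" into the code after reification, simply
    # invoke the chunk again, passing the saved L2 program counter.
    result = chunk(e.pc, e.state)
```

Because the saved state plus pc fully determine where to resume, the same trick supports re-entering a continuation more than once, which is what enables the backtracking behavior mentioned above.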