Hi Robbert,

Some primitives will be (or already are) inlined into somewhat more lightweight instructions than a primitive invocation. Those specialized L2 instruction implementations directly invoke subprimitives (methods on AvailObject), rather than using the primitive mechanism at all. L3 should be able to get rid of the generic argument passing and simply invoke these subprimitives with values from specific L3 local variables (e.g., Java local variables), leaving to the substrate the problem of how to lift some of these variables into CPU registers.
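To make the distinction concrete, here's a tiny Java sketch of the difference between the generic primitive path (a boxed argument list) and a specialized invocation that reads straight from locals. The class and method names here are invented for illustration; they are not the VM's actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: generic primitive dispatch versus a specialized
// L2-style instruction that invokes the subprimitive directly.
public class InlinedPrimitiveSketch
{
	// Stand-in for a boxed Avail value; not a real VM class.
	static final class FakeTuple
	{
		final int[] elements;
		FakeTuple (final int... elements) { this.elements = elements; }
		// The subprimitive: a direct method, analogous to AvailObject#tupleAt.
		int tupleAt (final int index) { return elements[index - 1]; }
	}

	// Generic path: arguments travel through a boxed argument list, the
	// way the interpreter's primitive mechanism passes them.
	static int genericInvoke (final List<Object> args)
	{
		final FakeTuple tuple = (FakeTuple) args.get(0);
		final int index = (Integer) args.get(1);
		return tuple.tupleAt(index);
	}

	// Specialized path: the translated code just reads the values from
	// (Java) locals and calls the subprimitive — no list, no boxing.
	static int specializedInvoke (final FakeTuple tuple, final int index)
	{
		return tuple.tupleAt(index);
	}

	public static void main (final String[] args)
	{
		final FakeTuple t = new FakeTuple(10, 20, 30);
		System.out.println(genericInvoke(Arrays.asList(t, 2)));  // 20
		System.out.println(specializedInvoke(t, 3));             // 30
	}
}
```

The substrate's JIT can then keep `tuple` and `index` in registers, which the boxed list path largely prevents.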
For infallible primitives (and primitives that are VM-guaranteed not to fail for the types of arguments being provided), some of the generic primitive invocation machinery is omitted by L2 infallible primitive invokers. Even most ordinary fallible primitives can run without reifying the top continuation unless they happen to fail that time (branching past the reification code on success).

At some point we'll be addressing:
- L3 primitive calling conventions
- inlining non-primitives
- common subexpression elimination
- using int registers, with specialized paths that fail out to the general boxed objects when they overflow (that will sometimes require overflow checks in the machine code, but the branch predictions should be nearly perfect)
- more dedicated L2 instructions (which produce custom L3 code) for primitives that profiling shows are the worst bottlenecks

We'll also build a mechanism that avoids most construction of L1 continuations, using substrate (e.g., Java or LLVM) stack frames instead. If reification is needed, we'll throw a ReificationException. Each stack frame method (generated code) catches the exception, builds a suitable L1 continuation, then rethrows the exception. Those new (mutable) continuations are then assembled (in reverse) to build the real continuation chain. We'll throw the exception when we need to do a non-inlined Restart or Exit (e.g., for backtracking), to throw an Avail exception, when the substrate's stack just gets too deep (say 30 calls, for safety), to allow a fiber context switch (explicit yield, timeslice preemption, timer expiration, fiber parking for synchronization), or to allow an L1-safe operation to run (e.g., adding method definitions, which can invalidate L2 code, which therefore must not be running). We can either say those reified frames must continue running as interpreted L1, or provide an entry point that specifies the L1 pc where execution should continue (exploding data from the continuation back into local variables).
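Here's a minimal Java sketch of that catch/build/rethrow pattern. The names (ReificationException, the pc values, the string-labeled frames) are stand-ins for illustration, not the VM's actual classes; it only shows how frame descriptions accumulate innermost-first as the exception unwinds, ready to be assembled in reverse into the continuation chain:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the reification scheme: each generated method
// catches the exception, records a continuation for its own frame, and
// rethrows; a handler at the base assembles the chain.
public class ReificationSketch
{
	static final class ReificationException extends RuntimeException
	{
		// Frames are appended innermost-first as the exception unwinds.
		final Deque<String> frames = new ArrayDeque<>();
	}

	static void outer ()
	{
		try
		{
			inner();
		}
		catch (final ReificationException e)
		{
			// Build a continuation describing this frame, then rethrow.
			e.frames.addLast("outer@pc=7");
			throw e;
		}
	}

	static void inner ()
	{
		try
		{
			// Pretend the stack got too deep, or a fiber switch was requested.
			throw new ReificationException();
		}
		catch (final ReificationException e)
		{
			e.frames.addLast("inner@pc=3");
			throw e;
		}
	}

	public static Deque<String> run ()
	{
		try
		{
			outer();
			throw new AssertionError("unreachable");
		}
		catch (final ReificationException e)
		{
			// Innermost first: [inner@pc=3, outer@pc=7].
			return e.frames;
		}
	}

	public static void main (final String[] args)
	{
		System.out.println(run());  // [inner@pc=3, outer@pc=7]
	}
}
```

On the success path none of this runs, so the common case pays only for the try/catch scaffolding, which modern JVMs make essentially free until a throw actually happens.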
A switch statement on the L1 pc should be useful. This mechanism allows function invocation to simply *be* the function calls (or method invocations) of the substrate... but without the limitations that are normally implied (bounded stack, no backtracking, exception processing at the whim of the substrate language).

With that in mind, we may rewrite some particularly "hot" tuple manipulation code by allocating an array on the stack (C's alloca) to hold the tuple elements. That only works if the tuple operations can all be rewritten to equivalent array operations. If the conceptual tuple never escapes due to a ReificationException, we never have to build a tuple for it — or even garbage collect its storage, since it was stack-allocated.

The immutability of tuples will also in theory allow us to collapse together passes of (stable, pure) mapping functions and filters. An untimely reification can just pretend the operations hadn't started yet, or use tree and subrange mechanisms to "fake" an intermediate state when possible. Subranges and tree tuples might also be useful for secretly parallelizing (stable, pure) operations on very large tuples. Massively multi-core CPUs are coming soon, and that would give us a nice boost.

Finally, based on the snippet you included, we could save some of the cost of the binary search of the child subtuple index in tree tuples by keeping a "finger" slot. We would read the int finger field (without synchronization), then check if our requested subscript would fall within the child tuple at that finger index. If not, we would do the binary search to find the subtuple, then write its subscript back into the finger field (skipping synchronization). Looping sequentially over a tree tuple would be a bit quicker, but random access would involve only a few extra comparisons. I'm sure the current lookup cost is thoroughly dominated by other code, but it might be a pathway for you to try tweaking the VM.
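To make the finger idea concrete, here's a hypothetical Java sketch. The field and method names are made up for illustration (the real TreeTupleDescriptor stores its children and cumulative sizes differently), but the read-check-search-writeback shape is the one described above:

```java
// Hypothetical sketch of a "finger" slot for tree tuples. A plain int
// field is fine without synchronization: any stale or torn-free value
// read by another thread is still a valid child index to test.
public class FingerSketch
{
	static final class TreeTuple
	{
		final int[][] children;       // flat subtuples
		final int[] cumulativeSizes;  // cumulativeSizes[i] = sizes of children 0..i
		int finger = 0;               // last child index used; races are benign

		TreeTuple (final int[][] children)
		{
			this.children = children;
			this.cumulativeSizes = new int[children.length];
			int sum = 0;
			for (int i = 0; i < children.length; i++)
			{
				sum += children[i].length;
				cumulativeSizes[i] = sum;
			}
		}

		int tupleAt (final int index)
		{
			// First try the finger: does index fall inside that child?
			int child = finger;
			int offset = child == 0 ? 0 : cumulativeSizes[child - 1];
			if (index <= offset || index > cumulativeSizes[child])
			{
				// Miss: binary search for the child containing index,
				// then write it back (skipping synchronization).
				child = binarySearchForChild(index);
				offset = child == 0 ? 0 : cumulativeSizes[child - 1];
				finger = child;
			}
			return children[child][index - offset - 1];
		}

		private int binarySearchForChild (final int index)
		{
			int low = 0;
			int high = children.length - 1;
			while (low < high)
			{
				final int mid = (low + high) >>> 1;
				if (index <= cumulativeSizes[mid]) { high = mid; }
				else { low = mid + 1; }
			}
			return low;
		}
	}

	public static void main (final String[] args)
	{
		final TreeTuple t =
			new TreeTuple(new int[][] {{1, 2, 3}, {4, 5}, {6, 7, 8, 9}});
		System.out.println(t.tupleAt(5));  // 5 (miss: search, finger becomes 1)
		System.out.println(t.tupleAt(4));  // 4 (hit: same child as last time)
		System.out.println(t.tupleAt(9));  // 9 (miss: search again)
	}
}
```

Sequential loops hit the finger on almost every access; random access pays at most two extra comparisons before falling back to the search it would have done anyway.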
:-)

Alternatively, you could extract buffers of data from the tree tuple, then process those flat, fast buffers. I believe we try to do that when we need a sequence of byte buffers in the VM (e.g., for file I/O). The buffer-filling operations figure out which subtrees need to transfer their bytes into which regions of the buffer, then recurse on each child. The leaves (flat tuples) have customized tight loops for doing their part of the work. This technique avoids most of the tree-traversal cost.

You could also look at the way our tuple iterators work (in Java), which is similar. Make sure to make the tuple immutable if you're thinking of stashing an opportunistic <iterator, subscript> pair in the tree tuple instead of a finger. Also be aware that the tuple iterators are not thread-safe.

Enjoy!
-Mark

> On Mar 5, 2015, at 3:58 PM, Robbert van Dalen <[email protected]> wrote:
>
> Hi,
>
> I'm interested to know whether Avail's (L3?) compiler will be capable of
> inlining higher order functions that traverse tuples in a tight loop.
> Especially I wonder whether map and fold (*the* 'big data' operators) can be
> efficiently compiled and implemented with the current set of primitives.
>
> Currently, indexing a tuple is a VM primitive that has the following
> (interpreter) implementation:
>
> (P_131_TupleAt.java)
>
> public Result attempt (
>     final List<AvailObject> args,
>     final Interpreter interpreter,
>     final boolean skipReturnCheck)
> {
>     assert args.size() == 2;
>     final A_Tuple tuple = args.get(0);
>     final A_Number indexObject = args.get(1);
>     if (!indexObject.isInt())
>     {
>         return interpreter.primitiveFailure(E_SUBSCRIPT_OUT_OF_BOUNDS);
>     }
>     final int index = indexObject.extractInt();
>     if (index > tuple.tupleSize())
>     {
>         return interpreter.primitiveFailure(E_SUBSCRIPT_OUT_OF_BOUNDS);
>     }
>     return interpreter.primitiveSuccess(tuple.tupleAt(index));
> }
>
> (TreeTupleDescriptor.java)
>
> AvailObject o_TupleAt (final AvailObject object, final int index)
> {
>     final int childSubscript = childSubscriptForIndex(object, index);
>     final int offset = offsetForChildSubscript(object, childSubscript);
>     final A_Tuple child = object.slot(SUBTUPLE_AT_, childSubscript);
>     return child.tupleAt(index - offset);
> }
>
> How would all this code be translated to L3 and aggressively inlined at the
> L3 level, so that traversing and indexing tuples in a tight loop gets
> maximally inlined?
>
> cheers,
> Robbert.
>
