Hi Robbert,

Some primitives will be (or already are) inlined into somewhat more lightweight 
instructions than a primitive invocation.  Those specialized L2 instruction 
implementations directly invoke subprimitives (methods on AvailObject), rather 
than use the primitive mechanism at all.  L3 should be able to get rid of the 
generic argument passing and simply invoke these subprimitives with values from 
specific local L3 variables (e.g., Java local variables), leaving to the 
substrate the problem of how to lift some of these variables into CPU registers.
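
As a rough sketch (with invented names, not the real Avail VM API), the difference between the generic boxed-argument primitive path and a specialized direct call might look like:

```java
// Hypothetical sketch only: contrasts the generic primitive path, where
// arguments arrive boxed in a list, with a specialized call that takes
// its arguments in plain locals the substrate can lift into registers.
final class DirectCallSketch
{
	// Generic path: boxed arguments in a list, as in the quoted
	// P_131_TupleAt below.
	static Object genericAttempt (final java.util.List<Object> args)
	{
		final Object[] tuple = (Object[]) args.get(0);
		final int index = (Integer) args.get(1);
		return tuple[index - 1];  // one-based, like Avail tuples
	}

	// Specialized path: L3 code invokes the subprimitive directly with
	// values from specific local variables.
	static Object directTupleAt (final Object[] tuple, final int index)
	{
		return tuple[index - 1];
	}
}
```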

For infallible primitives (and primitives that are VM-guaranteed not to fail 
for the types of arguments being provided), some of the generic primitive 
invocation machinery is omitted by L2 infallible primitive invokers.  Even most 
ordinary fallible primitives can run without reifying the top continuation 
unless it happens to fail that time (branching past the reification code on 
success).
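
In sketch form (invented names again, just the shape of the success-fast path):

```java
// Hypothetical sketch of a fallible primitive's fast path: only the
// (rare) failure branch would pay for reifying the top continuation.
final class FalliblePathSketch
{
	static int latestResult;

	static boolean tryDivide (final int a, final int b)
	{
		if (b == 0)
		{
			return false;  // primitive failure
		}
		latestResult = a / b;
		return true;
	}

	static int invoke (final int a, final int b)
	{
		if (tryDivide(a, b))
		{
			return latestResult;  // branch past the reification code
		}
		// Only here would the top continuation be reified and the
		// failure handler run.
		throw new ArithmeticException("would reify and run failure code");
	}
}
```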

At some point we'll be addressing:
- L3 primitive calling conventions
- inlining non-primitives
- common subexpression elimination
- using int registers, with specialized paths that fall back to the general 
boxed objects when they overflow.  That will sometimes require overflow checks 
in the machine code, but the branch predictions should be nearly perfect.
- more dedicated L2 instructions (which produce custom L3 code) for primitives 
that profiling shows us are the worst bottlenecks.
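
For the int-register item, here's a minimal Java sketch of the shape, with BigInteger standing in for Avail's boxed arbitrary-precision integers:

```java
// Sketch of the overflow-checked fast path: stay in an int "register"
// until the overflow check fails, then fall back to boxed arithmetic.
// BigInteger is a stand-in for Avail's extended integer objects.
final class OverflowSketch
{
	static Object add (final int a, final int b)
	{
		try
		{
			// Compiles down to an add plus an overflow branch; the
			// branch is almost never taken, so prediction is cheap.
			return Math.addExact(a, b);
		}
		catch (final ArithmeticException e)
		{
			return java.math.BigInteger.valueOf(a).add(
				java.math.BigInteger.valueOf(b));
		}
	}
}
```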

We'll also build a mechanism that avoids most construction of L1 continuations, 
using substrate (e.g., Java or LLVM) stack frames instead.  If reification is 
needed, we'll throw a ReificationException.  Each stack frame method (generated 
code) catches the exception, builds a suitable L1 continuation, then rethrows 
the exception.  Those new (mutable) continuations are then assembled (in 
reverse) to build the real continuation chain.  We'll throw the exception when 
we need to perform a non-inlined Restart or Exit (e.g., for backtracking), 
throw an Avail exception, unwind a substrate stack that has gotten too deep 
(say 30 calls, for safety), allow a fiber context switch (explicit yield, 
timeslice preemption, timer expiration, fiber parking for synchronization), or 
allow an L1-safe operation to run (e.g., adding method definitions, which can 
invalidate L2 code, which therefore must not be running).
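
The catch-record-rethrow shape looks roughly like this (a toy sketch; strings stand in for the real mutable continuation objects):

```java
// Hypothetical sketch of exception-based reification.  Each substrate
// frame catches the exception, records its own continuation, and
// rethrows; the outermost handler reverses the innermost-first records
// to recover the caller-to-callee chain.
final class ReifySketch
{
	static final class ReificationException extends RuntimeException
	{
		final java.util.List<String> reified = new java.util.ArrayList<>();
	}

	static void leaf ()
	{
		throw new ReificationException();
	}

	static void middle ()
	{
		try
		{
			leaf();
		}
		catch (final ReificationException e)
		{
			e.reified.add("middle's continuation");
			throw e;
		}
	}

	static java.util.List<String> outer ()
	{
		try
		{
			middle();
		}
		catch (final ReificationException e)
		{
			e.reified.add("outer's continuation");
			// Assemble in reverse to get the real chain's order.
			java.util.Collections.reverse(e.reified);
			return e.reified;
		}
		throw new AssertionError("unreachable");
	}
}
```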

We can either say those reified frames must continue running as interpreted L1, 
or provide an entry point to specify the L1 pc where it should continue 
(exploding data from the continuation back into local variables).  A switch 
statement on the L1 pc should be useful.
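
For the switch idea, a toy sketch (the three "instructions" are invented stand-ins, not real L1 nybblecodes):

```java
// Sketch of the resume-at-pc idea: explode the continuation's slots back
// into locals, then switch on the saved L1 pc, deliberately falling
// through so execution continues from exactly that point.
final class ResumeSketch
{
	static int resume (final int pc, final int[] slots)
	{
		int acc = slots[0];
		final int x = slots[1];
		switch (pc)
		{
			case 0:
				acc = 0;        // pc 0: clear the accumulator
				// fall through
			case 1:
				acc = acc + x;  // pc 1: add the argument
				// fall through
			case 2:
				acc = acc * 2;  // pc 2: double the result
		}
		return acc;
	}
}
```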

This mechanism allows function invocation to simply *be* the function calls (or 
method invocations) of the substrate... but without the limitations that are 
normally implied (bounded stack, no backtracking, exception processing at the 
whim of the substrate language).

With that in mind, we may rewrite some particularly "hot" tuple manipulation 
code by allocating an array on the stack (C's alloca) to hold the tuple 
elements.  That only works if the tuple operations can all be rewritten to 
equivalent array operations.  If the conceptual tuple never escapes due to a 
ReificationException, we never have to build a tuple for it — or even garbage 
collect its storage, since it was stack-allocated.
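
In Java terms (where the JIT's escape analysis plays the role of alloca), the rewrite might look like:

```java
// Sketch of the alloca idea: a local array stands in for the conceptual
// tuple.  If it never escapes (and no ReificationException surfaces
// mid-loop), the JIT's escape analysis may keep it off the heap
// entirely, and no tuple object is ever built.
final class ScratchTupleSketch
{
	static int sumOfDoubled (final int[] source)
	{
		final int[] scratch = new int[source.length];  // conceptual tuple
		for (int i = 0; i < source.length; i++)
		{
			scratch[i] = source[i] * 2;  // tuple-at-put -> array store
		}
		int sum = 0;
		for (final int v : scratch)
		{
			sum += v;  // tuple-at -> array load
		}
		return sum;  // scratch never escapes
	}
}
```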

The immutability of tuples will also in theory allow us to collapse together 
passes of (stable, pure) mapping functions and filters.  An untimely 
reification can just pretend the operations hadn't started yet, or use tree and 
subrange mechanisms to "fake" an intermediate state when possible.  Subranges 
and tree tuples might also be useful for secretly parallelizing (stable, pure) 
operations on very large tuples.  Massively multi-core CPUs are coming soon, 
and that would give us a nice boost.
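
The fusion is easy to see in miniature.  Because the intermediate tuple is immutable and the functions are assumed stable and pure, the two passes below are observably equivalent to the single fused loop:

```java
// Sketch of pass fusion: a map pass followed by a filter pass collapses
// into one loop when both functions are stable and pure.
final class FusionSketch
{
	static java.util.List<Integer> twoPasses (final int[] in)
	{
		final java.util.List<Integer> mapped = new java.util.ArrayList<>();
		for (final int x : in)
		{
			mapped.add(x * x);  // map pass
		}
		final java.util.List<Integer> out = new java.util.ArrayList<>();
		for (final int y : mapped)
		{
			if (y % 2 == 0)
			{
				out.add(y);  // filter pass
			}
		}
		return out;
	}

	static java.util.List<Integer> fused (final int[] in)
	{
		final java.util.List<Integer> out = new java.util.ArrayList<>();
		for (final int x : in)
		{
			final int y = x * x;  // fused map + filter, no intermediate
			if (y % 2 == 0)
			{
				out.add(y);
			}
		}
		return out;
	}
}
```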

Finally, based on the snippet you included, we could save some of the cost of 
the binary search of the child subtuple index in tree tuples by keeping a 
"finger" slot.  We would read the int finger field (without synchronization), 
then check if our requested subscript would fall within the child tuple at that 
finger index.  If not, we would do the binary search to find the subtuple, then 
write its subscript back into the finger field (skipping synchronization).  
Looping sequentially over a tree tuple would be a bit quicker, but random 
access would involve only a few extra comparisons.  I'm sure the current lookup 
cost is thoroughly dominated by other code, but it might be a pathway for you 
to try tweaking the VM.  :-)
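
Here's the shape of that trick (hypothetical layout; the real TreeTupleDescriptor stores its offsets differently):

```java
// Hypothetical finger cache for a tree tuple's child lookup.  The
// finger is a plain int read and written without synchronization; a
// stale value is a benign race that only costs a redundant search.
final class FingerSketch
{
	// cumulativeEnds[i] = last one-based index covered by child i.
	final int[] cumulativeEnds;

	int finger = 0;  // last child found

	FingerSketch (final int[] cumulativeEnds)
	{
		this.cumulativeEnds = cumulativeEnds;
	}

	int childSubscriptForIndex (final int index)
	{
		final int f = finger;  // unsynchronized read
		final int start = f == 0 ? 0 : cumulativeEnds[f - 1];
		if (index > start && index <= cumulativeEnds[f])
		{
			return f;  // finger hit: skip the binary search
		}
		final int child = binarySearch(index);
		finger = child;  // unsynchronized write-back
		return child;
	}

	private int binarySearch (final int index)
	{
		int lo = 0;
		int hi = cumulativeEnds.length - 1;
		while (lo < hi)
		{
			final int mid = (lo + hi) >>> 1;
			if (cumulativeEnds[mid] < index)
			{
				lo = mid + 1;
			}
			else
			{
				hi = mid;
			}
		}
		return lo;
	}
}
```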

Alternatively, you could extract buffers of data from the tree tuple, then 
process those flat, fast buffers.  I believe we try to do that when we need a 
sequence of byte buffers in the VM (e.g., for file I/O).  The buffer filling 
operations figure out which subtrees need to transfer their bytes into which 
regions of the buffer, then recurse on each child.  The leaves (flat tuples) 
have customized tight loops for doing their part of the work.  This technique 
avoids most of the tree traversal cost.
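
A toy version of that recursion, over a two-level "tree tuple" of bytes (the real descriptors are more elaborate):

```java
// Sketch of the buffer-filling idea: each branch decides which children
// overlap the requested one-based range and recurses; leaves copy their
// part with a tight loop (System.arraycopy here).
final class TransferSketch
{
	abstract static class Node
	{
		abstract int size ();
		abstract void transferTo (int start, int end, byte[] buffer, int offset);
	}

	static final class Leaf extends Node
	{
		final byte[] bytes;
		Leaf (final byte[] bytes) { this.bytes = bytes; }
		int size () { return bytes.length; }
		void transferTo (
			final int start, final int end, final byte[] buffer, final int offset)
		{
			System.arraycopy(bytes, start - 1, buffer, offset, end - start + 1);
		}
	}

	static final class Branch extends Node
	{
		final Node left;
		final Node right;
		Branch (final Node left, final Node right)
		{
			this.left = left;
			this.right = right;
		}
		int size () { return left.size() + right.size(); }
		void transferTo (
			final int start, final int end, final byte[] buffer, final int offset)
		{
			final int leftSize = left.size();
			if (start <= leftSize)
			{
				left.transferTo(start, Math.min(end, leftSize), buffer, offset);
			}
			if (end > leftSize)
			{
				final int copiedFromLeft = Math.max(0, leftSize - start + 1);
				right.transferTo(
					Math.max(start, leftSize + 1) - leftSize,
					end - leftSize,
					buffer,
					offset + copiedFromLeft);
			}
		}
	}
}
```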

You could also look at the way our tuple iterators work (in Java), which is 
similar.  Be sure to make the tuple immutable if you're thinking of stashing 
an opportunistic <iterator, subscript> pair in the tree tuple instead of a 
finger.  Also be aware that the tuple iterators are not thread-safe.

Enjoy!
-Mark



> On Mar 5, 2015, at 3:58 PM, Robbert van Dalen <[email protected]> wrote:
> 
> Hi,
> 
> I’m interested to know whether Avail’s (L3?) compiler will be capable of 
> inlining higher order functions that traverse tuples in a tight loop.
> Especially I wonder whether map and fold (*the* 'big data’ operators) can be 
> efficiently compiled and implemented with the current set of primitives.
> 
> Currently, indexing a tuple is a VM primitive that has the following 
> (interpreter) implementation:
> 
> (P_131_TupleAt.java)
> 
>       public Result attempt (
>               final List<AvailObject> args,
>               final Interpreter interpreter,
>               final boolean skipReturnCheck)
>       {
>               assert args.size() == 2;
>               final A_Tuple tuple = args.get(0);
>               final A_Number indexObject = args.get(1);
>               if (!indexObject.isInt())
>               {
>                       return interpreter.primitiveFailure(E_SUBSCRIPT_OUT_OF_BOUNDS);
>               }
>               final int index = indexObject.extractInt();
>               if (index > tuple.tupleSize())
>               {
>                       return interpreter.primitiveFailure(E_SUBSCRIPT_OUT_OF_BOUNDS);
>               }
>               return interpreter.primitiveSuccess(
>                       tuple.tupleAt(index));
>       }
> 
> (TreeTupleDescriptor.java)
> 
>       AvailObject o_TupleAt (final AvailObject object, final int index)
>       {
>               final int childSubscript = childSubscriptForIndex(object, index);
>               final int offset = offsetForChildSubscript(object, childSubscript);
>               final A_Tuple child = object.slot(SUBTUPLE_AT_, childSubscript);
>               return child.tupleAt(index - offset);
>       }
> 
> How would all this code be translated to L3 and aggressively inlined at the 
> L3 level, so that traversing and indexing tuples in a tight loop gets 
> maximally inlined?
> 
> cheers,
> Robbert.
> 
