Mark,

Many thanks for your elaborate reply. The buffer/finger approach sounds 
promising to speed things up.
What I’m after is to efficiently support map and fold, considering the 
following (pseudo) Java code:

        interface JTuple<T>
        {
                public <R> JTuple<R> map (Function<T,R> m);
                public <R> R fold (R init, BiFunction<R,T,R> f);
        }

// Int version

        interface IntJTuple extends JTuple<Integer>
        {
                public <R> JTuple<R> map (IntFunction<R> m);
                public <R> R fold (R init, IntBiFunction<R,R> f);
        }

… the downside (as always) is that we need to implement the combinatorial 
explosion over all of Java’s primitive types.

That said, an efficient implementation of JTuple would have a native backing 
Java array.
Another implementation would have a backing balanced (n-way) tree.
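
A minimal sketch of the array-backed variant (illustrative names, and a fold 
that takes an initial accumulator):

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative array-backed tuple; this is a sketch, not Avail's API.
final class ArrayJTuple<T> {
    private final Object[] elements;

    ArrayJTuple(Object[] elements) { this.elements = elements.clone(); }

    @SuppressWarnings("unchecked")
    <R> ArrayJTuple<R> map(Function<T, R> m) {
        // One tight pass over the flat backing array.
        Object[] out = new Object[elements.length];
        for (int i = 0; i < elements.length; i++) {
            out[i] = m.apply((T) elements[i]);
        }
        return new ArrayJTuple<>(out);
    }

    @SuppressWarnings("unchecked")
    <R> R fold(R init, BiFunction<R, T, R> f) {
        // Left fold from an explicit initial accumulator.
        R acc = init;
        for (Object e : elements) {
            acc = f.apply(acc, (T) e);
        }
        return acc;
    }
}
```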

So I guess what I’m asking is whether something similar can be done in Avail?

Can tupleMap and tupleFold be added as primitive methods to A_Tuple?
From what I’ve learned so far, that’s probably not an option.

cheers,
Robbert.

> On 06 Mar 2015, at 19:55, Mark van Gulik <[email protected]> wrote:
> 
> Hi Robbert,
> 
> Some primitives will be (or already are) inlined into somewhat more 
> lightweight instructions than a primitive invocation.  Those specialized L2 
> instruction implementations directly invoke subprimitives (methods on 
> AvailObject), rather than use the primitive mechanism at all.  L3 should be 
> able to get rid of the generic argument passing and simply invoke these 
> subprimitives with values from specific local L3 variables (e.g., Java local 
> variables), leaving to the substrate the problem of how to lift some of these 
> variables into CPU registers.
> 
> For infallible primitives (and primitives that are VM-guaranteed not to fail 
> for the types of arguments being provided), some of the generic primitive 
> invocation machinery is omitted by L2 infallible primitive invokers.  Even 
> most ordinary fallible primitives can run without reifying the top 
> continuation unless it happens to fail that time (branching past the 
> reification code on success).
> 
> At some point we'll be addressing:
> -L3 primitive calling conventions
> -inlining non-primitives
> -common subexpression elimination
> -using int registers with specialized paths that fail out to the general 
> boxed objects when they overflow.  That will sometimes require overflow 
> checks in the machine code, but the branch predictions should be nearly 
> perfect.
> -more dedicated L2 instructions (which produce custom L3 code) for primitives 
> that profiling shows us are the worst bottlenecks.
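
[The overflow-checked int path above can be sketched with Math.addExact for 
the fast path; BigInteger stands in here for Avail's general boxed integers, 
and the names are illustrative:]

```java
import java.math.BigInteger;

// Sketch: fast unboxed add that fails over to a boxed representation on
// overflow.  The overflow branch is almost never taken, so it predicts well.
final class CheckedAdd {
    static Object add(int a, int b) {
        try {
            return Math.addExact(a, b);      // fast path: stays in an int
        } catch (ArithmeticException overflow) {
            // Slow path: escape to the general boxed representation.
            return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
        }
    }
}
```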
> 
> We'll also build a mechanism that avoids most construction of L1 
> continuations, using substrate (e.g., Java or LLVM) stack frames instead.  If 
> reification is needed, we'll throw a ReificationException.  Each stack frame 
> method (generated code) catches the exception, builds a suitable L1 
> continuation, then rethrows the exception.  Those new (mutable) continuations 
> are then assembled (in reverse) to build the real continuation chain.  We'll 
> throw the exception when we need to do a non-inlined Restart or Exit (e.g., 
> for backtracking), to throw an Avail exception, when the substrate's stack 
> just gets too deep (say 30 calls for safety), to allow a fiber context switch 
> (explicit yield, timeslice preemption, timer expiration, fiber parking for 
> synchronization), or to allow an L1-safe operation to run (e.g., adding 
> method definitions, which can invalidate L2 code, which therefore must not be 
> running).
> 
> We can either say those reified frames must continue running as interpreted 
> L1, or provide an entry point to specify the L1 pc where it should continue 
> (exploding data from the continuation back into local variables).  A switch 
> statement on the L1 pc should be useful.
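
[The catch/rebuild/rethrow protocol above might look roughly like this; all 
names are illustrative, and a String stands in for a real mutable 
continuation:]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reification protocol: each generated frame catches the
// exception, records a mutable continuation for itself, and rethrows so
// its caller can do the same.  Innermost frame lands first in the list,
// so the top-level handler assembles the chain in reverse.
class ReificationException extends RuntimeException {
    final List<String> continuations = new ArrayList<>();
}

final class Frames {
    static void inner() {
        // Suppose a fiber switch or L1-safe point forces reification here.
        throw new ReificationException();
    }

    static void outer() {
        try {
            inner();
        } catch (ReificationException e) {
            e.continuations.add("continuation for outer, pc=...");
            throw e;  // let the caller reify its own frame too
        }
    }
}
```

[A top-level handler would then walk the list in reverse, linking each 
continuation to its caller to rebuild the real continuation chain.]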
> 
> This mechanism allows function invocation to simply *be* the function calls 
> (or method invocation) of the substrate... but without the limitations that 
> are normally implied (bounded stack, no backtracking, exception processing at 
> the whim of the substrate language).
> 
> With that in mind, we may rewrite some particularly "hot" tuple manipulation 
> code by allocating an array on the stack (C's alloca) to hold the tuple 
> elements.  That only works if the tuple operations can all be rewritten to 
> equivalent array operations.  If the conceptual tuple never escapes due to a 
> ReificationException, we never have to build a tuple for it — or even garbage 
> collect its storage, since it was stack-allocated.
> 
> The immutability of tuples will also in theory allow us to collapse together 
> passes of (stable, pure) mapping functions and filters.  An untimely 
> reification can just pretend the operations hadn't started yet, or use tree 
> and subrange mechanisms to "fake" an intermediate state when possible.  
> Subranges and tree tuples might also be useful for secretly parallelizing 
> (stable, pure) operations on very large tuples.  Massively multi-core CPUs 
> are coming soon, and that would give us a nice boost.
> 
> Finally, based on the snippet you included, we could save some of the cost of 
> the binary search of the child subtuple index in tree tuples by keeping a 
> "finger" slot.  We would read the int finger field (without synchronization), 
> then check if our requested subscript would fall within the child tuple at 
> that finger index.  If not, we would do the binary search to find the 
> subtuple, then write its subscript back into the finger field (skipping 
> synchronization).  Looping sequentially over a tree tuple would be a bit 
> quicker, but random access would involve only a few extra comparisons.  I'm 
> sure the current lookup cost is thoroughly dominated by other code, but it 
> might be a pathway for you to try tweaking the VM.  :-)
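
[The finger idea above, sketched against a hypothetical tree-tuple layout 
where each node keeps a cumulative-end array for its children; field names 
are assumptions:]

```java
import java.util.Arrays;

// Sketch of a finger-cached child lookup for a tree tuple.
// cumulativeEnds[i] is the 1-based subscript of the last element in child i.
final class TreeTupleFinger {
    private final int[] cumulativeEnds;
    private int finger;  // last child index found; no synchronization

    TreeTupleFinger(int[] cumulativeEnds) {
        this.cumulativeEnds = cumulativeEnds;
    }

    int childIndexFor(int subscript) {
        int f = finger;  // benign race: a stale value only costs a re-search
        int start = (f == 0) ? 0 : cumulativeEnds[f - 1];
        if (subscript > start && subscript <= cumulativeEnds[f]) {
            return f;  // hit: no binary search needed
        }
        // Miss: binary search, then cache the result without synchronization.
        int i = Arrays.binarySearch(cumulativeEnds, subscript);
        int child = (i >= 0) ? i : -i - 1;
        finger = child;
        return child;
    }
}
```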
> 
> Alternatively, you could extract buffers of data from the tree tuple, then 
> process those flat, fast buffers.  I believe we try to do that when we need a 
> sequence of byte buffers in the VM (e.g., for file I/O). The buffer filling 
> operations figure out which subtrees need to transfer their bytes into which 
> regions of the buffer, then recurse on each child.  The leaves (flat tuples) 
> have customized tight loops for doing their part of the work.  This technique 
> avoids most of the tree traversal cost.
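
[The buffer-filling recursion above might look like this on a toy tree with 
flat leaves; the classes are illustrative, not the VM's actual types:]

```java
// Sketch: recursively copy a tree tuple's elements into one flat buffer,
// so later passes run over a contiguous array instead of the tree.
abstract class Node {
    abstract int size();
    abstract void fillBuffer(int[] buffer, int offset);
}

final class Leaf extends Node {
    final int[] data;
    Leaf(int[] data) { this.data = data; }
    int size() { return data.length; }
    void fillBuffer(int[] buffer, int offset) {
        // Leaves do a tight, flat copy into their slice of the buffer.
        System.arraycopy(data, 0, buffer, offset, data.length);
    }
}

final class Branch extends Node {
    final Node[] children;
    Branch(Node... children) { this.children = children; }
    int size() {
        int n = 0;
        for (Node c : children) n += c.size();
        return n;
    }
    void fillBuffer(int[] buffer, int offset) {
        // Each subtree figures out which region of the buffer it owns,
        // then recurses on its children.
        for (Node c : children) {
            c.fillBuffer(buffer, offset);
            offset += c.size();
        }
    }
}
```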
> 
> You could also look at the way our tuple iterators work (in Java), which is 
> similar.  Make sure to make the tuple immutable if you're thinking of 
> stashing an opportunistic <iterator, subscript> pair in the tree tuple 
> instead of a finger.  Also be aware that the tuple iterators are not 
> thread-safe.
> 
> Enjoy!
> -Mark
