The implementations of tuples are entirely my own (and Leslie's for the
interval and repeated tuples).
However, the tree implementation is "just" a BTree implementation with a
minimum non-root node size of 16 and a maximum size of 64. Normally a BTree
will have at most a ratio of 2 between min and max, but that's just to keep
disk pages nice and dense. No need for that in memory. But using a mechanism
like that rather than some sort of binary tree saves about half the memory cost
(and 90% of the L1 miss cost when it's cold), because each node uses a simple
array for its elements and subtrees. A tree tuple with a million elements will
have between four and five layers (the ceilings of log base 64 and log base 16
of a million). Each element has its usual reference pointing to it from its
leaf tuple, but if there are 16 elements in each leaf then the immediate parent
tree node is holding an eight-byte reference to the leaf, for an amortized cost
of half a byte (four bits) per element. Higher nodes contribute significantly
less amortized cost.
The leaf tuple's header and incidental fields (like the cached hash) contribute
a couple more bits per entry.
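To make the layout concrete, here's a minimal Java sketch of such a wide-fanout tuple tree. This is not Avail's actual TreeTuple; the class and field names are invented for illustration. Leaves hold their elements in a plain array, and each branch caches cumulative child sizes so an indexed lookup can binary-search for the child covering a given index.

```java
import java.util.Arrays;

// Hypothetical sketch of a wide-fanout persistent tuple tree.
abstract class TupleNode {
    abstract int size();
    abstract int get(int index);
}

final class LeafNode extends TupleNode {
    final int[] elements;  // plain array: dense, cache-friendly storage
    LeafNode(int[] elements) { this.elements = elements; }
    int size() { return elements.length; }
    int get(int index) { return elements[index]; }
}

final class BranchNode extends TupleNode {
    final TupleNode[] children;
    final int[] cumulativeSizes;  // cumulativeSizes[i] = total size of children 0..i
    BranchNode(TupleNode[] children) {
        this.children = children;
        this.cumulativeSizes = new int[children.length];
        int total = 0;
        for (int i = 0; i < children.length; i++) {
            total += children[i].size();
            cumulativeSizes[i] = total;
        }
    }
    int size() { return cumulativeSizes[cumulativeSizes.length - 1]; }
    int get(int index) {
        // Find the first child whose cumulative size exceeds index.
        int pos = Arrays.binarySearch(cumulativeSizes, index + 1);
        if (pos < 0) pos = -pos - 1;  // insertion point for a missed key
        int before = (pos == 0) ? 0 : cumulativeSizes[pos - 1];
        return children[pos].get(index - before);
    }
}
```

The binary search within a node touches a single small int array, which is what keeps the per-lookup cache-miss count so much lower than chasing pointer links in a binary tree.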
On the other hand, even a perfectly balanced binary tree has a leaf node for
every element, a parent node for every two, a grandparent for every four, etc.,
so there are about 2n nodes in total, roughly two per element. Assuming a tiny
eight-byte header, a leaf costs two words per element (header plus element
reference), its parent three words (header plus two child references) amortized
over two elements, and so on: 2 + 3/2 + 3/4 + 3/8 + ... = 5 eight-byte words
per element. So it adds 400% overhead beyond the one word per element of an
array representation. It also stresses branch
prediction to navigate downward to an element (although the binary searches
within an array are similar). And each node eats a separate entry in the
processor's on-chip L1 associative cache. And the cost for the garbage
collector to manage it goes up considerably. And the algorithms I've seen
would require constructing around twenty new nodes for a simple persistent
insert or replacement in a million-element tree. That's probably a bit
less space than an update of the 16-64 BTree; it's hard to say which would
really be more expensive overall.
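A toy persistent binary tree makes that path-copying cost easy to see: one functional replacement must copy every node from the root down to the affected leaf, about log2(n) + 1 fresh nodes, which is where the "around twenty" for a million elements comes from. This is a hypothetical illustration, not Avail's representation.

```java
// Toy persistent binary tree: one element per leaf, size-annotated
// inner nodes, and a functional replace that copies only the
// root-to-leaf path, sharing everything else with the old tree.
abstract class Bin {
    abstract int size();
}

final class Leaf extends Bin {
    final int value;
    Leaf(int value) { this.value = value; }
    int size() { return 1; }
}

final class Inner extends Bin {
    final Bin left, right;
    final int cachedSize;
    Inner(Bin left, Bin right) {
        this.left = left;
        this.right = right;
        this.cachedSize = left.size() + right.size();
    }
    int size() { return cachedSize; }
}

final class PathCopy {
    static int copies;  // nodes constructed by the most recent replace

    // Build a complete tree over n leaves (n must be a power of two).
    static Bin build(int n) {
        return (n == 1) ? new Leaf(0) : new Inner(build(n / 2), build(n / 2));
    }

    // Persistent replacement: the old tree is untouched; only the
    // root-to-leaf path (log2(n) inner nodes plus one leaf) is copied.
    static Bin replace(Bin t, int index, int value) {
        copies++;
        if (t instanceof Leaf) {
            return new Leaf(value);
        }
        Inner n = (Inner) t;
        int leftSize = n.left.size();
        return (index < leftSize)
            ? new Inner(replace(n.left, index, value), n.right)
            : new Inner(n.left, replace(n.right, index - leftSize, value));
    }
}
```

For 2^20 elements the copied path is 21 nodes of about three words each; a 16-64-wide tree update copies only four or five nodes, but each is a wider array, so the totals land in the same ballpark, as noted above.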
> On Apr 10, 2015, at 3:41 PM, Mark van Gulik <[email protected]> wrote:
>
> Excellent!
>
> As you point out, L2 and L3 are not there yet, but the base algorithms and
> representations are where we’ve focused our efforts so far... but there’s
> nothing stopping us from reaching the same performance some day by compiling
> down to the JVM, and we should be able to exceed it by compiling to LLVM. I
> feel LLVM will be much faster because the JVM has to meet specific safety
> concerns that we can take care of in other ways. For example, the bytecode
> verifier knows nothing about generics, and simply relies on the compiler
> producing a mostly right program, leaving dynamic casts all over the place.
> That’s probably a good thing, since Java generics aren’t even close to sound,
> so even if the linking architecture did anything with them it would still
> require lots of dynamic checks. Avail has some of the same issues with
> genericity, but I’m guessing our primitives already provide a much larger
> collection of useful guarantees than the JVM and JRE. And as soon as we
> start to inline generics we’ll get back not only the cost of those checks,
> but probably some vastly wider opportunities related to the semantics of the
> built-in types. Java just has integral types, garbage collected type-safe
> pointers, and structs with volatile and final annotations. It’s a great
> substrate for low-level programs, but it doesn’t really have any facilities
> for scaling up when these low-level parts are combined. Inlining and
> statistics-driven polymorphic dispatching are about it, but they mostly just
> keep programmers from doing stuff like inlining on their own (a goal of Avail
> that predates Java).
>
> Just so you’re aware, we’re expecting to get some of our best optimizations
> out of the looping, folding, and conditional operations, or really any
> operations that do control inversion by passing functions and/or
> continuations. We’ll strongly bias inlining decisions based on whether a
> function or continuation is being passed to external code. By preferring to
> inline those, we can not only fold away their construction, but also convert
> things like “Restart_” into mere jumps, opening the door for traditional loop
> optimizations.
>
>> On Apr 10, 2015, at 3:09 PM, Robbert van Dalen <[email protected]> wrote:
>>
>> Mark,
>>
>> I’ve ‘ported’ the TreeTuple implementation to Scala to learn more about its
>> inner workings.
>> I have yet to find a major performance issue: your TreeTuple code is pretty
>> fit!
>> I wonder: what is the source of the implementation, or did you invent it
>> yourself?
>>
>> Anyway, I’m doing some (performance) experiments with map and fold over
>> (Tree)Tuples.
>> The Scala version's performance is pretty good (using specialisation and
>> leveraging some JVM magic):
>>
>> Mapping over a 100 million element TreeTuple (add 1 to all elements) - using
>> 128-wide native array-backed sub-Tuples - takes ~5 seconds
>> Folding over the same TreeTuple (sum all elements) takes ~0.3 seconds.
>>
>> Of course, the Avail version is orders-of-magnitude slower as the L2 and L3
>> compilers are not yet there.
>> Consider that a benchmark for us to shoot at!
>>
>> cheers,
>> Robbert.
>