No, I think the cumulative sizes should actually stick around completely.  If 
the leaf (flat) tuples each contain 16 elements or more, the overhead for the 
cumulative sizes adds up to about half a byte per tuple element, but saves us 
about an order of magnitude of compares per lookup (although I'm pretty sure 
that doesn't dominate our costs anyhow).  As for L1 coherence, we could 
probably reorder them into a van Emde Boas layout or something to maximize the 
coherence of the reads.  Although adding a "finger" index should get us even 
better savings in almost any actually occurring usage.  That's just a slot that 
records which subtree index we accessed last time.  We check that first, and fall 
back on binary search if the current lookup isn't in that subtree's range — 
then save the index that we reached for next time, hoping to leverage some 
degree of coherence out of the access pattern.  Linear scans should save a 
boatload, and even things like quicksort usually have only a couple of active 
pointers into the structure, which move smoothly.  Everything except the top 
node would have a >90% hit rate.
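A sketch of the finger idea, with illustrative names (not actual Avail code; the finger is just a mutable cache slot on an otherwise immutable node):

```java
import java.util.Arrays;

// Sketch of the "finger": remember which child subtree the previous lookup
// landed in, test that child's range first, and only fall back to binary
// search over the cumulative sizes on a miss.  Names are illustrative.
final class TreeNode {
    final Object[] children;      // subtrees or flat leaf tuples
    final int[] cumulativeSizes;  // cumulativeSizes[i] = #elements in children[0..i]
    private int finger = 0;       // child index hit by the previous lookup (cache only)

    TreeNode(Object[] children, int[] cumulativeSizes) {
        this.children = children;
        this.cumulativeSizes = cumulativeSizes;
    }

    /** Return the child index whose subtree contains the zero-based element index. */
    int childFor(int index) {
        int f = finger;
        int low = (f == 0) ? 0 : cumulativeSizes[f - 1];
        if (index >= low && index < cumulativeSizes[f]) {
            return f;  // finger hit: no search needed
        }
        // Finger miss: binary search the cumulative sizes, then move the finger.
        int i = Arrays.binarySearch(cumulativeSizes, index);
        i = (i >= 0) ? i + 1 : -(i + 1);  // first slot with cumulativeSizes[i] > index
        finger = i;
        return i;
    }
}
```

Coherent access patterns (linear scans, a quicksort's few smoothly moving pointers) keep hitting the same child, so the binary search is skipped almost every time below the root.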


> On Apr 11, 2015, at 1:57 PM, Robbert van Dalen <[email protected]> wrote:
> 
> Mark,
> 
> Now that I’ve thought about it….
> Maybe we shouldn’t drop the cumulative size array entirely, but reduce it to 
> 3 slots (first element size, middle cumulative element size, last cumulative 
> element size).
> That way we efficiently cater for the standard use cases such as prepend, 
> append, and replacement.
> 
> What do you think?
> 
>> On 11 Apr 2015, at 21:40, Robbert van Dalen <[email protected]> wrote:
>> 
>> Having experimented with 5 different immutable Tuple implementations or so, 
>> I can say with confidence that your BTree implementation is superior to all 
>> of them in terms of memory footprint and L1 caching.
>> That said, the memory footprint can be reduced even further if the cumulative 
>> sizes array is dropped. Without it you’ll lose binary search (though only 
>> within a maximum-size 64-element tuple), but that isn’t too bad considering 
>> that the L1 cache doesn’t like binary searches anyway.
>> 
>> I’m also still not sure whether your BTree degenerates after certain 
>> patterns of repeated inserts, appends, slices, etc.
>> Or whether certain operations like append and slice are always O(log(n)) 
>> time, rather than worse in certain obscure corner cases.
>> I guess, theoretically, it should never degenerate *if* its invariant is 
>> always kept: i.e. that it must always be a constant height tree.
>> 
>> But what’s *really* interesting about Avail’s tuples is the choice of using 
>> an (incremental) hashing algorithm that is invariant to a tuple’s 
>> representation.
>> (Enchilada also has incremental hashes, but they are cryptographic which 
>> means that different Tuple representations of the same data end up with 
>> different hashes).
>> 
>> Unfortunately, in Avail, there are currently two tuple implementations - 
>> ReverseTuple and SmallIntervalTuple - that don’t have 
>> representation-invariant incremental hashes.
>> Because how would one quickly calculate the (reverse) hash of a ReverseTuple?
>> The same goes for a big interval of integers.
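One common way to get the representation-invariance described above is a polynomial hash over the element hashes: the hash of a concatenation is derivable from the two child hashes plus the right child's length, so every tree shape over the same elements agrees.  It also hints at why ReverseTuple is hard: the reversed sequence's hash is not derivable from the forward one.  A sketch with illustrative names (not Avail's actual hash function):

```java
// Polynomial hash over element hashes.  hash(a ++ b) depends only on
// hash(a), hash(b), and |b|, so any tree of leaves over the same element
// sequence computes the same hash incrementally.  Illustrative only.
final class PolyHash {
    static final long P = 1_000_003L;  // multiplier (an assumed choice)

    /** Hash a flat run of element hashes (here just ints). */
    static long hashOf(int[] elements) {
        long h = 0;
        for (int e : elements) h = h * P + e;  // wraps mod 2^64, consistently
        return h;
    }

    /** Combine two child hashes without revisiting the elements. */
    static long concat(long leftHash, long rightHash, int rightLen) {
        long scale = 1;
        for (int i = 0; i < rightLen; i++) scale *= P;  // P^rightLen (mod 2^64)
        return leftHash * scale + rightHash;
    }
}
```

Note that nothing here lets you derive `hashOf(reverse(t))` cheaply from `hashOf(t)`, which is exactly the ReverseTuple problem.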
>> 
>> cheers,
>> Robbert.
>> 
>> 
>>> On 11 Apr 2015, at 18:51, Mark van Gulik <[email protected]> wrote:
>>> 
>>> The implementations of tuples are entirely my own (and Leslie's for the 
>>> interval and repeated tuples).
>>> 
>>> However, the tree implementation is "just" a BTree implementation with a 
>>> minimum non-root node size of 16 and a maximum size of 64.  Normally a 
>>> BTree will have at most a ratio of 2 between min and max, but that's just 
>>> to keep disk pages nice and dense.  No need for that in memory.  But using 
>>> a mechanism like that rather than some sort of binary tree saves about half 
>>> the memory cost (and 90% of the L1 miss cost when it's cold), because it 
>>> uses a simple array for its elements and subtrees.  A tree 
>>> tuple with a million elements will have between 4 and 6 layers (log to the 
>>> bases 16 and 64).  Each element has its usual reference pointing to it from 
>>> its leaf tuple, but if there are 16 elements in each leaf then the 
>>> immediate parent tree node is holding an eight-byte reference to the leaf, 
>>> for an amortized cost of half a byte per element.  Higher nodes contribute 
>>> significantly less amortized cost.  The leaf tuple's header and incidental 
>>> fields (like the cached hash) contribute a couple more bits per entry.
>>> 
>>> On the other hand, even a perfectly balanced binary tree has a leaf node 
>>> for every element, a parent node for every two, a grandparent for every 
>>> four, etc., so there are about 2n nodes for n elements.  Assuming a tiny 
>>> eight-byte header, that's 2 + (3/2) + (3/4) + (3/8) + ... = 5 eight-byte 
>>> words for each element.  So it adds 400% overhead beyond an array 
>>> representation.  It also 
>>> stresses branch prediction to navigate downward to an element (although the 
>>> binary searches within an array are similar).  And each node eats a 
>>> separate entry in the processor's on-chip L1 associative cache.  And the 
>>> cost for the garbage collector to manage it goes up considerably.  And the 
>>> algorithms I've seen would involve constructing around twenty new nodes 
>>> for a simple insert or replacement ("persistently") for a 
>>> million element tree.  That's probably a bit less space than an update of 
>>> the 16-64 BTree; it's hard to say which would really be more expensive 
>>> overall.
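Writing the accounting above out explicitly (all figures in eight-byte words per element; the node shapes are illustrative assumptions, not Avail's actual layout):

```java
// Back-of-the-envelope check of the overhead figures discussed above.
final class Overhead {
    /** Balanced binary tree: a leaf (header + element reference) costs 2 words
     *  per element; internal nodes (header + 2 child pointers) cost 3 words
     *  each, and there are n/2 + n/4 + ... ~= n of them, amortizing to
     *  3*(1/2 + 1/4 + ...) = 3 more words.  Total: 5 words per element. */
    static double binaryTreeWordsPerElement() {
        double internal = 0;
        for (int level = 1; level <= 50; level++) {
            internal += 3 / Math.pow(2, level);  // 3 words, halved per level
        }
        return 2 + internal;
    }

    /** 16..64-wide BTree: the parent holds one eight-byte reference per leaf,
     *  amortized over at least leafSize elements. */
    static double bTreeParentWordsPerElement(int leafSize) {
        return 1.0 / leafSize;
    }
}
```

With 16-element leaves the parent reference amortizes to 1/16 of a word, i.e. half a byte per element, versus roughly four extra words per element for the binary tree beyond a plain one-word-per-element array.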
>>> 
>>> 
>>>> On Apr 10, 2015, at 3:41 PM, Mark van Gulik <[email protected]> wrote:
>>>> 
>>>> Excellent!
>>>> 
>>>> As you point out, L2 and L3 are not there yet, but the base algorithms and 
>>>> representations are where we’ve focused our efforts so far... but there’s 
>>>> nothing stopping us from reaching the same performance some day by 
>>>> compiling down to the JVM, and we should be able to exceed it by compiling 
>>>> to LLVM.  I feel LLVM will be much faster because the JVM has to meet 
>>>> specific safety concerns that we can take care of in other ways.  For 
>>>> example, the bytecode verifier knows nothing about generics, and simply 
>>>> relies on the compiler producing a mostly-correct program, leaving dynamic 
>>>> casts all over the place.  That’s probably a good thing, since Java 
>>>> generics aren’t even close to sound, so even if the linking architecture 
>>>> did anything with them it would still require lots of dynamic checks.  
>>>> Avail has some of the same issues with genericity, but I’m guessing our 
>>>> primitives already provide a much larger collection of useful guarantees 
>>>> than the JVM and JRE.  And as soon as we start to inline generics we’ll 
>>>> get back not only the cost of those checks, but probably some vastly wider 
>>>> opportunities related to the semantics of the built-in types.  Java just 
>>>> has integral types, garbage-collected type-safe pointers, and structs with 
>>>> volatile and final annotations.  It’s a great substrate for low-level 
>>>> programs, but it doesn’t really have any facilities for scaling up when 
>>>> these low-level parts are combined.  Inlining and statistics-driven 
>>>> polymorphic dispatching are about it, but they mostly just keep 
>>>> programmers from doing stuff like inlining on their own (a goal of Avail 
>>>> that predates Java).
>>>> 
>>>> Just so you’re aware, we’re expecting to get some of our best 
>>>> optimizations out of the looping, folding, and conditional operations, or 
>>>> really any operations that do control inversion by passing functions 
>>>> and/or continuations.  We’ll strongly bias inlining decisions based on 
>>>> whether a function or continuation is being passed to external code.  By 
>>>> preferring to inline those, we can not only fold away their construction, 
>>>> but also convert things like “Restart_” into mere jumps, opening the door 
>>>> for traditional loop optimizations.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Apr 10, 2015, at 3:09 PM, Robbert van Dalen <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> Mark,
>>>>> 
>>>>> I’ve ‘ported’ the TreeTuple implementation to Scala to learn more about 
>>>>> its inner workings.
>>>>> I have yet to find a major performance issue: your TreeTuple code is 
>>>>> pretty fit!
>>>>> I wonder what the source of the implementation is - or did you invent it 
>>>>> yourself?
>>>>> 
>>>>> Anyway, I’m doing some (performance) experiments with map and fold over 
>>>>> (Tree)Tuples.
>>>>> The Scala version’s performance is pretty good (using specialisation and 
>>>>> by leveraging some JVM magic):
>>>>> 
>>>>> Mapping over a 100 million element TreeTuple (add 1 to all elements) - 
>>>>> using 128-wide native array-backed sub-Tuples - takes ~5 seconds
>>>>> Folding over the same TreeTuple (sum all elements) takes ~0.3 seconds.
>>>>> 
>>>>> Of course, the Avail version is orders-of-magnitude slower as the L2 and 
>>>>> L3 compilers are not yet there.
>>>>> Consider that a benchmark for us to shoot at!
>>>>> 
>>>>> cheers,
>>>>> Robbert.
> 
