If we got rid of the pre-toggling of tuple elements' hash values and did an add 
instead, and if integers used a polynomial to hash instead of trying to be well 
shuffled, we would be able to compute interval tuple hashes in *way* less than 
linear time.
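
Concretely, here's a sketch (Python, not Avail) of that proposal. It assumes a
plain polynomial hash H(x_1..x_n) = sum_i x_i * m^(n-i) mod 2^32 with a
hypothetical multiplier m, and assumes each integer simply hashes to itself;
under those assumptions the hash of the interval a, a+1, ..., a+n-1 has a
closed form computable in O(log n) by binary splitting:

```python
# Sketch of a polynomial tuple hash (NOT Avail's actual algorithm).
# A block of k consecutive terms is summarized as (m^k, G(k), S(k), k),
#   where G(k) = sum_{j<k} m^j  and  S(k) = sum_{j<k} j*m^j  (mod 2^32).

M = 0x9E3779B1      # hypothetical odd multiplier
MOD = 1 << 32

def combine(left, right):
    # Concatenate two adjacent blocks; `left` holds the lower indices.
    (pa, ga, sa, ka), (pb, gb, sb, kb) = left, right
    return (pa * pb % MOD,
            (ga + pa * gb) % MOD,
            (sa + pa * (sb + ka * gb)) % MOD,
            ka + kb)

def sums(n):
    # Compute (m^n, G(n), S(n), n) by binary splitting, O(log n) steps.
    result = (1, 0, 0, 0)    # identity: the empty block
    block = (M, 1, 0, 1)     # one term: m^1, G(1) = 1, S(1) = 0
    while n:
        if n & 1:
            result = combine(result, block)
        block = combine(block, block)
        n >>= 1
    return result

def interval_hash(a, n):
    # H = sum_{i<n} (a+i) * m^(n-1-i) = (a+n-1)*G(n) - S(n)  (mod 2^32)
    _, g, s, _ = sums(n)
    return ((a + n - 1) * g - s) % MOD

def brute_hash(xs):
    # The same hash computed element by element, for comparison.
    h = 0
    for x in xs:
        h = (h * M + x) % MOD
    return h
```

With this shape, hashing a billion-element interval costs a few dozen combine
steps instead of a billion multiplies.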


> On Apr 11, 2015, at 1:40 PM, Robbert van Dalen <[email protected]> wrote:
> 
> Having experimented with five or so different immutable Tuple 
> implementations, I can say with confidence that your BTree implementation is 
> superior to all of them in terms of memory footprint and L1 caching.
> That said, the memory footprint can be reduced even further if the cumulative 
> sizes array is dropped. Without it you’ll lose binary search (though only 
> within a node of at most 64 elements), but that isn’t too bad considering 
> that the L1 cache doesn’t like binary searches anyway.
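
To make that trade-off concrete, here is a sketch (a hypothetical node layout,
not Avail's actual code) of the two ways a node could route an element index to
the child holding it:

```python
# Hypothetical B-tree node layouts.  With a cumulative-sizes array you
# can binary-search for the child holding element index i; without it
# you scan the per-child sizes, which is cheap for <= 64 children.
import bisect

def find_child_binary(cumulative_sizes, i):
    # cumulative_sizes[k] = total elements in children 0..k
    child = bisect.bisect_right(cumulative_sizes, i)
    offset = i - (cumulative_sizes[child - 1] if child else 0)
    return child, offset

def find_child_linear(sizes, i):
    # sizes[k] = elements in child k alone; scan and subtract
    for child, size in enumerate(sizes):
        if i < size:
            return child, i
        i -= size
    raise IndexError("index out of range")
```

The linear scan touches one small contiguous array front to back, which is
exactly the access pattern L1 caches and prefetchers like.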
> 
> I’m also still not sure whether your BTree might degenerate after certain 
> patterns of repeated inserts, appends, slices, etc., or whether operations 
> like append and slice, normally O(log(n)) time, might be worse in certain 
> obscure corner cases.
> I guess, theoretically, it should never degenerate *if* its invariant is 
> always kept: i.e. that all leaves must always sit at the same depth.
> 
> But what’s *really* interesting about Avail’s tuples is the choice of an 
> (incremental) hashing algorithm that is invariant to a tuple’s 
> representation.
> (Enchilada also has incremental hashes, but they are cryptographic, which 
> means that different Tuple representations of the same data end up with 
> different hashes.)
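
The property being described can be sketched with an ordinary polynomial hash
(a stand-in, not Avail's or Enchilada's actual algorithm): since
hash(A ++ B) = hash(A) * m^|B| + hash(B), a tree node can combine its
children's cached hashes in O(1), so every representation of the same elements
hashes identically:

```python
# Sketch of a representation-invariant incremental hash
# (hypothetical multiplier; not Avail's actual scheme).
MOD = 1 << 32
M = 0x9E3779B1

def hash_leaf(xs):
    # Polynomial hash of a flat run of elements; returns (hash, length).
    h = 0
    for x in xs:
        h = (h * M + x) % MOD
    return h, len(xs)

def hash_concat(left, right):
    # hash(AB) = hash(A) * m^|B| + hash(B): combine cached child
    # hashes without revisiting the elements themselves.
    (ha, na), (hb, nb) = left, right
    return (ha * pow(M, nb, MOD) + hb) % MOD, na + nb
```

Any tree of hash_concat calls over the same left-to-right sequence of leaves
yields the same value, which is what makes the hash independent of the tuple's
internal representation.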
> 
> Unfortunately, Avail currently has two tuple implementations - ReverseTuple 
> and SmallIntervalTuple - that don’t have representation-invariant 
> incremental hashes.
> After all, how would one quickly calculate the (reversed) hash of a 
> ReverseTuple?
> The same goes for a big interval of integers.
> 
> cheers,
> Robbert.
> 
> 
>> On 11 Apr 2015, at 18:51, Mark van Gulik <[email protected]> wrote:
>> 
>> The implementations of tuples are entirely my own (and Leslie's for the 
>> interval and repeated tuples).
>> 
>> However, the tree implementation is "just" a BTree with a minimum non-root 
>> node size of 16 and a maximum size of 64.  Normally a BTree will have at 
>> most a ratio of 2 between min and max, but that's just to keep disk pages 
>> nice and dense; there's no need for that in memory.  Using a mechanism like 
>> that rather than some sort of binary tree saves about half the memory cost 
>> (and 90% of the L1 miss cost when it's cold), because it uses a simple 
>> array for its elements and subtrees.  A tree tuple with a million elements 
>> will have between 4 and 6 layers (the logs of a million to the bases 64 and 
>> 16).  Each element has its usual reference pointing to it from its leaf 
>> tuple, but if there are 16 elements in each leaf then the immediate parent 
>> tree node is holding only an eight-byte reference to the leaf, for an 
>> amortized cost of half a byte per element.  Higher nodes contribute 
>> significantly less amortized cost.  The leaf tuple's header and incidental 
>> fields (like the cached hash) contribute a few more bits per entry.
>> 
>> On the other hand, even a perfectly balanced binary tree has a leaf node 
>> for every element, a parent node for every two, a grandparent for every 
>> four, etc., so there are about 2n nodes in total - two per element.  
>> Assuming a tiny eight-byte header, that's 2 + (3/2) + (3/4) + ... = 5 
>> eight-byte words per element, i.e. 400% overhead beyond an array 
>> representation.  It also stresses branch prediction to navigate downward to 
>> an element (although the binary searches within an array are similar).  And 
>> each node eats a separate entry in the processor's on-chip L1 associative 
>> cache.  And the cost for the garbage collector to manage it goes up 
>> considerably.  And the algorithms I've seen would construct around twenty 
>> new nodes for a simple insert or replacement ("persistently") in a 
>> million-element tree.  That's probably a bit less space than an update of 
>> the 16-64 BTree; it's hard to say which would really be more expensive 
>> overall.
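
As a back-of-envelope check of those figures (one-word headers and one-word
references assumed, which may not match a real JVM or Avail object layout
exactly):

```python
# Rough per-element memory accounting, in 8-byte words.

def binary_tree_words_per_element():
    # Leaf: header + element = 2 words.  Each internal node costs
    # header + two child references = 3 words, amortized over
    # 2, 4, 8, ... elements: 3/2 + 3/4 + ... -> 3 words.
    return 2 + sum(3 / 2**k for k in range(1, 60))

def btree_parent_bytes_per_element(leaf_size=16):
    # One eight-byte parent reference per leaf of `leaf_size` elements.
    return 8 / leaf_size

print(round(binary_tree_words_per_element()))   # ~5 words per element
print(btree_parent_bytes_per_element())         # 0.5 bytes per element
```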
>> 
>> 
>>> On Apr 10, 2015, at 3:41 PM, Mark van Gulik <[email protected]> wrote:
>>> 
>>> Excellent!
>>> 
>>> As you point out, L2 and L3 are not there yet, but the base algorithms and 
>>> representations are where we’ve focused our efforts so far... but there’s 
>>> nothing stopping us from reaching the same performance some day by 
>>> compiling down to the JVM, and we should be able to exceed it by compiling 
>>> to LLVM.  I feel LLVM will be much faster because the JVM has to meet 
>>> specific safety concerns that we can take care of in other ways.  For 
>>> example, the bytecode verifier knows nothing about generics, and simply 
>>> relies on the compiler producing a mostly right program, leaving dynamic 
>>> casts all over the place.  That’s probably a good thing, since Java 
>>> generics aren’t even close to sound, so even if the linking architecture 
>>> did anything with them it would still require lots of dynamic checks.  
>>> Avail has some of the same issues with genericity, but I’m guessing our 
>>> primitives already provide a much larger collection of useful guarantees 
>>> than the JVM and JRE.  And as soon as we start to inline generics we’ll get 
>>> back not only the cost of those checks, but probably some vastly wider 
>>> opportunities related to the semantics of the built-in types.  Java just 
>>> has integral types, garbage collected type-safe pointers, and structs with 
>>> volatile and final annotations.  It’s a great substrate for low-level 
>>> programs, but it doesn’t really have any facilities for scaling up when 
>>> these low-level parts are combined.  Inlining and statistics-driven 
>>> polymorphic dispatching are about it, but they mostly just keep programmers 
>>> from doing stuff like inlining on their own (a goal of Avail that predates 
>>> Java).
>>> 
>>> Just so you’re aware, we’re expecting to get some of our best optimizations 
>>> out of the looping, folding, and conditional operations, or really any 
>>> operations that do control inversion by passing functions and/or 
>>> continuations.  We’ll strongly bias inlining decisions based on whether a 
>>> function or continuation is being passed to external code.  By preferring 
>>> to inline those, we can not only fold away their construction, but also 
>>> convert things like “Restart_” into mere jumps, opening the door for 
>>> traditional loop optimizations.
>>> 
>>> 
>>> 
>>> 
>>>> On Apr 10, 2015, at 3:09 PM, Robbert van Dalen <[email protected]> 
>>>> wrote:
>>>> 
>>>> Mark,
>>>> 
>>>> I’ve ‘ported’ the TreeTuple implementation to Scala to learn more about 
>>>> its inner workings.
>>>> I have yet to find a major performance issue: your TreeTuple code is 
>>>> pretty fit!
>>>> I wonder what the source of the implementation is: did you invent it 
>>>> yourself?
>>>> 
>>>> Anyway, I’m doing some (performance) experiments with map and fold over 
>>>> (Tree)Tuples.
>>>> The Scala version’s performance is pretty good (using specialisation and 
>>>> by leveraging some JVM magic):
>>>> 
>>>> Mapping over a 100-million-element TreeTuple (add 1 to all elements) - 
>>>> using 128-wide native array-backed sub-Tuples - takes ~5 seconds.
>>>> Folding over the same TreeTuple (sum all elements) takes ~0.3 seconds.
>>>> 
>>>> Of course, the Avail version is orders-of-magnitude slower as the L2 and 
>>>> L3 compilers are not yet there.
>>>> Consider that a benchmark for us to shoot at!
>>>> 
>>>> cheers,
>>>> Robbert.
> 
