Mark,

Yeah, but there is no hope of computing a ReverseTuple’s hash in less than linear time.

Oh, I’ve been digging around and found this :)
http://lists.squeakfoundation.org/pipermail/squeak-dev/2000-August/006848.html

Back then, you were using rolling hashes to quickly update the hash of two 
concatenated ropes. Cool!
So I guess you already knew about Rabin-Karp rolling hashes way back in 2000?
 (http://en.wikipedia.org/wiki/Rolling_hash)

Interestingly, the Wikipedia page also describes the cyclic polynomial 
rolling hash algorithm, which has even better randomness properties.
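For what it’s worth, here is a minimal sketch of the rope-concatenation trick: with a polynomial hash H(s) = s[0]·B^(n-1) + … + s[n-1] (mod M), the hash of a concatenation follows from the two parts’ hashes alone, without rescanning either rope. The base and modulus below are arbitrary illustrative choices, not the parameters of Squeak’s or Avail’s actual schemes.

```java
import java.math.BigInteger;

public class PolyHash {
    // Arbitrary illustrative base and modulus (hypothetical, for the sketch only).
    static final BigInteger B = BigInteger.valueOf(1000003);
    static final BigInteger M = BigInteger.valueOf(2).pow(61).subtract(BigInteger.ONE);

    // Polynomial hash: H(s) = s[0]*B^(n-1) + ... + s[n-1]  (mod M).
    static BigInteger hash(String s) {
        BigInteger h = BigInteger.ZERO;
        for (char c : s.toCharArray()) {
            h = h.multiply(B).add(BigInteger.valueOf(c)).mod(M);
        }
        return h;
    }

    // Hash of the concatenation a ++ b from the parts' hashes alone:
    // H(a ++ b) = H(a) * B^|b| + H(b)  (mod M), i.e. O(log |b|) work.
    static BigInteger concat(BigInteger ha, BigInteger hb, int lenB) {
        return ha.multiply(B.modPow(BigInteger.valueOf(lenB), M)).add(hb).mod(M);
    }
}
```

Unlike a plain sum, this stays order-sensitive (hash("ab") ≠ hash("ba")), which is what makes it usable as a tuple/rope hash in the first place.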

cheers,
Robbert.

> On 11 Apr 2015, at 23:16, Mark van Gulik <[email protected]> wrote:
> 
> If we got rid of the pre-toggling of tuple elements' hash values and did an 
> add instead, and if integers used a polynomial to hash instead of trying to 
> be well shuffled, we would be able to compute interval tuple hashes in *way* 
> less than linear time.
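To make Mark’s point concrete, here is a hypothetical worked example, not Avail’s actual hash function: suppose each integer i hashed to the polynomial p(i) = i² + C and a tuple’s hash were the plain sum of its elements’ hashes. Then the hash of the interval tuple ⟨a, a+1, …, b⟩ falls out of closed-form power sums in O(1) instead of O(n).

```java
public class IntervalHash {
    static final long C = 7;  // hypothetical constant term in the per-element polynomial

    // Closed form for 1^2 + 2^2 + ... + m^2  (assumes m >= 0; overflow ignored in the sketch).
    static long sqSum(long m) { return m * (m + 1) * (2 * m + 1) / 6; }

    // Hash of the interval <a, ..., b> under the hypothetical scheme
    // h = sum of (i*i + C) for i in [a, b], computed in O(1).
    static long intervalHash(long a, long b) {
        long n = b - a + 1;
        return (sqSum(b) - sqSum(a - 1)) + C * n;
    }

    // Linear-time reference implementation for comparison.
    static long slowHash(long a, long b) {
        long h = 0;
        for (long i = a; i <= b; i++) h += i * i + C;
        return h;
    }
}
```

The caveat is visible in the sketch: a plain sum is order-insensitive, which is also what would make a ReverseTuple’s hash cheap, at the cost of a weaker hash overall.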
> 
> 
>> On Apr 11, 2015, at 1:40 PM, Robbert van Dalen <[email protected]> wrote:
>> 
>> Having experimented with some 5 different immutable Tuple implementations, 
>> I can say with confidence that your BTree implementation is superior to all 
>> of them in terms of memory footprint and L1 caching.
>> That said, the memory footprint can be reduced even further if the cumulative 
>> sizes array is dropped. Without it you’ll lose binary search (though only 
>> within a maximum-size 64-element node), but that isn’t too bad considering 
>> that the L1 cache doesn’t like binary searches anyway.
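For concreteness, the two lookup strategies being traded off might look like this (hypothetical node layout, not Avail’s actual code): with a cumulative-sizes array, the child covering a given element index is found by binary search; without it, a linear scan over at most 64 per-child sizes, which is friendly to the cache and the branch predictor.

```java
import java.util.Arrays;

public class NodeSearch {
    // With cumulative sizes: cum[j] = total elements in children 0..j.
    // Returns the child holding zero-based element index i, via binary search.
    static int childFor(int[] cum, int i) {
        int pos = Arrays.binarySearch(cum, i);
        // Exact hit means i is the first index of the NEXT child;
        // otherwise decode the insertion point.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Without cumulative sizes: linear scan over the per-child sizes
    // (at most 64 of them in a node of the BTree described above).
    static int childForLinear(int[] childSizes, int i) {
        int j = 0;
        while (i >= childSizes[j]) {
            i -= childSizes[j];
            j++;
        }
        return j;
    }
}
```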
>> 
>> I’m also still not sure whether your BTree degenerates after certain 
>> patterns of repeated inserts, appends, slices, etc., 
>> or whether operations like append and slice are always O(log(n)) 
>> time, rather than worse in certain obscure corner cases.
>> I guess, theoretically, it should never degenerate *if* its invariant is 
>> always kept: i.e. that it must always remain a tree of uniform height.
>> 
>> But what’s *really* interesting about Avail’s tuples is the choice of using 
>> an (incremental) hashing algorithm that is invariant under a tuple’s 
>> representation.
>> (Enchilada also has incremental hashes, but they are cryptographic which 
>> means that different Tuple representations of the same data end up with 
>> different hashes).
>> 
>> Unfortunately, in Avail, there are currently two tuple implementations - 
>> ReverseTuple and SmallIntervalTuple - that don’t have representation-invariant 
>> incremental hashes.
>> After all, how would one quickly calculate the (reversed) hash of a ReverseTuple?
>> The same goes for a big interval of integers.
>> 
>> cheers,
>> Robbert.
>> 
>> 
>>> On 11 Apr 2015, at 18:51, Mark van Gulik <[email protected]> wrote:
>>> 
>>> The implementations of tuples are entirely my own (and Leslie's for the 
>>> interval and repeated tuples).
>>> 
>>> However, the tree implementation is "just" a BTree implementation with a 
>>> minimum non-root node size of 16 and a maximum size of 64.  Normally a 
>>> BTree will have at most a ratio of 2 between min and max, but that's just 
>>> to keep disk pages nice and dense.  No need for that in memory.  But using 
>>> a mechanism like that rather than some sort of binary tree saves about half 
>>> the memory cost (and 90% of the L1 miss cost when it's cold), because of 
>>> the fact that it uses a simple array for its elements and subtrees.  A tree 
>>> tuple with a million elements will have between 4 and 6 layers (log to the 
>>> bases 16 and 64).  Each element has its usual reference pointing to it from 
>>> its leaf tuple, but if there are 16 elements in each leaf then the 
>>> immediate parent tree node is holding an eight-byte reference to the leaf, 
>>> for an amortized cost of half a byte per element.  Higher nodes contribute 
>>> significantly less amortized cost.  The leaf tuple's header and incidental 
>>> fields (like the cached hash) contribute a couple more bits per entry.
>>> 
>>> On the other hand, even a perfectly balanced binary tree has a leaf node 
>>> for every element, a parent node for every two, a grandparent for every 
>>> four, etc., so there are about 2n nodes in total, i.e. two per element.  
>>> Assuming a tiny eight-byte header, that's 2 + 3/2 + 3/4 + 3/8 + ... = 5 
>>> eight-byte words per element.  So it adds 400% overhead beyond an array 
>>> representation.  It also 
>>> stresses branch prediction to navigate downward to an element (although the 
>>> binary searches within an array are similar).  And each node eats a 
>>> separate entry in the processor's on-chip L1 associative cache.  And the 
>>> cost for the garbage collector to manage it goes up considerably.  And the 
>>> algorithms I've seen would require around twenty new nodes to be 
>>> constructed for a simple insert or replacement ("persistently") for a 
>>> million element tree.  That's probably a bit less space than an update of 
>>> the 16-64 BTree; it's hard to say which would really be more expensive 
>>> overall.
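Mark’s figure of 5 words per element is the sum of a geometric series and can be sanity-checked numerically. Under the stated assumptions (one eight-byte word per header and per pointer; a leaf is header + element pointer; an internal node is header + two child pointers, one per 2^i elements at level i), the per-element cost of a perfectly balanced binary tree converges to 2 + 3·(1/2 + 1/4 + …) = 5:

```java
public class TreeCost {
    // Words of binary-tree structure per element, summed over `levels`
    // internal levels: the leaf contributes 2 words per element, and
    // internal level i contributes 3 words per 2^i elements.
    static double wordsPerElement(int levels) {
        double total = 2.0;
        for (int i = 1; i <= levels; i++) {
            total += 3.0 / (1L << i);
        }
        return total;  // approaches 2 + 3 = 5 as `levels` grows
    }
}
```

Against the one word per element of a flat array, that is the 400% overhead quoted above.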
>>> 
>>> 
>>>> On Apr 10, 2015, at 3:41 PM, Mark van Gulik <[email protected]> wrote:
>>>> 
>>>> Excellent!
>>>> 
>>>> As you point out, L2 and L3 are not there yet, but the base algorithms and 
>>>> representations are where we’ve focused our efforts so far... but there’s 
>>>> nothing stopping us from reaching the same performance some day by 
>>>> compiling down to the JVM, and we should be able to exceed it by compiling 
>>>> to LLVM.  I feel LLVM will be much faster because the JVM has to meet 
>>>> specific safety concerns that we can take care of in other ways.  For 
>>>> example, the bytecode verifier knows nothing about generics, and simply 
>>>> relies on the compiler producing a mostly right program, leaving dynamic 
>>>> casts all over the place.  That’s probably a good thing, since Java 
>>>> generics aren’t even close to sound, so even if the linking architecture 
>>>> did anything with them it would still require lots of dynamic checks.  
>>>> Avail has some of the same issues with genericity, but I’m guessing our 
>>>> primitives already provide a much larger collection of useful guarantees 
>>>> than the JVM and JRE.  And as soon as we start to inline generics we’ll 
>>>> get back not only the cost of those checks, but probably some vastly wider 
>>>> opportunities related to the semantics of the built-in types.  Java just 
>>>> has integral types, garbage collected type-safe pointers, and structs with 
>>>> volatile and final annotations.  It’s a great substrate for low-level 
>>>> programs, but it doesn’t really have any facilities for scaling up when 
>>>> these low-level parts are combined.  Inlining and statistics-driven 
>>>> polymorphic dispatching are about it, but they mostly just keep 
>>>> programmers from doing stuff like inlining on their own (a goal of Avail 
>>>> that predates Java).
>>>> 
>>>> Just so you’re aware, we’re expecting to get some of our best 
>>>> optimizations out of the looping, folding, and conditional operations, or 
>>>> really any operations that do control inversion by passing functions 
>>>> and/or continuations.  We’ll strongly bias inlining decisions based on 
>>>> whether a function or continuation is being passed to external code.  By 
>>>> preferring to inline those, we can not only fold away their construction, 
>>>> but also convert things like “Restart_” into mere jumps, opening the door 
>>>> for traditional loop optimizations.
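A JVM-flavored caricature of that inlining win (hypothetical names; Avail’s “Restart_” machinery is of course richer than this): a fold that invokes a passed-in function per element, versus the plain loop an optimizer can recover once the closure at the call site is inlined.

```java
import java.util.function.LongBinaryOperator;

public class InlineDemo {
    // Control inversion: the combining function is passed in, so each
    // element costs an indirect call (megamorphic if many call sites share it),
    // and the closure itself must be constructed.
    static long foldSum(long[] xs, LongBinaryOperator op) {
        long acc = 0;
        for (long x : xs) acc = op.applyAsLong(acc, x);
        return acc;
    }

    // What inlining foldSum(xs, (a, b) -> a + b) recovers: the closure
    // construction folds away and the body becomes a jump-based loop,
    // opening the door to unrolling, vectorization, etc.
    static long foldSumInlined(long[] xs) {
        long acc = 0;
        for (long x : xs) acc += x;
        return acc;
    }
}
```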
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Apr 10, 2015, at 3:09 PM, Robbert van Dalen <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> Mark,
>>>>> 
>>>>> I’ve ‘ported’ the TreeTuple implementation to Scala to learn more about 
>>>>> its inner workings.
>>>>> I have yet to find a major performance issue: your TreeTuple code is 
>>>>> pretty fit!
>>>>> I wonder what the source of the implementation is. Or did you invent it 
>>>>> yourself?
>>>>> 
>>>>> Anyway, I’m doing some (performance) experiments with map and fold over 
>>>>> (Tree)Tuples.
>>>>> The Scala version’s performance is pretty good (using specialisation and by 
>>>>> leveraging some JVM magic):
>>>>> 
>>>>> Mapping over a 100 million element TreeTuple (add 1 to all elements) - 
>>>>> using 128-wide native array-backed sub-Tuples - takes ~5 seconds
>>>>> Folding over the same TreeTuple (sum all elements) takes ~0.3 seconds.
>>>>> 
>>>>> Of course, the Avail version is orders-of-magnitude slower as the L2 and 
>>>>> L3 compilers are not yet there.
>>>>> Consider that a benchmark for us to shoot at!
>>>>> 
>>>>> cheers,
>>>>> Robbert.
>> 
