If we got rid of the pre-toggling of tuple elements' hash values and simply added them instead, and if integers hashed with a polynomial instead of trying to be well shuffled, we would be able to compute interval tuple hashes in *way* less than linear time.
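A minimal sketch of why this would work (the constants and names below are my own illustration, not Avail's actual hash functions): if a tuple's hash were the wrapping sum of its elements' hashes, and an integer i hashed to a linear polynomial A*i + B (mod 2^32), then the hash of an interval [lo..hi] collapses to a closed form over the arithmetic series, with no per-element loop.

```java
public class IntervalHash {
    static final int A = 0x9E3779B1;   // arbitrary odd multiplier (assumed, not Avail's)
    static final int B = 0x85EBCA6B;   // arbitrary offset (assumed, not Avail's)

    static int elementHash(int i) { return A * i + B; }

    // O(n): sum the element hashes one by one, wrapping mod 2^32.
    static int naiveHash(int lo, int hi) {
        int h = 0;
        for (int i = lo; i <= hi; i++) h += elementHash(i);
        return h;
    }

    // O(1): sum_{i=lo..hi} (A*i + B) = A*T + B*n, where n = hi-lo+1 and
    // T = n*(lo+hi)/2 is the arithmetic-series sum, computed exactly in 64 bits.
    static int closedFormHash(int lo, int hi) {
        long n = (long) hi - lo + 1;
        long t = n * (lo + (long) hi) / 2;  // division exact: n or lo+hi is even
        return (int) (A * t + B * n);       // cast wraps mod 2^32, matching the int sums
    }
}
```

The pre-toggling (an XOR step) is what blocks this today: XOR doesn't distribute over addition the way multiplication does, so the per-element hashes can't be summed in closed form.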
> On Apr 11, 2015, at 1:40 PM, Robbert van Dalen <[email protected]> wrote:
>
> Having experimented with five or so different immutable Tuple implementations,
> I can say with confidence that your BTree implementation is superior to all of
> them in terms of memory footprint and L1 caching.
> That said, the memory footprint can be reduced even further if the cumulative
> sizes array is dropped. Without it you’ll lose binary search (though only
> within a maximum-size-64 element tuple), but that isn’t too bad considering
> that the L1 cache doesn’t like binary searches anyway.
>
> I’m also still not sure whether your BTree doesn’t degenerate after certain
> patterns of repeated inserts, appends, slices, etc.,
> or whether certain operations like append and slice aren’t always O(log(n))
> time, but worse in certain obscure corner cases.
> I guess, theoretically, it should never degenerate *if* its invariant is
> always kept: i.e. that it must always be a constant-height tree.
>
> But what’s *really* interesting about Avail’s tuples is the choice of an
> (incremental) hashing algorithm that is invariant to a tuple’s
> representation.
> (Enchilada also has incremental hashes, but they are cryptographic, which
> means that different Tuple representations of the same data end up with
> different hashes.)
>
> Unfortunately, in Avail, there are currently two tuple implementations -
> ReverseTuple and SmallIntervalTuple - that don’t have incremental hashes
> invariant of representation.
> After all, how would one quickly calculate the (reverse) hash of a
> ReverseTuple?
> The same goes for a big interval of integers.
>
> cheers,
> Robbert.
>
>
>> On 11 Apr 2015, at 18:51, Mark van Gulik <[email protected]> wrote:
>>
>> The implementations of tuples are entirely my own (and Leslie’s for the
>> interval and repeated tuples).
>>
>> However, the tree implementation is “just” a BTree implementation with a
>> minimum non-root node size of 16 and a maximum size of 64.
>> Normally a BTree will have at most a ratio of 2 between min and max, but
>> that’s just to keep disk pages nice and dense. No need for that in memory.
>> But using a mechanism like that rather than some sort of binary tree saves
>> about half the memory cost (and 90% of the L1 miss cost when it’s cold),
>> because it uses a simple array for its elements and subtrees. A tree tuple
>> with a million elements will have between 4 and 6 layers (log to the bases
>> 64 and 16). Each element has its usual reference pointing to it from its
>> leaf tuple, but if there are 16 elements in each leaf then the immediate
>> parent tree node is holding an eight-byte reference to the leaf, for an
>> amortized cost of half a byte per element. Higher nodes contribute
>> significantly less amortized cost. The leaf tuple’s header and incidental
>> fields (like the cached hash) contribute a couple more bits per entry.
>>
>> On the other hand, even a perfectly balanced binary tree has a leaf node
>> for every element, a parent node for every two, a grandparent for every
>> four, etc., so there are about 2n nodes for n elements. Assuming a tiny
>> eight-byte header, that’s 2 + (3/2) + (3/4) + … = 5 eight-byte words for
>> each element. So it adds 400% overhead beyond an array representation. It
>> also stresses branch prediction when navigating down to an element
>> (although the binary searches within an array are similar). Each node also
>> eats a separate entry in the processor’s on-chip L1 associative cache, and
>> the cost for the garbage collector to manage it all goes up considerably.
>> And the algorithms I’ve seen would require around twenty new nodes to be
>> constructed for a simple ("persistent") insert or replacement in a
>> million-element tree. That’s probably a bit less space than an update of
>> the 16-64 BTree; it’s hard to say which would really be more expensive
>> overall.
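The incremental, representation-invariant hashing Robbert praises above can be sketched with a toy polynomial hash (my own combiner, not Avail's actual algorithm): the hash of a concatenation is derivable from the parts' hashes and lengths alone, so every tree shape over the same elements hashes identically, and re-hashing after a concatenation costs O(log n) rather than O(n).

```java
public class ConcatHash {
    static final int X = 1_000_003;  // arbitrary odd multiplier (assumed)

    // Hash a flat run of elements by Horner's rule, wrapping mod 2^32:
    // h = ((e0*X + e1)*X + e2)*X + ...
    static int hashFlat(int[] elements) {
        int h = 0;
        for (int e : elements) h = h * X + e;
        return h;
    }

    // X^exp mod 2^32 by square-and-multiply: O(log exp).
    static int pow(int base, int exp) {
        int r = 1;
        while (exp > 0) {
            if ((exp & 1) != 0) r *= base;
            base *= base;
            exp >>>= 1;
        }
        return r;
    }

    // Incremental combine: H(a ++ b) = H(a) * X^|b| + H(b),
    // with no traversal of either tuple's elements.
    static int combine(int hashA, int hashB, int lengthB) {
        return hashA * pow(X, lengthB) + hashB;
    }
}
```

Note that because the combiner is order-sensitive, the hash of a *reversed* tuple cannot be recovered from the forward hash, which is exactly the ReverseTuple difficulty Robbert raises.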
>>
>>
>>> On Apr 10, 2015, at 3:41 PM, Mark van Gulik <[email protected]> wrote:
>>>
>>> Excellent!
>>>
>>> As you point out, L2 and L3 are not there yet, but the base algorithms and
>>> representations are where we’ve focused our efforts so far... but there’s
>>> nothing stopping us from reaching the same performance some day by
>>> compiling down to the JVM, and we should be able to exceed it by compiling
>>> to LLVM. I feel LLVM will be much faster because the JVM has to meet
>>> specific safety concerns that we can take care of in other ways. For
>>> example, the bytecode verifier knows nothing about generics, and simply
>>> relies on the compiler producing a mostly-right program, leaving dynamic
>>> casts all over the place. That’s probably a good thing, since Java
>>> generics aren’t even close to sound, so even if the linking architecture
>>> did anything with them it would still require lots of dynamic checks.
>>> Avail has some of the same issues with genericity, but I’m guessing our
>>> primitives already provide a much larger collection of useful guarantees
>>> than the JVM and JRE do. And as soon as we start to inline generics we’ll
>>> get back not only the cost of those checks, but probably some vastly wider
>>> opportunities related to the semantics of the built-in types. Java just
>>> has integral types, garbage-collected type-safe pointers, and structs with
>>> volatile and final annotations. It’s a great substrate for low-level
>>> programs, but it doesn’t really have any facilities for scaling up when
>>> these low-level parts are combined. Inlining and statistics-driven
>>> polymorphic dispatching are about it, but they mostly just keep
>>> programmers from doing stuff like inlining on their own (a goal of Avail
>>> that predates Java).
>>>
>>> Just so you’re aware, we’re expecting to get some of our best
>>> optimizations out of the looping, folding, and conditional operations, or
>>> really any operations that do control inversion by passing functions
>>> and/or continuations. We’ll strongly bias inlining decisions based on
>>> whether a function or continuation is being passed to external code. By
>>> preferring to inline those, we can not only fold away their construction,
>>> but also convert things like “Restart_” into mere jumps, opening the door
>>> for traditional loop optimizations.
>>>
>>>
>>>
>>>
>>>> On Apr 10, 2015, at 3:09 PM, Robbert van Dalen <[email protected]>
>>>> wrote:
>>>>
>>>> Mark,
>>>>
>>>> I’ve ‘ported’ the TreeTuple implementation to Scala to learn more about
>>>> its inner workings.
>>>> I have yet to find a major performance issue: your TreeTuple code is
>>>> pretty fit!
>>>> I wonder what the source of the implementation is: or did you invent it
>>>> yourself?
>>>>
>>>> Anyway, I’m doing some (performance) experiments with map and fold over
>>>> (Tree)Tuples.
>>>> The Scala version’s performance is pretty good (using specialisation and
>>>> by leveraging some JVM magic):
>>>>
>>>> Mapping over a 100-million-element TreeTuple (add 1 to all elements) -
>>>> using 128-wide native array-backed sub-Tuples - takes ~5 seconds.
>>>> Folding over the same TreeTuple (sum all elements) takes ~0.3 seconds.
>>>>
>>>> Of course, the Avail version is orders of magnitude slower, as the L2 and
>>>> L3 compilers are not there yet.
>>>> Consider that a benchmark for us to shoot at!
>>>>
>>>> cheers,
>>>> Robbert.
