Having experimented with some five different immutable Tuple implementations, I can say with confidence that your BTree implementation is superior to all of them in terms of memory footprint and L1 caching. That said, the memory footprint can be reduced even further if the cumulative-sizes array is dropped. Without it you’ll lose binary search (though only within a node of at most 64 elements), but that isn’t too bad considering that the L1 cache doesn’t like binary searches anyway.
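To make that trade-off concrete, here is a minimal sketch (mine, not Avail's actual code, with hypothetical names) of the index-to-child lookup with and without a cumulative-sizes array. Dropping the array forces a linear scan over the per-child sizes, but with at most 64 children that scan is a short sequential pass, which caches and prefetchers handle well:

```java
// Sketch: locating the child subtree that holds a given tuple index.
final class ChildSearch {
    /** Without cumulative sizes: linearly scan the per-child sizes,
     *  subtracting as we go. Returns {childIndex, offsetWithinChild}. */
    static int[] findByLinearScan(int[] childSizes, int index) {
        int remaining = index;
        for (int child = 0; child < childSizes.length; child++) {
            if (remaining < childSizes[child]) {
                return new int[] { child, remaining };
            }
            remaining -= childSizes[child];
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    /** With cumulative sizes: binary search for the first entry that
     *  exceeds the index, then derive the offset within that child. */
    static int[] findByBinarySearch(int[] cumulativeSizes, int index) {
        int low = 0, high = cumulativeSizes.length - 1;
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (index < cumulativeSizes[mid]) high = mid; else low = mid + 1;
        }
        int offset = low == 0 ? index : index - cumulativeSizes[low - 1];
        return new int[] { low, offset };
    }
}
```

Both return the same answer; the question is only which one the hardware likes better at these node sizes.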
I’m also still not sure whether your BTree can degenerate after certain patterns of repeated inserts, appends, slices, etc., or whether operations like append and slice are always O(log(n)) time rather than worse in certain obscure corner cases. I guess, theoretically, it should never degenerate *if* its invariant is always kept: i.e. that it must always remain a tree of uniform height.

But what’s *really* interesting about Avail’s tuples is the choice of an (incremental) hashing algorithm that is invariant to a tuple’s representation. (Enchilada also has incremental hashes, but they are cryptographic, which means that different Tuple representations of the same data end up with different hashes.) Unfortunately, in Avail there are currently two tuple implementations - ReverseTuple and SmallIntervalTuple - that don’t have representation-invariant incremental hashes. After all, how would one quickly calculate the (reversed) hash of a ReverseTuple? The same goes for a big interval of integers.

cheers,
Robbert.

> On 11 Apr 2015, at 18:51, Mark van Gulik <[email protected]> wrote:
>
> The implementations of tuples are entirely my own (and Leslie's for the
> interval and repeated tuples).
>
> However, the tree implementation is "just" a BTree implementation with a
> minimum non-root node size of 16 and a maximum size of 64. Normally a BTree
> will have at most a ratio of 2 between min and max, but that's just to keep
> disk pages nice and dense. No need for that in memory. But using a
> mechanism like that rather than some sort of binary tree saves about half the
> memory cost (and 90% of the L1 miss cost when it's cold), because it uses a
> simple array for its elements and subtrees. A tree tuple with a million
> elements will have between 4 and 6 layers (log to the bases 16 and 64).
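(An inline aside on the representation-invariant hashing I mentioned above - a minimal sketch, not Avail's or Enchilada's actual algorithm: a polynomial hash over the element sequence. Equal sequences hash equally regardless of tree shape, and a concatenation's hash combines in O(1) from the parts' hashes and lengths; note that no such shortcut exists for reversal, which is exactly the ReverseTuple problem.)

```java
// Sketch of a representation-invariant, O(1)-combinable sequence hash.
final class PolyHash {
    static final long P = 1000003L;  // multiplier; long arithmetic wraps mod 2^64

    /** Hash of a sequence: h = ((e0*P + e1)*P + e2)*P + ... */
    static long hash(int[] elements) {
        long h = 0;
        for (int e : elements) h = h * P + e;
        return h;
    }

    /** hash(a ++ b) from the parts alone: h(a) * P^len(b) + h(b). */
    static long concatHash(long hashA, long hashB, int lengthB) {
        return hashA * pow(P, lengthB) + hashB;
    }

    /** Square-and-multiply exponentiation (wrapping mod 2^64). */
    private static long pow(long base, int exp) {
        long r = 1;
        for (; exp > 0; exp >>= 1, base *= base) {
            if ((exp & 1) != 0) r *= base;
        }
        return r;
    }
}
```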
> Each element has its usual reference pointing to it from its leaf
> tuple, but if there are 16 elements in each leaf then the immediate parent
> tree node is holding an eight-byte reference to the leaf, for an amortized
> cost of half a byte per element. Higher nodes contribute significantly less
> amortized cost. The leaf tuple's header and incidental fields (like the
> cached hash) contribute a couple more bits per entry.
>
> On the other hand, even a perfectly balanced binary tree has a leaf node for
> every element, a parent node for every two, a grandparent for every four,
> etc., so there are about 2n nodes for n elements. Assuming a tiny
> eight-byte header, that's 2 + (3/2) + (3/4) + … = 5 eight-byte words for each
> element. So it adds 400% overhead beyond an array representation. It also
> stresses branch prediction to navigate downward to an element (although the
> binary searches within an array are similar). And each node eats a separate
> entry in the processor's on-chip L1 associative cache. And the cost for the
> garbage collector to manage it goes up considerably. And the algorithms I've
> seen would require constructing around twenty new nodes for a simple
> insert or replacement ("persistently") in a million-element tree. That's
> probably a bit less space than an update of the 16-64 BTree; it's hard to say
> which would really be more expensive overall.
>
>
> On Apr 10, 2015, at 3:41 PM, Mark van Gulik <[email protected]> wrote:
>
>> Excellent!
>>
>> As you point out, L2 and L3 are not there yet, but the base algorithms and
>> representations are where we’ve focused our efforts so far... but there’s
>> nothing stopping us from reaching the same performance some day by compiling
>> down to the JVM, and we should be able to exceed it by compiling to LLVM. I
>> feel LLVM will be much faster because the JVM has to meet specific safety
>> concerns that we can take care of in other ways.
>> For example, the bytecode
>> verifier knows nothing about generics, and simply relies on the compiler
>> producing a mostly right program, leaving dynamic casts all over the place.
>> That’s probably a good thing, since Java generics aren’t even close to
>> sound, so even if the linking architecture did anything with them it would
>> still require lots of dynamic checks. Avail has some of the same issues
>> with genericity, but I’m guessing our primitives already provide a much
>> larger collection of useful guarantees than the JVM and JRE. And as soon as
>> we start to inline generics we’ll get back not only the cost of those
>> checks, but probably some vastly wider opportunities related to the
>> semantics of the built-in types. Java just has integral types, garbage
>> collected type-safe pointers, and structs with volatile and final
>> annotations. It’s a great substrate for low-level programs, but it doesn’t
>> really have any facilities for scaling up when these low-level parts are
>> combined. Inlining and statistics-driven polymorphic dispatching are about
>> it, but they mostly just keep programmers from doing stuff like inlining on
>> their own (a goal of Avail that predates Java).
>>
>> Just so you’re aware, we’re expecting to get some of our best optimizations
>> out of the looping, folding, and conditional operations, or really any
>> operations that do control inversion by passing functions and/or
>> continuations. We’ll strongly bias inlining decisions based on whether a
>> function or continuation is being passed to external code. By preferring to
>> inline those, we can not only fold away their construction, but also convert
>> things like “Restart_” into mere jumps, opening the door for traditional
>> loop optimizations.
>>
>>
>>> On Apr 10, 2015, at 3:09 PM, Robbert van Dalen <[email protected]>
>>> wrote:
>>>
>>> Mark,
>>>
>>> I’ve ‘ported’ the TreeTuple implementation to Scala to learn more about its
>>> inner workings.
>>> I have yet to find a major performance issue: your TreeTuple code is pretty
>>> fit!
>>> I wonder what the source of the implementation is - or did you invent it
>>> yourself?
>>>
>>> Anyway, I’m doing some (performance) experiments with map and fold over
>>> (Tree)Tuples.
>>> The Scala version’s performance is pretty good (using specialisation and by
>>> leveraging some JVM magic):
>>>
>>> Mapping over a 100-million-element TreeTuple (add 1 to all elements) -
>>> using 128-wide native array-backed sub-Tuples - takes ~5 seconds.
>>> Folding over the same TreeTuple (sum all elements) takes ~0.3 seconds.
>>>
>>> Of course, the Avail version is orders of magnitude slower as the L2 and L3
>>> compilers are not yet there.
>>> Consider that a benchmark for us to shoot at!
>>>
>>> cheers,
>>> Robbert.
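[For reference, the shape of the fold benchmark quoted above - a minimal sketch of mine, not the actual Scala or Avail code, with hypothetical class names: because each leaf holds a flat primitive array, the hot loop of a fold is a plain sequential sum with only one virtual call per chunk rather than per element.]

```java
import java.util.List;

// Sketch: folding (summing) over a chunked tree of int arrays.
abstract class Node {
    abstract long sum();
}

final class Leaf extends Node {
    final int[] elements;  // flat, cache-friendly chunk (e.g. 128 wide)
    Leaf(int[] elements) { this.elements = elements; }
    @Override long sum() {
        long s = 0;
        for (int e : elements) s += e;  // tight primitive loop, no dispatch
        return s;
    }
}

final class Branch extends Node {
    final List<Node> children;
    Branch(List<Node> children) { this.children = children; }
    @Override long sum() {
        long s = 0;
        for (Node c : children) s += c.sum();  // one virtual call per chunk
        return s;
    }
}
```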
