(This is entirely unverified by experiment.)

Stuff that’s already well-known about hardware
----------------------------------------------
Flash chips are a kind of write-once read-many (“WORM”) memory: you
*can* erase them, but only in huge chunks at a time, and erasing them
is much more expensive than reading them.  They support reasonably
random-access reading, though, and also writing to places that have
been erased and not yet written again.  Typically an SPI flash chip
needs 40 clock cycles (say, 1.5μs, or 4μs if you’re running SPI off a
20MHz AVR whose SPI clock is 10MHz) to start reading or writing at a
new address, and then delivers one bit per clock cycle; so the
bandwidth-latency product is on the order of five bytes.  By
comparison, the bandwidth-latency product for the RAM on my desktop
computer is about 40 bytes.

There are other kinds of WORM storage that have been considered and
mostly rejected.  CD-Rs are WORM.  Older WORM optical discs were in
use on a number of computers in the 80s and 90s.  Two-photon
three-dimensional storage crystals are WORM.  So, for that matter, are
pen on paper, chisel on stone, brush on papyrus, and stylus on wax
tablet.  It is very often the case that WORM memory is much easier to
design and build than memory that can be rewritten an arbitrarily
large number of times.  This suggests using algorithms and data
structures designed for WORM memory, ones that minimize overwriting,
updating, and memory writing in general.

Stuff that’s probably already well-known about WORM data structures
-------------------------------------------------------------------

I’ve probably even written some of this before.

Chris Okasaki’s book about “purely functional data structures” is all
pretty applicable.  (His thesis is online and supposedly has much the
same information.)  Basically you can turn stuff into trees and make
new tree roots as necessary.  Git works this way too.  So do the
“rope” and “cord” string libraries that come with the SGI STL and the
Boehm garbage collector, respectively.  So does the NetApp WAFL
filesystem.  This approach has some advantages even in read-write RAM:
it makes it easy to provide efficient undo, it works well with
generational garbage collectors, it can seamlessly integrate data
stored in ROM or mapped read-only from an executable, and you never
have to copy data to avoid aliasing bugs.

Norm Hardy pointed out, I think, that Bloom filters work nicely in a
WORM where you can set individual bits one at a time.  This doesn’t
work very well if there’s some kind of ECC coding on top of the
memory, since that keeps you from updating individual bits
independently; NAND flash typically doesn’t let you program single
bits anyway, and I’m not sure even modern NOR flash is reliable when
written one bit at a time.

Another fairly general approach to updatable variables is to allocate
contiguous slots for a number of versions of them.  To update the
variable, you write a new version of its value in the first empty
slot.  To read its value, you read through the slots in order until
you come to a slot that’s empty, then use the value in the slot before
that.  “Empty” could mean all 1s, all 0s, or having an invalid
checksum, depending on the hardware.  You can stick space for one or
two pointers at the end in case you run out of slots, pointing to
further updates.
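Here’s a minimal sketch of that slot scheme in C, assuming a
NOR-flash-style part where erased bytes read as all 1s, and assuming
hypothetical flash_read() and flash_program() routines for byte-level
access; it’s meant to illustrate the idea, not any particular chip’s
driver.

    #include <stdint.h>

    /* Hypothetical byte-level access to the flash/WORM part. */
    extern void flash_read(uint32_t addr, void *buf, uint32_t len);
    extern void flash_program(uint32_t addr, const void *buf, uint32_t len);

    #define SLOT_SIZE 4          /* one 32-bit value per slot */
    #define NUM_SLOTS 16         /* contiguous slots reserved for the variable */

    static int slot_is_empty(const uint8_t *slot)
    {
      /* "Empty" here means still erased: all 1s on NOR flash.  (A real
         version would add a validity bit or checksum so that a stored
         value of all 1s isn't mistaken for an empty slot.) */
      for (int ii = 0; ii < SLOT_SIZE; ii++)
        if (slot[ii] != 0xFF) return 0;
      return 1;
    }

    /* Read the current value: scan until the first empty slot, then use
       the value in the slot before it.  Returns 0 if nothing was ever
       written. */
    int worm_var_read(uint32_t base, uint32_t *value_out)
    {
      uint8_t slot[SLOT_SIZE];
      int last_full = -1;
      for (int ii = 0; ii < NUM_SLOTS; ii++) {
        flash_read(base + ii*SLOT_SIZE, slot, SLOT_SIZE);
        if (slot_is_empty(slot)) break;
        last_full = ii;
      }
      if (last_full < 0) return 0;
      flash_read(base + last_full*SLOT_SIZE, value_out, SLOT_SIZE);
      return 1;
    }

    /* Update the variable: write the new value into the first empty
       slot.  Returns 0 if all the slots are used up, which is when
       you'd follow or create a continuation pointer. */
    int worm_var_update(uint32_t base, uint32_t value)
    {
      uint8_t slot[SLOT_SIZE];
      for (int ii = 0; ii < NUM_SLOTS; ii++) {
        flash_read(base + ii*SLOT_SIZE, slot, SLOT_SIZE);
        if (slot_is_empty(slot)) {
          flash_program(base + ii*SLOT_SIZE, &value, SLOT_SIZE);
          return 1;
        }
      }
      return 0;
    }

Reading costs a linear scan of the slots, which is why the pointer
tree described next is worth the trouble once a variable has been
updated many times.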
If you use two pointers, you can prune off the left subtree when it
gets too big, allowing the access to the latest data value to be
O(lg N) in the number of updates thus far instead of O(N), like so:

          1
         / \
        2   3
           / \
          4   \
         / \   \
        5   6   7
               / \
              8   15
             / \
            9   12
           / \ / \
         10 11 13 14

Here the numbers give the order in which the nodes were written; to
find the latest value you follow the rightmost non-null pointer at
each node, so the pruned-off left subtrees (of sizes 1, 3, 7, ...)
never have to be traversed again.

This is a particular instance of a general technique: data structures
with null pointers that can later be overwritten with non-null
pointers, as long as the “null” bit pattern happens to be the “blank”
bit pattern of the underlying storage medium.  For example, you can
implement a linked list that you can append to by overwriting the null
pointer that terminates the list with a pointer to a new node (which
includes a fresh null pointer of its own).

As an application of this, you can implement the usual hash table with
separate chaining in more or less the same way you normally would: the
table itself starts full of null pointers, as do the next-pointers of
the hash-table entries linked from it.  But to insert a new item, you
have to follow the hash chain to its end and tack the new item on
there, instead of inserting it at the beginning of the chain as you
normally would.  You may even be able to delete items from it: just
overwrite the key in the hash-table entry (in NOR flash, for example,
you can set it to all zeroes).  But obviously it won’t be very
efficient if you keep doing that.
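Here’s a sketch of the appendable list, again assuming NOR-flash-style
storage where erased words read as all 1s (so the all-1s word can
serve as the null pointer), the same hypothetical flash_read() and
flash_program() routines as above, and a hypothetical flash_alloc()
that hands out erased space; it illustrates the overwrite-the-null
trick, not a worked-out allocator.

    #include <stddef.h>
    #include <stdint.h>

    extern void flash_read(uint32_t addr, void *buf, uint32_t len);
    extern void flash_program(uint32_t addr, const void *buf, uint32_t len);
    extern uint32_t flash_alloc(uint32_t len);   /* returns erased space */

    #define NIL 0xFFFFFFFFu      /* the erased bit pattern doubles as null */

    struct node {                /* a list node as laid out in flash */
      uint32_t value;
      uint32_t next;             /* left erased (NIL), maybe programmed later */
    };

    /* Append a value by walking to the NIL pointer that terminates the
       list and overwriting it with the address of a freshly written
       node.  head_ptr_addr is the flash address of the head pointer. */
    void worm_list_append(uint32_t head_ptr_addr, uint32_t value)
    {
      /* Write the new node first; its next field stays erased. */
      uint32_t new_addr = flash_alloc(sizeof(struct node));
      flash_program(new_addr + offsetof(struct node, value),
                    &value, sizeof value);

      /* Find the NIL pointer at the end of the list. */
      uint32_t link_addr = head_ptr_addr;
      for (;;) {
        uint32_t link;
        flash_read(link_addr, &link, sizeof link);
        if (link == NIL) break;
        link_addr = link + offsetof(struct node, next);
      }

      /* Overwrite the erased pointer in place (1 bits becoming 0 bits). */
      flash_program(link_addr, &new_addr, sizeof new_addr);
    }

Inserting into the WORM hash table is the same walk, applied to the
chain hanging off whichever bucket the key hashes to.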
Stuff that’s already well-known about sorting
---------------------------------------------

This probably won’t be news to anybody who passed CS 101.  Most of it
is stuff I learned from Wikipedia.  (In some cases, from Wikipedia
pages I wrote myself, but still.)

For sorting small amounts of data, the preferred algorithm on current
machines is insertion sort:

    void insertion_sort(int *numbers, int len)
    {
      for (int sorted_size = 0; sorted_size < len; sorted_size++) {
        int next_item = numbers[sorted_size];

        /* Find the place to insert the next item */
        int pos;
        for (pos = 0; pos < sorted_size; pos++)
          if (numbers[pos] > next_item) break;

        /* Move over other items to make room */
        for (int ii = sorted_size; ii > pos; ii--)
          numbers[ii] = numbers[ii-1];

        /* Insert that item */
        numbers[pos] = next_item;
      }
    }

This requires about ¼len² comparisons and about ¼len² writes: it makes
len comparison passes over a sorted prefix whose size grows linearly
from 0 up to len, which would be len·(len-1)/2 comparisons in total
except that each pass stops, on average, halfway through; and then it
makes len copying passes that start, on average, halfway through that
same growing prefix.  As Wikipedia or any undergraduate algorithms
course will tell you [0], this is Θ(N²) and so it’s no good for more
than a few tens of items, but for short sequences it’s about as fast
as you can get.  The normal approach is to use a more efficient
recursively-defined algorithm like mergesort or quicksort for big
arrays, but those algorithms have substeps that involve sorting
smaller chunks, and insertion sort is the method of choice for sorting
small enough chunks.  Supposedly the typical crossover point where
that makes sense is somewhere around 8–20 elements.

Another algorithm that’s roughly equivalent is selection sort:

    void selection_sort(int *numbers, int len)
    {
      for (int sorted_size = 0; sorted_size < len; sorted_size++) {
        /* Find the smallest item */
        int smallest = sorted_size;
        for (int ii = sorted_size; ii < len; ii++)
          if (numbers[ii] < numbers[smallest]) smallest = ii;

        /* Swap it into place */
        int tmp = numbers[sorted_size];
        numbers[sorted_size] = numbers[smallest];
        numbers[smallest] = tmp;
      }
    }

It’s also Θ(N²) and very simple, but it does more like ½len²
comparisons and then only about 2·len memory writes, so in RAM it’s
about the same performance.  Well, this is clearly much more suitable
for an environment where writing costs a lot more than reading!
(Wikipedia will also tell you that.)

(Naturally, in a WORM environment, you write the sorted list somewhere
other than on top of the original list.  This would seem to cut the
number of writes down from 2·len to just len, but you need to mark
items somehow to indicate that they’ve already been copied to the
output.  So you need another parallel array: a bitmap if possible; if
not, a bytemap if possible; if not, a binary tree.  Unfortunately, a
binary tree is almost certain to be so big that you’d be better off
with mergesort.)

----

The binary heap is a data structure that’s useful for sorting.  It’s a
binary tree stored row by row: first the root, then the two children
of the root, then their four children, then their eight children, and
so on, until we’re out of nodes.  And it maintains the condition that
each node always “beats” any children it may have, so the item at the
root beats all the other nodes in the tree.  Here “beats” could be “is
greater than or equal to” (a “max-heap”) or “is less than or equal to”
(a “min-heap”), depending on what you want to do with the heap.

The clever bit is that you can take the root out of the tree and
re-establish this condition, or add another value to the tree and
re-establish it, in logarithmic time, basically by shifting items up
and down the tree to make room, maintaining the perfectly balanced
shape.  So you can efficiently pour data into the tree and suck out
whatever the top data is.

To add an item, you start at a newly added position at the bottom and
walk up the tree toward the root until you find a position where the
new item can’t go any higher (because its parent already beats it);
here’s a version for a max-heap:

    void heap_push(int *numbers, int *len, int num)
    {
      int pos = (*len)++;

      /* Bubble the number up from the bottom. */
      while (pos > 0 && num >= numbers[(pos-1)/2]) {
        numbers[pos] = numbers[(pos-1)/2];
        pos = (pos-1)/2;
      }
      numbers[pos] = num;
    }

To remove the top item, you take the item off the end of the heap and
conceptually stick it at the root, and then walk down the tree from
the root, always promoting the best child by swapping it, until you
find a place where the item from the end beats both of its children —
probably pretty low, but hey, man, it’s totally logarithmic.

    int heap_pop(int *numbers, int *len)
    {
      int rv = numbers[0];        /* Remove the number at the top of the heap; */
      int num = numbers[--*len];  /* replace it with the one from end of heap. */

      /* Trickle it down to its correct position. */
      int pos = 0, kidpos = 1;
      while (kidpos < *len) {
        if (kidpos + 1 < *len && numbers[kidpos+1] > numbers[kidpos]) kidpos++;
        if (num >= numbers[kidpos]) break;
        numbers[pos] = numbers[kidpos];
        pos = kidpos;
        kidpos = 2*pos + 1;       /* The left child. */
      }
      numbers[pos] = num;
      return rv;
    }

So with those two fairly simple bits of code, you can very easily
write a sorting algorithm that achieves optimal worst-case and
average-case performance, Θ(N lg N), by first putting all your items
into a heap, and then taking them out, one by one:

    void slow_heapsort(int *numbers, int len)
    {
      int lenb = 0;
      for (int ii = 0; ii < len; ii++) heap_push(numbers, &lenb, numbers[ii]);
      for (int ii = len; ii > 0; ii--) numbers[ii-1] = heap_pop(numbers, &lenb);
    }

Unlike the other Θ(N lg N) sorting algorithms — mergesort, quicksort,
and library sort — heapsort runs in constant extra space.  (It turns
out there’s a faster way of building the initial heap, and a
usually-faster way to do the ordinary heap-pop operation, but they
don’t affect my points here, so I’m leaving them out.)

But it also turns out that heaps are very useful for things other than
heapsort.  For example, they’re widely used in multiway mergesort, to
pick which input run the next output item comes from.

Stuff that might be new about sorting
-------------------------------------

Well, suppose you have 200 bytes or so of RAM space to use to sort
your data with, in addition to your Flash or other WORM storage.  Can
you do better?

First, if random reads from the WORM are reasonably efficient, you
should probably sort pointers or array indices instead of full data
records.  So the rest of this section is devoted to the case where the
things you’re copying around are relatively small.

Consider a hybrid of selection sort and heapsort.  Instead of
maintaining a single smallest item, you can maintain a heap of, say,
50 items in your RAM, holding the 50 smallest items you’ve seen so far
(a max-heap, so that when a smaller item comes along, you can cheaply
evict the largest of the 50).  You walk through the unsorted part of
the list to find the 50 smallest items in it, requiring N WORM reads
and about 3·N swaps in your heap, and then you write out those 50
items in 50 WORM writes and about 300 swaps.  You’re still doing the
same number of writes as before (one per item) but you’re only doing
one-fiftieth the reads from the WORM.

But the reads will still dominate for a large enough array, since
they’re Θ(N²).  And for a large enough array, you’ll still want to
move to an O(N lg N) algorithm like mergesort.  But mergesort needs to
write stuff all the time.  Where’s the tradeoff point?

Picking vaguely plausible factors, suppose that a single heap swap
takes about 6 clock cycles, reading in a random item from the WORM
takes about 128 clock cycles, reading in an additional sequential item
takes about 32 cycles, writing an item to the WORM costs something
equivalent to 512 clock cycles, and you have enough RAM for 50 items.

Sorting 1024 elements purely with mergesort would require two passes:
one pass to turn those elements into 100-or-so-element sorted runs
(about 128 + 1024·32 + 1024·512 cycles of I/O, plus about 1024·18
cycles of heap operations) and a second pass to merge them into the
final result (another similar cost).  Roughly 1048576 cycles,
essentially all due to the write cost.

(If you do a mergesort on pairs of sorted runs instead of ten or
eleven at a time, you’d need five passes instead of two, which would
cost 2½ times as much.  So don’t do that.  Getting to sorted runs of
100 with an external mergesort would also require five passes, so
don’t do that either.  Instead, sort items in RAM as long as possible,
reading through the input while maintaining a heap that shrinks to
make space for input values that are too low to output to the current
sorted run.  Worst-case, this gives you sorted runs of 50;
average-case, it gives you 100; and if the input is already sorted or
nearly so, the entire thing passes through as a single sorted run.)
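To make that last parenthetical concrete, here’s a sketch of the
run-formation trick (it’s the classic replacement-selection scheme),
using min-heap variants of the heap code above; read_item(),
write_item(), and end_run() are hypothetical stand-ins for the real
WORM I/O, and the point is the shrinking heap, not the I/O details.

    /* Hypothetical sequential I/O on the WORM: read_item() returns 0 at
       end of input; write_item() appends to the current output run;
       end_run() marks the boundary between sorted runs. */
    extern int  read_item(int *out);
    extern void write_item(int value);
    extern void end_run(void);

    #define M 50                   /* how many items' worth of RAM we have */

    /* Min-heap versions of heap_push and heap_pop above. */
    static void min_push(int *h, int *n, int num)
    {
      int pos = (*n)++;
      while (pos > 0 && num <= h[(pos-1)/2]) {
        h[pos] = h[(pos-1)/2];
        pos = (pos-1)/2;
      }
      h[pos] = num;
    }

    static int min_pop(int *h, int *n)
    {
      int rv = h[0];
      int num = h[--*n];
      int pos = 0, kidpos = 1;
      while (kidpos < *n) {
        if (kidpos + 1 < *n && h[kidpos+1] < h[kidpos]) kidpos++;
        if (num <= h[kidpos]) break;
        h[pos] = h[kidpos];
        pos = kidpos;
        kidpos = 2*pos + 1;
      }
      h[pos] = num;
      return rv;
    }

    /* Replacement selection: emit sorted runs whose average length is
       about 2*M (worst case M, and a single run if the input is
       already sorted). */
    void make_runs(void)
    {
      int ram[M];
      int heaplen = 0, pending = 0, x;

      while (heaplen < M && read_item(&x))   /* prime the heap */
        min_push(ram, &heaplen, x);

      while (heaplen > 0) {
        int smallest = min_pop(ram, &heaplen);
        write_item(smallest);
        if (read_item(&x)) {
          if (x >= smallest)
            min_push(ram, &heaplen, x);      /* still fits in this run */
          else
            ram[M - ++pending] = x;          /* hold it for the next run */
        }
        if (heaplen == 0) {                  /* this run is finished */
          end_run();
          while (pending > 0)                /* start the next run */
            min_push(ram, &heaplen, ram[M - pending--]);
        }
      }
    }

The heap shrinks because items destined for the next run are parked at
the far end of the same array rather than being pushed.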
(For the rest of this, I’m going to optimistically assume that the
cost of maintaining a bitmap of already-written-out input items is
small, because it doesn’t require allocating any extra space per
update; it only requires clearing bits in already-allocated space.)

Sorting those 1024 elements with the heap/selection-sort hybrid would
require 21 passes, reading 1024 items and 1024 bitmap entries each
time, but only writing out 1024 items.  The 43000 mostly-sequential
reads cost 42·128 + 43000·32 cycles, and the associated 21000 heap
operations cost 21000·6 cycles.  Total is about 1.5 million.  Writing
the output still costs 524288 cycles.  So we spent 1.5 million cycles
to save half a million.

The breakeven point is around 250 items: there you need 5 passes, with
2500 sequential reads and heap operations costing about 2500·(32+6) ≈
100k cycles, and writing the output costs about 125k cycles, for a
total of about 225k.  Straight mergesort would require two passes of
writing and 250k cycles.

The improvement is greater on smaller amounts of data.  Consider a
100-item sequence that happened not to be sortable with the
entirely-in-memory streaming-heap approach mentioned earlier (e.g.
because the value that must appear first in the output occurs after
position #50 in the input).  Mergesort requires 200 writes and two
passes of reading; heap/selection sort requires 100 writes plus two
passes of reading.  (Plus the work to maintain the bitmap.)  So in
this cost model, the heap/selection hybrid is faster than mergesort
from about 50 to 100 items up to about 250 items.

So what does that mean?  It means you get some kind of win from the
following strategy:

- For collections of up to 50 items, sort in memory.

- For collections of up to 100 or maybe 1000 items, try to sort in a
  single pass in memory first before trying anything else.

- For collections from 50 to 250 items that can’t be sorted in a
  single pass in memory, use the heap/selection hybrid directly.

- For collections from 250 to 125000 items, do an initial pass of the
  heap/selection hybrid on segments of the input, then an up-to-50-way
  merge.

- For larger collections, likewise, plus further merging passes.

How many segments?  There’s no reason to do segments smaller than 50
items, since you can sort those in RAM.  Fewer, bigger segments allow
you to do a little more input buffering during the merge phase, which
could speed it up a little, but they increase the time wasted on extra
passes over the input during the initial hybrid phase.

One thing I’ve been glossing over is that mergesort can keep just the
input items in its heap; heap/selection sort also needs to keep their
indices in the input, so that it can mark them in the bitmap at the
end, which means that its heap can store fewer items in the same
number of bytes.  You could keep *only* the indices, but then
comparisons between records in the heap require going out to read keys
from the WORM, which would be a substantial slowdown.
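To make that bookkeeping concrete, here’s a sketch of one output pass
of the hybrid, keeping (value, index) pairs in a 50-entry max-heap as
described above; read_input(), write_output(), already_output(), and
mark_output() are hypothetical stand-ins for the WORM accesses and the
preallocated bitmap, and the buffering and cost-model details above
are ignored.

    /* Hypothetical glue to the WORM and to a preallocated bitmap whose
       bits start out set and can only be cleared, NOR-flash style. */
    extern int  read_input(int ii);              /* item ii of the unsorted input */
    extern void write_output(int rank, int v);   /* slot `rank` of the sorted output */
    extern int  already_output(int ii);          /* was input item ii copied out yet? */
    extern void mark_output(int ii);             /* clear its bit in the bitmap */

    #define K 50    /* how many (value, index) pairs fit in the RAM budget */

    struct cand { int value; int index; };

    /* Max-heap on value, so the worst of the current best K is at the root. */
    static void sift_down(struct cand *heap, int n, int pos)
    {
      struct cand tmp = heap[pos];
      for (;;) {
        int kid = 2*pos + 1;
        if (kid >= n) break;
        if (kid + 1 < n && heap[kid+1].value > heap[kid].value) kid++;
        if (tmp.value >= heap[kid].value) break;
        heap[pos] = heap[kid];
        pos = kid;
      }
      heap[pos] = tmp;
    }

    /* One pass: copy the next (up to) K smallest not-yet-output items to
       the output, starting at sorted position `rank`; returns how many. */
    static int hybrid_pass(int n, int rank)
    {
      struct cand heap[K];
      int count = 0;

      for (int ii = 0; ii < n; ii++) {
        if (already_output(ii)) continue;      /* one bitmap read per item */
        int v = read_input(ii);
        if (count < K) {                       /* heap not yet full: just add */
          int pos = count++;
          heap[pos].value = v;
          heap[pos].index = ii;
          while (pos > 0 && heap[pos].value > heap[(pos-1)/2].value) {
            struct cand t = heap[pos];
            heap[pos] = heap[(pos-1)/2];
            heap[(pos-1)/2] = t;
            pos = (pos-1)/2;
          }
        } else if (v < heap[0].value) {        /* beats the worst of the best K */
          heap[0].value = v;
          heap[0].index = ii;
          sift_down(heap, K, 0);
        }
      }

      /* Pop in descending order; the largest of these K lands last. */
      for (int left = count; left > 0; left--) {
        struct cand top = heap[0];
        write_output(rank + left - 1, top.value);
        mark_output(top.index);
        heap[0] = heap[left - 1];
        sift_down(heap, left - 1, 0);
      }
      return count;
    }

    void hybrid_sort(int n)
    {
      int done = 0;
      while (done < n)
        done += hybrid_pass(n, done);
    }

Keeping both the value and the index in each heap entry is exactly the
space overhead the previous paragraph complains about.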
These threshold numbers (50, 100, 1000, 250, 125000) depend on the
particular characteristics of your CPU, RAM, and WORM.  A sufficiently
high cost of space allocation will drive up the threshold at which the
heap/selection hybrid is cheaper than mergesort.  A WORM that doesn’t
provide a reasonably efficient way to mark input items as copied to
the output (setting a byte, at least) will make heap/selection sort
slower than mergesort for even small sizes of input, since it could
more than double its output size.  Large seek times for reads could
counteract this by making fifty-way merges impractically slow, forcing
mergesort to buffer more and use more passes.  Heap/selection sort is
even more brutally sequential in its memory access patterns than a
many-way mergesort.

Lazy sorting
------------

In a lot of cases it’s probably adequate to do a merge every time you
need to access the supposedly-sorted data, rather than writing an
entire new sorted copy of it; this is the standard “side file”
technique from transaction processing.  Lucene more or less uses this
approach.

[0] This phrase isn’t here to make you feel dumb if you don’t know
this stuff.  It’s here to make sure that you know I don’t think I’m
smart for knowing it.

--
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol