(This is entirely unverified by experiment.)

Stuff that’s already well-known about hardware
----------------------------------------------

Flash chips are a kind of write-once read-many (“WORM”) memory: you
*can* erase them, but only huge chunks at a time; and erasing them is
much more expensive than reading them.

They support reasonably random-access reading, though, and also
writing to places that have been erased and not yet written
again. Typically an SPI flash chip needs 40 clock cycles (say, 1.5μs,
or 4μs if you’re running SPI off a 20MHz AVR whose SPI clock is 10MHz)
to start reading or writing at a new address, and then delivers one
bit per clock cycle; so the bandwidth-latency product is on the order
of five bytes. By comparison, the bandwidth-latency product for the
RAM on my desktop computer is about 40 bytes.

There are other kinds of WORM storage that have been considered and
mostly rejected. CD-Rs are WORM. Older WORM optical discs were in use
on a number of computers in the 80s and 90s. Two-photon
three-dimensional storage crystals are WORM. So, for that matter, are
pen on paper, chisel on stone, brush on papyrus, and stylus on wax
tablet. It is very often the case that WORM memory is much easier to
design and build than memory that can be rewritten an arbitrarily
large number of times.

This suggests using algorithms and data structures designed for WORM
memory, designed to minimize overwriting, updating, and memory writing
in general.

Stuff that’s probably already well-known about WORM data structures
-------------------------------------------------------------------

I’ve probably even written some of this before.

Chris Okasaki’s book about “purely functional data structures” is all
pretty applicable. (His thesis is online and supposedly has much the
same information.) Basically you can turn stuff into trees and make
new tree roots as necessary. Git works this way too. So do the “rope”
and “cord” string libraries that come with the SGI STL and the Boehm
garbage collector. So does the NetApp WAFL filesystem. This approach
has some advantages even in read-write RAM: it makes it easy to
provide efficient undo, it works well with generational garbage
collectors, it can integrate data stored in ROM or read-only mapped
from an executable seamlessly, and you never have to copy data to
avoid aliasing bugs.
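
For concreteness, here’s a minimal sketch (mine, not Okasaki’s) of the
path-copying idea for a persistent binary search tree: inserting a key
allocates new nodes only along the path from the root down to the
insertion point, and the old root stays valid, so all the untouched
nodes could just as well live read-only in WORM or ROM. Here malloc
stands in for appending a node to unwritten storage.

    #include <stdlib.h>

    typedef struct node {
      int key;
      const struct node *left, *right;   /* may point into WORM or ROM */
    } node;

    /* Return a new tree containing key, sharing every untouched subtree
       with old; old itself is never modified. */
    const node *insert(const node *old, int key) {
      node *n = malloc(sizeof *n);       /* stand-in for "append to WORM" */
      if (!old) {
        n->key = key;
        n->left = n->right = NULL;
      } else {
        *n = *old;                       /* copy only the node on the path */
        if (key < old->key)      n->left = insert(old->left, key);
        else if (key > old->key) n->right = insert(old->right, key);
      }
      return n;
    }

Two versions built this way share all of their untouched structure,
which is where the cheap undo and the freedom from aliasing bugs
mentioned above come from.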

Norm Hardy pointed out, I think, that Bloom filters work nicely in a
WORM where you can set individual bits one at a time. This doesn’t
work if there’s some kind of ECC coding on top of the memory, since
the ECC keeps you from programming individual bits independently; NAND
flash typically doesn’t let you program single bits either, and I’m
not sure even modern NOR flash is reliable when used this way.
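
Still, supposing for illustration that the hardware gives you
hypothetical primitives worm_bit_is_set() and worm_set_bit() that can
test and program one bit at a time (on NOR flash, “setting” a filter
bit would really mean programming it from its erased 1 to a 0), the
filter code itself is tiny; the hash function here is only
illustrative:

    /* Hypothetical single-bit WORM primitives; see the caveats above. */
    extern int  worm_bit_is_set(unsigned bit_index);
    extern void worm_set_bit(unsigned bit_index);

    #define FILTER_BITS 8192   /* e.g. one erased 1-kilobyte region */
    #define NUM_HASHES  4

    /* Any decent hash family works; this FNV-1a variant is illustrative. */
    static unsigned hash(const char *key, unsigned seed) {
      unsigned h = 2166136261u ^ seed;
      while (*key) h = (h ^ (unsigned char)*key++) * 16777619u;
      return h % FILTER_BITS;
    }

    void bloom_add(const char *key) {
      for (unsigned i = 0; i < NUM_HASHES; i++)
        worm_set_bit(hash(key, i));
    }

    int bloom_maybe_contains(const char *key) {
      for (unsigned i = 0; i < NUM_HASHES; i++)
        if (!worm_bit_is_set(hash(key, i)))
          return 0;                      /* definitely never added */
      return 1;                          /* probably added; false positives possible */
    }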

Another fairly general approach to updatable variables is to allocate
contiguous slots for a number of versions of them. To update the
variable, you write a new version of its value in the first empty
slot. To read its value, you read through all the slots in order until
you come to a slot that’s empty, then use the value in the slot before
that. “Empty” could mean all 1s, all 0s, or having an invalid
checksum, depending on the hardware.
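
A sketch of that, assuming NOR-flash-like behavior where an erased
word reads as all 1s, a hypothetical worm_write_word() that can
program one previously-erased word in place, and values that never
equal the erased pattern (otherwise you’d add the validity flag or
checksum mentioned above):

    #define NUM_SLOTS 16
    #define EMPTY 0xFFFFFFFFu            /* erased NOR flash reads as all 1s */

    /* Hypothetical primitive: program one previously-erased word. */
    extern void worm_write_word(unsigned *slot, unsigned value);

    /* Read the variable: the value in the last non-empty slot. Returns 0
       if it has never been written. */
    int var_read(const unsigned slots[NUM_SLOTS], unsigned *value) {
      int last = -1;
      for (int i = 0; i < NUM_SLOTS && slots[i] != EMPTY; i++) last = i;
      if (last < 0) return 0;
      *value = slots[last];
      return 1;
    }

    /* Update the variable: program the first empty slot. Returns 0 when
       all the slots are used up (see the pointer trick below). */
    int var_update(unsigned slots[NUM_SLOTS], unsigned value) {
      for (int i = 0; i < NUM_SLOTS; i++)
        if (slots[i] == EMPTY) {
          worm_write_word(&slots[i], value);
          return 1;
        }
      return 0;
    }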

You can stick space for one or two pointers at the end in case you run
out of slots, pointing to further updates. If you use two pointers,
you can prune off the left subtree when it gets too big, allowing the
access to the latest data value to be O(lg N) in the number of updates
thus far instead of O(N), like so:

      1
     / \
    2   3
       / \
      4    \
     / \    7
    5   6  / \
          8   15
         / \
       9    12
      / \   / \
     10 11 13 14

This is a particular instance of a general technique: data structures
with null pointers that can later be overwritten with non-null
pointers, as long as the “null” bit pattern happens to be the “blank”
bit pattern of the underlying storage medium.

For example, you can implement a linked list that you can append to by
overwriting the null pointer that terminates the list with a pointer
to a new node (including a fresh null pointer). As an application of
this, you can implement the usual hash table with separate chaining in
more or less the same way you normally would: the table itself starts
full of null pointers, as do the next-pointers of hash-table entries
linked from it. But to insert a new item, you have to follow the hash
chain to its end and tag on the new item there, instead of inserting
it at the beginning of the chain as you normally would.

You may even be able to delete items from it: just overwrite the key
in the hash-table entry (in NOR flash, for example, you can set it to
all zeroes). But obviously it won’t be very efficient if you keep
doing that.
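
Here’s a sketch of that hash table, assuming hypothetical primitives
for appending a fresh entry to unwritten WORM space and for
programming a pointer field that still holds the blank pattern; for
simplicity it pretends the blank pattern reads back as a C null
pointer, where real hardware would make you compare against the erased
bit pattern instead:

    struct entry {
      unsigned key;                /* could be zeroed later to delete */
      unsigned value;
      struct entry *next;          /* starts blank, programmable once */
    };

    /* Hypothetical WORM primitives. */
    extern struct entry *worm_new_entry(unsigned key, unsigned value);
    extern void worm_set_pointer(struct entry **field, struct entry *target);

    #define NBUCKETS 64
    extern struct entry *table[NBUCKETS];   /* starts out entirely blank */

    void hash_insert(unsigned key, unsigned value) {
      struct entry *e = worm_new_entry(key, value);
      struct entry **link = &table[key % NBUCKETS];
      while (*link) link = &(*link)->next;  /* walk to the end of the chain */
      worm_set_pointer(link, e);            /* overwrite the blank terminator */
    }

    struct entry *hash_lookup(unsigned key) {
      for (struct entry *e = table[key % NBUCKETS]; e; e = e->next)
        if (e->key == key) return e;
      return NULL;
    }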

Stuff that’s already well-known about sorting
---------------------------------------------

This probably won’t be news to anybody who passed CS 101. Most of it
is stuff I learned from Wikipedia. (In some cases, from Wikipedia
pages I wrote myself, but still.)

For sorting small amounts of data, the preferred algorithm on current
machines is insertion sort:

    void insertion_sort(int *numbers, int len) {
      for (int sorted_size = 0; sorted_size < len; sorted_size++) {
        int next_item = numbers[sorted_size];

        /* Find the place to insert the next item */
        int pos;
        for (pos = 0; pos < sorted_size; pos++)
          if (numbers[pos] > next_item) break;

        /* Move over other items to make room */
        for (int ii = sorted_size; ii > pos; ii--) numbers[ii] = numbers[ii-1];

        /* Insert that item */
        numbers[pos] = next_item;
      }
    }

This requires about ¼len² comparisons and about ¼len² writes: it makes
len insertion passes over a sorted prefix whose size grows linearly
from 0 up to len, so scanning every prefix in full would cost
len·(len-1)/2 comparisons, but each scan stops, on average, halfway
through; and likewise each pass then copies, on average, half of that
prefix over by one slot to make room.

As Wikipedia or any undergraduate algorithms course will tell you [0],
this is Θ(N²) and so it’s no good for tens of items, but for short
sequences it’s about as fast as you can get. The normal approach is to
use a more efficient recursively-defined algorithm like mergesort or
quicksort for big arrays, but those algorithms have substeps that
involve sorting smaller chunks, and insertion sort is the method of
choice for sorting small enough chunks. Supposedly the typical
crossover point where that makes sense is somewhere around 8–20
elements.

Another algorithm that’s roughly equivalent is selection sort:

    void selection_sort(int *numbers, int len) {
      for (int sorted_size = 0; sorted_size < len; sorted_size++) {
        /* Find the smallest item */
        int smallest = sorted_size;
        for (int ii = sorted_size; ii < len; ii++)
          if (numbers[ii] < numbers[smallest]) smallest = ii;

        /* Swap it into place */
        int tmp = numbers[sorted_size];
        numbers[sorted_size] = numbers[smallest];
        numbers[smallest] = tmp;
      }
    }

It’s also Θ(N²) and very simple, but it does more like ½len²
comparisons and then 2·len memory writes, so about the same
performance.

Well, this is clearly much more suitable for an environment where
writing costs a lot more than reading! (Wikipedia will also tell you
that.)

(Naturally, in a WORM environment, you write the sorted list somewhere
other than on top of the original list. This would seem to cut the
number of writes down from 2·len to just len, but you need to mark
items somehow to indicate that they’ve already been copied to the
output. So you need another parallel array: a bitmap if possible; if
not, a bytemap if possible; if not, a binary tree. Unfortunately, a
binary tree is almost certain to be so big that you’d be better off
with mergesort.)

----

The binary heap is a data structure that’s useful for sorting.  It’s a
binary tree stored row by row: first the root, then the two children
of the root, then their four children, then their eight children, and
so on, until we’re out of nodes. And it maintains the condition that
each node always “beats” any children it may have, so the item at the
root beats all the nodes in the tree.

Here “beats” could be “is greater than or equal to” (a “max-heap”) or
“is less than or equal to” (a “min-heap”), depending on what you want
to do with the heap.

The clever bit is that you can take the root out of the tree and
re-establish this condition, or add another value to the tree and
re-establish it, in logarithmic time, basically by shifting items up
and down the tree to make room, maintaining the perfectly balanced
shape.  So you can efficiently pour data into the tree and suck out
whatever the top data is.

To add an item, you start at a newly added position at the bottom and
walk up the tree toward the root until you find a position where the
new item can’t go any higher (because its parent already beats it);
here’s a version for a max-heap:

    void heap_push(int *numbers, int *len, int num) {
      int pos = (*len)++;

      /* Bubble the number up from the bottom. */
      while (pos > 0 && num >= numbers[(pos-1)/2]) {
        numbers[pos] = numbers[(pos-1)/2];
        pos = (pos-1)/2;
      }
      numbers[pos] = num;
    }

To remove the top item, you take the item off the end of the heap and
conceptually stick it at the root, and then walk down the tree from
the root, always promoting the best child by swapping it, until you
find a place where the item from the end beats both of its children —
probably pretty low, but hey, man, it’s totally logarithmic.

    int heap_pop(int *numbers, int *len) {
      int rv = numbers[0];  /* Remove the number at the top of the heap; */
      int num = numbers[--*len]; /* replace it with the one from end of heap. */

      /* Trickle it down to its correct position. */
      int pos = 0, kidpos = 1;
      while (kidpos < *len) {
        if (kidpos + 1 < *len && numbers[kidpos+1] > numbers[kidpos])
          kidpos++;
        if (num >= numbers[kidpos]) break;

        numbers[pos] = numbers[kidpos];
        pos = kidpos;
        kidpos = 2*pos + 1;         /* The left child. */
      }

      numbers[pos] = num;

      return rv;
    }

So with those two fairly simple bits of code, you can very easily
write a sorting algorithm that achieves optimal worst-case and
average-case performance, Θ(N lg N), by first putting all your items
into a heap, and then taking them out, one by one:

    void slow_heapsort(int *numbers, int len) {
      int lenb = 0;
      for (int ii = 0; ii < len; ii++) heap_push(numbers, &lenb, numbers[ii]);
      for (int ii = len; ii > 0; ii--) numbers[ii-1] = heap_pop(numbers, &lenb);
    }

Unlike the other common O(N lg N) sorting algorithms (mergesort,
library sort, and quicksort, which is only O(N lg N) on average),
heapsort needs only constant extra space.

(It turns out there’s a faster way of building the initial heap, and a
usually-faster way to do the ordinary heap-pop operation, but they
don’t affect my points here, so I’m leaving them out.)

But it also turns out that heaps are very useful for things other than
heapsort. For example, they’re widely used in mergesort.
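
Specifically, an N-way merge keeps the current head of each sorted run
in a small min-heap and repeatedly pops the smallest, refilling the
heap from whichever run that head came from. Here’s a sketch, with
plain in-memory arrays standing in for the runs; in the WORM setting
these reads would go through buffered flash reads instead:

    struct head { int value; int run; };    /* current head of one run */

    /* Merge k sorted runs into out[], using a min-heap of run heads;
       this is the role heaps usually play inside a mergesort. Assumes
       k <= 64 and that out[] has room for all the items. */
    void kway_merge(int **runs, int *lens, int k, int *out) {
      struct head heap[64];
      int pos[64], n = 0, outn = 0;

      for (int r = 0; r < k; r++) {         /* push the first item of each run */
        pos[r] = 0;
        if (lens[r] == 0) continue;
        struct head h = { runs[r][0], r };
        int i = n++;
        while (i > 0 && h.value < heap[(i-1)/2].value) {
          heap[i] = heap[(i-1)/2];
          i = (i-1)/2;
        }
        heap[i] = h;
      }

      while (n > 0) {
        struct head top = heap[0];          /* smallest remaining item */
        out[outn++] = top.value;

        struct head h;                      /* its replacement at the root */
        if (++pos[top.run] < lens[top.run]) {
          h.value = runs[top.run][pos[top.run]];
          h.run = top.run;
        } else {
          if (--n == 0) break;              /* everything has been merged */
          h = heap[n];                      /* that run is exhausted */
        }

        int i = 0, kid = 1;                 /* sift it down */
        while (kid < n) {
          if (kid + 1 < n && heap[kid+1].value < heap[kid].value) kid++;
          if (h.value <= heap[kid].value) break;
          heap[i] = heap[kid];
          i = kid;
          kid = 2*i + 1;
        }
        heap[i] = h;
      }
    }

A small amount of RAM is enough to merge quite a few runs at once this
way, which is what makes the two-pass mergesorts discussed below
possible.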

Stuff that might be new about sorting
-------------------------------------

Well, suppose you have 200 bytes or so of RAM space to use to sort
your data with, in addition to your Flash or other WORM storage.
Can you do better?

First, if random reads from the WORM are reasonably efficient, you
should probably sort pointers or array indices, instead of full data
records. So the rest of this section is devoted to the case where the
things you’re copying around are relatively small.

Consider a hybrid of selection sort and heapsort. Instead of
maintaining a single smallest item, you can maintain a bounded heap
of, say, 50 items in your RAM: a max-heap of the current candidates,
so that when you see an item smaller than the worst candidate so far,
you can evict that worst one in logarithmic time. You walk through the
unsorted part of the list to find the 50 smallest items in it,
requiring N WORM reads and about 3·N swaps in your heap, and then you
write out those 50 items in 50 WORM writes and about 300 swaps.
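
Here’s a sketch of one such pass. It assumes hypothetical helpers
worm_read(), worm_write_output(), already_output(), and mark_output()
for reading the input, writing the sorted output, and testing and
marking the already-copied bitmap discussed below; the RAM heap holds
(value, input index) pairs so that the chosen items can be marked
afterwards.

    struct cand { int value; unsigned index; };   /* candidate and where it came from */

    /* Hypothetical WORM and bitmap helpers. */
    extern int  worm_read(unsigned index);              /* read one input item */
    extern void worm_write_output(unsigned pos, int v); /* write one output item */
    extern int  already_output(unsigned index);         /* test the bitmap */
    extern void mark_output(unsigned index);            /* mark an item as copied */

    #define K 50

    /* One pass: find the K smallest not-yet-output items among len input
       items, write them in ascending order at output positions
       out_base..out_base+K-1, and mark them in the bitmap. */
    void hybrid_pass(unsigned len, unsigned out_base) {
      struct cand heap[K];                 /* max-heap of the current best K */
      int n = 0;

      for (unsigned i = 0; i < len; i++) {
        if (already_output(i)) continue;
        int v = worm_read(i);
        if (n == K && v >= heap[0].value) continue;  /* not among the K smallest */

        struct cand c = { v, i };
        if (n < K) {                       /* room left: sift up */
          int p = n++;
          while (p > 0 && c.value > heap[(p-1)/2].value) {
            heap[p] = heap[(p-1)/2];
            p = (p-1)/2;
          }
          heap[p] = c;
        } else {                           /* evict the current worst: sift down */
          int p = 0, kid = 1;
          while (kid < K) {
            if (kid + 1 < K && heap[kid+1].value > heap[kid].value) kid++;
            if (c.value >= heap[kid].value) break;
            heap[p] = heap[kid];
            p = kid;
            kid = 2*p + 1;
          }
          heap[p] = c;
        }
      }

      /* Pop largest-first, filling the output chunk from its far end, so
         the chunk comes out in ascending order. */
      while (n > 0) {
        struct cand top = heap[0];
        struct cand c = heap[--n];         /* last element, to re-place at the root */
        worm_write_output(out_base + n, top.value);
        mark_output(top.index);

        int p = 0, kid = 1;
        while (kid < n) {
          if (kid + 1 < n && heap[kid+1].value > heap[kid].value) kid++;
          if (c.value >= heap[kid].value) break;
          heap[p] = heap[kid];
          p = kid;
          kid = 2*p + 1;
        }
        heap[p] = c;
      }
    }

Sorting the whole input is then a matter of calling hybrid_pass()
repeatedly, advancing out_base by 50 each time, until every input item
has been marked.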

You’re still doing the same number of writes as before (one per item)
but you’re only doing one-fiftieth the reads from the WORM.  But the
reads will still dominate for a large enough array, since they’re
Θ(N²).

And for a large enough array, you’ll still want to move to an
O(N lg N) algorithm like mergesort.  But mergesort needs to write
stuff all the time. Where’s the tradeoff point?

Picking vaguely plausible factors, suppose that a single heap swap
takes about 6 clock cycles, reading in a random item from the WORM
takes about 128 clock cycles, reading in an additional sequential item
takes about 32 cycles, writing an item to the WORM costs something
equivalent to 512 clock cycles, and you have enough RAM for 50 items.

Sorting 1024 elements purely with mergesort would require two passes:
one pass to turn those elements into 100-or-so-element sorted runs
(about 128 + 1024·32 + 1024·512 cycles of I/O, plus about 1024·18
cycles of heap operations) and a second pass to merge them into the
final result (another similar cost). Roughly 1048576 cycles,
essentially all due to the write cost.

(If you do a mergesort on pairs of sorted runs instead of ten or
eleven at a time, you’d need five passes instead of two, which would
cost 2½ times as much. So don’t do that. Getting to sorted runs of 100
with an external mergesort would also require five passes, so don’t do
that either. Instead, sort items in RAM for as long as possible: read
through the input while maintaining a heap, and whenever an input
value is too low to be output into the current sorted run, set it
aside, letting the heap shrink around it; this is the classic
“replacement selection” technique, sketched below. Worst-case, this
gives you sorted runs of 50; average-case, it gives you 100; and if
the input is already sorted or nearly so, the entire thing passes
through as a single sorted run.)
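
Here’s a sketch of that run-forming pass: a min-heap (heap_push and
heap_pop from above, with the comparisons flipped), plus hypothetical
read_next(), write_to_run(), and start_new_run() helpers standing in
for the real input and run-output I/O.

    /* Min-heap versions of heap_push/heap_pop from above. */
    static void min_push(int *h, int *n, int v) {
      int p = (*n)++;
      while (p > 0 && v < h[(p-1)/2]) { h[p] = h[(p-1)/2]; p = (p-1)/2; }
      h[p] = v;
    }

    static int min_pop(int *h, int *n) {
      int rv = h[0], v = h[--*n];
      int p = 0, kid = 1;
      while (kid < *n) {
        if (kid + 1 < *n && h[kid+1] < h[kid]) kid++;
        if (v <= h[kid]) break;
        h[p] = h[kid]; p = kid; kid = 2*p + 1;
      }
      h[p] = v;
      return rv;
    }

    /* Hypothetical streaming I/O helpers. */
    extern int  read_next(int *v);        /* 0 at end of input */
    extern void write_to_run(int v);      /* append to the current sorted run */
    extern void start_new_run(void);

    #define M 50                          /* heap slots available in RAM */

    /* Produce sorted runs averaging about 2*M items: M in the worst
       case, the whole input if it arrives already sorted. */
    void make_runs(void) {
      int buf[M], n = 0, parked = 0, v;

      while (n < M && read_next(&v)) min_push(buf, &n, v);
      if (n > 0) start_new_run();

      while (n > 0) {
        int out = min_pop(buf, &n);
        write_to_run(out);

        if (read_next(&v)) {
          if (v >= out) min_push(buf, &n, v);   /* still fits in this run */
          else buf[M - ++parked] = v;           /* too low: park it for the next run */
        }

        if (n == 0 && parked > 0) {             /* this run is done; start the next */
          while (parked > 0) min_push(buf, &n, buf[M - parked--]);
          start_new_run();
        }
      }
    }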

(For the rest of this, I’m going to optimistically assume that the
cost of maintaining a bitmap of already-written-out input items is
small: marking an item doesn’t require allocating any new space, only
clearing a bit in already-allocated space.)

Sorting them with this heap/selection hybrid would instead require 21
passes, each reading 1024 items and 1024 bitmap entries, but writing
out only 1024 items in total. The 43000 sequential reads, plus 42
random-access starts, cost 42·128 + 43000·32 cycles, and the
associated 21000 heap operations cost 21000·6 cycles; the total is
about 1.5 million cycles of reading. Writing the output still costs
524288 cycles. So we’ve spent an extra 1.5 million cycles of reading
to save half a million cycles of writing.

The breakeven point is around 250 items: there you need 5 passes, with
2500 sequential reads and about as many heap operations, costing
roughly 2500·(32+6) ≈ 100k cycles; writing the output costs about 125k
cycles, for a total of about 225k. Straight mergesort would require
two passes of writing, costing about 250k cycles.

The improvement is greater on smaller amounts of data. Consider a
100-item sequence that happened not to be sortable with the
entirely-in-memory streaming heap approach mentioned earlier
(e.g. because the value that must appear first in the output occurs
after position #50 in the input).  Mergesort requires 200 writes and
two passes of reading; heap/selection sort requires 100 writes plus
two passes of reading. (Plus the work to maintain the bitmap.)

So in this cost model, this heap/selection hybrid is faster than
mergesort from about 50 to 100 items up to about 250 items.

So what does that mean?  It means you get some kind of win from the
following strategy:

- For collections of up to 50 items, sort in memory.
- For collections of up to 100 or maybe 1000 items, try to sort in
  a single pass in memory first before trying anything else.
- For collections from 50 to 250 items that can’t be sorted in a
  single pass in memory, use the heap/selection hybrid directly.
- For collections from 250 to 125000 items, do an initial pass of the
  heap/selection hybrid on segments of the input, then an up-to-50-way
  merge.
- For larger collections, likewise, plus further merging passes.

How many segments? There’s no reason to do segments smaller than 50
items, since you can sort those in RAM. Fewer, bigger segments allow
you to do a little more input buffering during the merge phase, which
could speed it up a little, but increase the time wasted due to
needing more passes over the input.

One thing I’ve been glossing over is that mergesort can keep just the
input items in its heap; heap/selection sort also needs to keep their
indices in the input, so that it can write them into the bitmap at the
end, which means that its heap can store fewer items in the same
number of bytes. You could keep *only* the indices, but then
comparisons between records in the heap require going out to read keys
from the WORM, which would be a substantial slowdown.

These threshold numbers (50, 100, 1000, 250, 125000) depend on the
particular characteristics of your CPU, RAM, and WORM. A sufficiently
high cost of space allocation will drive up the threshold at which the
heap/selection hybrid is cheaper than mergesort.  A WORM that doesn’t
provide a reasonably efficient way to mark input items as copied to
output (setting a byte, at least) will make heap/selection sort slower
than mergesort for even small sizes of input, since it could more than
double its output size. Large seek times for reads could counteract
this by making fifty-way merges impractically slow, forcing mergesort
to buffer more and use more passes.  Heap/selection sort is even more
brutally sequential in its memory access patterns than a many-way
mergesort.

Lazy sorting
------------

In a lot of cases it’s probably adequate to do a merge every time you
need to access the supposedly-sorted data, rather than writing an
entire new sorted copy of it; this is the standard “side file”
technique from transaction processing. Lucene more or less uses this
approach.
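
For example, here’s a sketch with illustrative names: keep the bulk of
the data as one big sorted run in the WORM plus a small,
recently-appended side file, and have lookups consult both; the
expensive rewrite into a single sorted run happens only when the side
file has grown enough to make it worthwhile.

    /* Illustrative layout: base[] is a big sorted run in the WORM;
       side[] is a small unsorted side file of recent additions. */
    extern const int *base;  extern unsigned base_len;
    extern const int *side;  extern unsigned side_len;

    /* Membership test that never rewrites base[]: binary-search the
       sorted part, then scan the small side file. */
    int contains(int key) {
      unsigned lo = 0, hi = base_len;
      while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        if (base[mid] < key) lo = mid + 1;
        else if (base[mid] > key) hi = mid;
        else return 1;
      }
      for (unsigned i = 0; i < side_len; i++)
        if (side[i] == key) return 1;
      return 0;
    }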

[0] This phrase isn’t here to make you feel dumb if you don’t know
this stuff. This is here to make sure that you know I don’t think I’m
smart for knowing it.
