Suppose you have a small, fast, random-access working memory and a large memory that is fast for sequential access but slow for random access. These days, the latency-bandwidth product seems to be around 40 bytes for SDRAM, around 50 kilobytes for Intel SSDs, and around 500 kilobytes for traditional hard disks. I have no idea what the figure is for modern tape drives, but it must be in the gigabytes. So all of these fall into this “large memory” category to some extent.
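Those products can be reproduced from round latency and bandwidth figures for each technology. The numbers below are illustrative assumptions chosen to be consistent with the products quoted above, not measurements:

```python
# Rough latency-bandwidth products for a few memory technologies.
# The latency and bandwidth figures are assumed round numbers for
# illustration, not measurements of any particular device.
techs = {
    #              latency (s)  bandwidth (bytes/s)
    "SDRAM":       (50e-9,      800e6),
    "Intel SSD":   (100e-6,     500e6),
    "hard disk":   (8e-3,       60e6),
}
for name, (latency, bandwidth) in techs.items():
    print("%-10s %8.0f bytes" % (name, latency * bandwidth))
```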
MapReduce takes advantage of the parallelism of warehouse-scale computers by converting the records of an input file, in a data-parallel fashion, into intermediate key-value tuples, and then bringing together groups of intermediate tuples with identical keys for further processing to produce an output file. But for pocket-scale or shoebox-scale computers, the memory-access latency mentioned above is the problem, and it will become a bigger problem as time goes on. The MapReduce paradigm contains a solution to that, too.

The mapping phase can consume input records in any order, such as sequential order, and sequentially write intermediate tuples to the large memory in sorted runs. The standard mergesort-on-tape trick of continuing to stream out intermediate tuples to the existing run as long as possible (replacement selection) will give runs, on average, twice the size of the working memory. Then the reducing phase can read sequentially from all of the sorted runs simultaneously, producing output sequentially. In the worst case, doing this efficiently requires as many input buffers as there are sorted runs, each of a size comparable to the latency-bandwidth product.

This approach can be applied to any algorithm in which it is possible, during the first pass, to determine the order in which the data must be accessed to produce the final result. It permits a reasonably efficient two-pass implementation of the algorithm for intermediate data up to 2S²/B in size on average, where S is the working memory size available for I/O buffering and B is the size needed for a reasonably efficient input buffer: the final pass can afford S/B input buffers, so it can merge S/B runs, and each run averages 2S bytes. So if S is 40kB and B is 500 bytes, this limit is 6.4MB. The worst-case limit is half that. An intermediate merging phase, producing larger sorted runs from smaller sorted runs, can increase this limit by a factor of about S/B.
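Here is a toy sketch of the two passes in Python, with small in-memory lists standing in for the large memory and `memory_size` playing the role of S, measured in records rather than bytes. The `sorted_runs`/`merged` names and the heap-based formulation of replacement selection are my own illustration:

```python
# Sketch of the two-pass external sort: replacement selection
# produces sorted runs averaging twice the working-memory size,
# and a final k-way merge streams them back together.
import heapq

def sorted_runs(records, memory_size):
    """Replacement selection: yield sorted runs from `records`."""
    it = iter(records)
    # Fill working memory; each heap entry is (run number, key).
    heap = [(0, x) for _, x in zip(range(memory_size), it)]
    heapq.heapify(heap)
    run_number, run = 0, []
    while heap:
        r, x = heapq.heappop(heap)
        if r != run_number:        # the heap has moved on to the next run
            yield run
            run_number, run = r, []
        run.append(x)
        for y in it:               # refill working memory from the input
            # If y can still extend the current run, it joins it;
            # otherwise it must wait for the next run.
            heapq.heappush(heap, (r if y >= x else r + 1, y))
            break
    if run:
        yield run

def merged(runs):
    """Final pass: k-way merge of the sorted runs, read sequentially."""
    return heapq.merge(*runs)

data = [5, 1, 9, 2, 8, 3, 7, 4, 6, 0, 5, 1]
runs = list(sorted_runs(data, 4))
print(runs)                 # a few runs averaging ~2*memory_size items
print(list(merged(runs)))   # the fully sorted output
```

A real implementation would of course measure S in bytes, write each run sequentially to the slow memory as it is produced, and give each run a B-sized input buffer during the merge.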
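Plugging in the numbers, here is a quick check of the 2S²/B two-pass limit and of the three-pass sizing used in the 10MiB example below:

```python
# A quick check of the capacity arithmetic.  S is the working-memory
# size available for I/O buffering and B the size of one input buffer,
# using the figures from the text.

# Two passes: the final pass affords S/B input buffers, so it can
# merge S/B runs of ~2S bytes each, for 2*S**2/B bytes in total.
S, B = 40000, 500
assert 2 * S**2 // B == 6400000            # the 6.4MB limit

# Three passes (one intermediate merge) gain another factor of ~S/B,
# for 2*S**3/B**2.  Solve for S with B = 256 bytes and a 10MiB target:
B = 256
target = 10 * 2**20
S = (target * B**2 / 2) ** (1 / 3)
print("S of about %.0f bytes (roughly 7KiB)" % S)
print("initial runs of about %.0f KiB" % (2 * S / 1024))  # ~14KiB
print("merge fan-in of about %.0f" % (S / B))             # ~27
```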
So, for example, if B is 256 bytes, then you can handle 10MiB of intermediate tuples with a single intermediate merging phase when 2S³/B² = 10MiB, which occurs at S ≈ 7KiB: initially producing runs of 14KiB, which are merged in 27-way merges into 28 runs of about 378KiB each, which are consumed in the final reducing phase. On hardware where the full latency is not incurred for read operations that are sequential with the previous ones, very small input buffers can be used, greatly increasing the capacity of this approach.

Are there any inexpensive SPI flash or RAM chips out there that are larger than a megabyte or so and that support writing data at several megabytes per second? The only SPI memory chips I can find over a megabyte are flash, and they all seem to write at less than a megabyte per second. E.g. the Atmel AT25DF081 costs US$1.61 in quantity 1 and holds a megabyte, but its t_PP to program a 256-byte page is 1-5ms, which works out to 51kB to 256kB per second, even though the read speed should be around 8MB/s. If there aren’t any suitable flash chips out there, then this approach may not be particularly useful for microcontroller-based systems — if you can afford even slow SDRAM, it’s far and away faster than the μC, so you can treat it as random-access. (Although if the μC doesn’t have a built-in memory controller, getting any data to or from the SDRAM may be slow.)

The approach definitely works and is useful for disk-based systems; I use a form of it for search-engine indices that total about 20GB. I suspect it is useful for SSD-based systems too. Is it useful for doing stuff in RAM too? I don’t know. The latency-bandwidth product on my Celeron E1200 desktop machine seems to be around 40 bytes (if I did my benchmarking right, it’s 700MB/s with a latency of 50ns for a random read-and-write of a byte), which might be too small to make such expensive optimization strategies worthwhile. (I’m pretty sure there are faster flash chips out there.
My SanDisk 16GB Cruzer reads at about 23MB/s and writes at 5MB/s. Some new SD cards from 2009, which are uniformly one- or two-chip devices, are “class 10”, meaning they are guaranteed to sustain 10MB/s of write speed. An easy solution to the microcontroller external-memory problem would be to hook an SD card up to the μC and use it in SPI mode, but new SD cards are expensive.)

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol