Suppose you have a small, fast, random-access working memory and a
large memory that is fast for sequential access but slow for random
access. These days, the latency-bandwidth product seems to be around
40 bytes for SDRAM, around 50 kilobytes for Intel SSDs, and around
500 kilobytes for traditional hard disks. I have no
idea what the figure is for modern tape drives, but it must be in the
gigabytes. So all of these fall into this “large memory” category to
some extent.

MapReduce works by converting the records of an input file, in a
data-parallel fashion, into key-value intermediate tuples, and then
bringing together the groups of intermediate tuples with identical
keys for further processing to produce an output file. The point of
this structure is to exploit the parallelism of warehouse-scale
computers.
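In outline (a minimal single-machine sketch in Python; word counting
is just the traditional example, and the grouping here is done with
an in-memory sort, which is exactly the step the rest of this note is
about doing without random access):

    import itertools

    def map_phase(records, mapf):
        # Convert each input record into zero or more (key, value)
        # intermediate tuples, in whatever order the records arrive.
        for record in records:
            for kv in mapf(record):
                yield kv

    def reduce_phase(tuples, reducef):
        # Bring together groups of tuples with identical keys and
        # hand each group to the reduce function.
        for key, group in itertools.groupby(sorted(tuples),
                                            key=lambda kv: kv[0]):
            yield reducef(key, [v for _, v in group])

    mapf = lambda line: ((word, 1) for word in line.split())
    reducef = lambda word, counts: (word, sum(counts))
    print(list(reduce_phase(map_phase(["to be or not to be"], mapf),
                            reducef)))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]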

But for pocket-scale or shoebox-scale computers, the above-mentioned
latency of memory access is a problem, and will become a bigger
problem as time goes on. The MapReduce paradigm contains a solution to
that, too.

The mapping phase can consume input records in any order, such as
sequential order, and sequentially write intermediate tuples to the
large memory in sorted runs. The standard mergesort-on-tape trick,
replacement selection (keep streaming intermediate tuples out to the
current run for as long as something in working memory can still
extend it), gives runs that are, on average, twice the size of the
working memory.
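
A sketch of that trick in Python; the heap stands in for the working
memory, `capacity` is how many records it holds, and records are
assumed to be comparable:

    import heapq, itertools

    def sorted_runs(records, capacity):
        # Tag each record with the number of the run it belongs to;
        # the heap then always surfaces the smallest record that can
        # still extend the current run.
        records = iter(records)
        heap = [(0, r) for r in itertools.islice(records, capacity)]
        heapq.heapify(heap)
        run_id, run = 0, []
        while heap:
            rid, r = heapq.heappop(heap)
            if rid != run_id:          # current run can't be extended
                yield run
                run_id, run = rid, []
            run.append(r)
            nxt = next(records, None)
            if nxt is not None:
                # A record smaller than the one just written must
                # wait for the next run; anything else can still go
                # out in this one.
                heapq.heappush(heap,
                               (run_id if nxt >= r else run_id + 1,
                                nxt))
        if run:
            yield run

    print(list(sorted_runs([5, 1, 4, 2, 8, 3, 7, 6], 3)))
    # [[1, 2, 4, 5, 7, 8], [3, 6]] -- first run is 2x the capacity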

Then the reducing phase can read sequentially from each of the sorted
runs simultaneously, producing its output sequentially as well. In
the worst case, doing this efficiently requires as many input buffers
as there are sorted runs, each buffer of a size comparable to the
latency-bandwidth product.
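
A sketch of that phase too; heapq.merge does the k-way merge, holding
one tuple per run in memory beyond the buffers themselves, and the
one-tab-separated-tuple-per-line run format here is just an
assumption for concreteness, not part of the technique:

    import heapq, itertools

    def read_run(path, B):
        # Sequential reads through a B-byte buffer.
        with open(path, "r", buffering=B) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("\t")
                yield key, value

    def reducing_phase(run_paths, reducef, B=512):
        # One buffered sequential reader per sorted run; output is
        # produced sequentially in key order.
        merged = heapq.merge(*(read_run(p, B) for p in run_paths))
        for key, group in itertools.groupby(merged,
                                            key=lambda kv: kv[0]):
            yield reducef(key, [v for _, v in group])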

This approach can be applied to any algorithm in which it is possible
during the first pass to determine the order in which the data must be
accessed to produce the final result.

It permits a reasonably efficient two-pass implementation of the
algorithm for intermediate data up to 2S² / B in size on average,
where S is the working memory size available for I/O buffering, and B
is the size needed for a reasonably efficient input buffer. So if S is
40kB and B is 500 bytes, this limit is 6.4MB. The worst-case limit is
half that.
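
The reasoning: the final merge can afford at most S/B input buffers,
hence at most S/B runs, and each run averages 2S bytes, so the total
is (S/B) · 2S = 2S²/B. Numerically:

    S, B = 40_000, 500
    fan_in = S // B         # at most 80 runs can be merged at once
    limit = fan_in * 2 * S  # each run averages 2S bytes
    print(limit)            # 6400000, i.e. 6.4MB = 2*S**2/B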

An intermediate merging phase to produce larger sorted runs from
smaller sorted runs can increase this limit by a factor of about
S/B. So, for example, if B is 256 bytes, then you can handle 10MiB of
intermediate tuples with a single intermediate merging phase if
2S³/B² = 2S³/(65536 bytes²) = 10MiB, which occurs at S ≈ 7KiB:
initially producing about 730 runs of 14KiB, merged in 27-way merges
into 27 runs of about 378KiB each, which are consumed in the final
reducing phase.
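
Checking those figures (this just inverts 2S³/B² = D for the stated D
and B; the numbers come out slightly rough because everything is
rounded to convenient sizes):

    D, B = 10 * 2**20, 256            # 10MiB of tuples, 256B buffers
    S = (D * B * B / 2) ** (1 / 3)    # about 7002 bytes, i.e. ~7KiB
    fan_in = int(S // B)              # 27-way merges
    initial_run = 2 * S               # ~14,000 bytes per initial run
    merged_run = fan_in * initial_run # ~378,000 bytes after one pass
    print(S, fan_in, initial_run, merged_run)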

On hardware where the full latency is not incurred for read operations
that are sequential with the previous ones, very small input buffers
can be used, greatly increasing the capacity of this approach.

Are there any inexpensive SPI flash or RAM chips out there that are
larger than a megabyte or so and that support writing data at several
megabytes per second? The only SPI memory chips I can find over a
megabyte are flash, and they all seem to be able to write at less than
a megabyte per second. E.g. the Atmel AT25DF081 costs US$1.61 in
quantity 1 and is a megabyte, but its t_PP to write 256 bytes is
1-5ms, which means 51kB to 256kB per second, even though the read
speed should be around 8 MB/s. If there aren’t any suitable flash
chips out there, then this approach may not be particularly useful for
microcontroller-based systems — if you can afford even slow SDRAM,
it’s far and away faster than the μC, so you can treat it as
random-access. (Although if the μC doesn’t have a built-in memory
controller, getting any data to or from the SDRAM may be slow.)

The approach definitely works and is useful for disk-based systems; I
use a form of it for search engine indices that total about 20GB. I
suspect it is useful for SSD-based systems too.

Is it useful for doing stuff in RAM too? I don’t know. The
latency-bandwidth product on my Celeron E1200 desktop machine seems
to be around 40 bytes (if I did my benchmarking right: 700MB/s, with
a latency of 50ns for a random read-and-write to a byte), which might
be too small to make such expensive optimization strategies
worthwhile.

(I’m pretty sure there are faster flash chips out there. My SanDisk
16GB Cruzer reads at about 23 MB/s and writes at 5MB/s. Some new SD
cards from 2009, which are uniformly one- or two-chip devices, are
“class 10”, meaning they are guaranteed to sustain 10MB/s of write
speed. An easy solution to the microcontroller external memory problem
would be to hook up an SD card to the μC and use it in SPI mode, but
new SD cards are expensive.)
