I mentioned a bunch of stuff in that prefix compression email about cache
lines, prefetching, trie node sizes, etc.  The gist of it all is that memory
has become relatively slow, to the point where you need to start thinking
about it the same way we think about disk and network.

I dug up and cleaned up some benchmarks I wrote a while back to illustrate
how designing around the caches can make a huge difference.  Try it out:

http://pastebin.com/8PS5Hxpt

It creates pools of bytes in memory, starting at 8 bytes and going up to 256
megabytes.  Alongside each pool it also creates an array of indexes (the
accessOrder[]) specifying the order in which to fetch bytes from the pool.
For each pool size, it generates a sequential list of accesses and a random
list of accesses.  It then fetches 100 million bytes from the pool and times
how long it takes.
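
In case the link ever rots, here's a minimal sketch of the same idea.  The
class and helper names are mine, not the actual pastebin code, and details
like the shuffle seed and output format are arbitrary:

    import java.util.Random;

    // Sketch of the cache benchmark: for each pool size, build an
    // accessOrder[] of indexes, time 100 million fetches in sequential
    // order, shuffle the same indexes, and time 100 million fetches in
    // random order.  Covering the full range up to a 256MB pool needs
    // roughly a 2GB heap for the index array.
    public class CacheBench {

        static final int FETCHES = 100_000_000;
        static volatile long SINK;  // consumed so the JIT can't elide reads

        public static void main(String[] args) {
            System.out.println(
                "bytes fetched per microsecond (megabytes per second)");
            System.out.printf("%9s %11s %11s%n",
                "pool size", "sequential", "random");
            for (int size = 8; size <= 256 << 20; size *= 2) {
                byte[] pool = new byte[size];
                int[] order = new int[size];
                for (int i = 0; i < size; i++) order[i] = i; // sequential
                long seq = rate(pool, order);
                shuffle(order, new Random(42)); // same indexes, random order
                long rnd = rate(pool, order);
                System.out.printf("%8dB %11d %11d%n", size, seq, rnd);
            }
        }

        // Fetches FETCHES bytes from pool in the given order, wrapping
        // around as needed; returns the rate in bytes per microsecond.
        static long rate(byte[] pool, int[] order) {
            long sink = 0;
            long start = System.nanoTime();
            int done = 0;
            while (done < FETCHES) {
                int n = Math.min(order.length, FETCHES - done);
                for (int i = 0; i < n; i++) sink += pool[order[i]];
                done += n;
            }
            long micros = Math.max((System.nanoTime() - start) / 1000, 1);
            SINK += sink;
            return FETCHES / micros;
        }

        // Fisher-Yates shuffle of the index array.
        static void shuffle(int[] a, Random rnd) {
            for (int i = a.length - 1; i > 0; i--) {
                int j = rnd.nextInt(i + 1);
                int t = a[i]; a[i] = a[j]; a[j] = t;
            }
        }
    }

Shuffling the same index array used for the sequential pass guarantees both
runs touch exactly the same bytes; only the access order differs.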

These results are from an Intel i7-2600 at 3.4GHz.  Each line of output
below shows the pool size, the sequential access rate, then the random
access rate:

bytes fetched per microsecond (megabytes per second)
pool size  sequential      random
       8B        1295        1276
      16B        1461        1355
      32B        1404        1396
      64B        1319        1380
     128B        1440        1375
     256B        1449        1425
     512B        1363        1394
       1K        1431        1354
       2K        1414        1404
       4K        1363        1383
       8K        1307        1399
      16K        1440        1417
      32K        1408        1324
      64K        1408        1230
     128K        1385        1108
     256K        1406         822
     512K        1360         583
       1M        1476         516
       2M        1354         444
       4M        1346         262
       8M        1343         165
      16M        1392         120
      32M        1357         109
      64M        1337          94
     128M        1348          78
     256M        1324          68


You can see that while the pool is under ~128K, sequential and random
accesses both run at roughly 1.3-1.4 GB/s.  However, once the memory pool
grows beyond L2 cache size the sequential rate barely changes, while the
random rate falls all the way down to ~68 MB/s.  The benchmark accentuates
the worst case scenario, but we're talking about a 20x difference (1324 vs
68 bytes/microsecond for the 256M pool).  20% performance differences
usually warrant some attention, so I'd say a 20x difference is nothing to
scoff at!

Don't pay attention to how quickly it prints the output, since most of the
wall-clock time is spent generating the random accessOrder arrays rather
than fetching.  To verify that GC isn't interfering, run with GC logging on
using these options: -XX:+PrintGCDetails -Xms1g -Xmx1g
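
For example, assuming the pastebin code compiles to a class named CacheBench
(the actual class name may differ):

    java -XX:+PrintGCDetails -Xms1g -Xmx1g CacheBench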

HBase is currently a powerful database management system for huge disk-based
workloads, but it has a (pretty) clean slate and a lot of potential to
become a powerful system for memory-resident workloads as well.

Matt
