I mentioned a bunch of stuff in that prefix compression email about cache lines, prefetching, trie node sizes, etc... The gist of it all is that memory has become relatively slow to the point where you need to start thinking of it in similar ways as we think of disk/network.
I dug up and cleaned up some benchmarks i wrote a while back to illustrate how designing around the caches can make a huge difference. Try it out: http://pastebin.com/8PS5Hxpt It creates pools of bytes in memory starting with 8 bytes, and going up to 256 megabytes. Alongside each pool it also creates an array of indexes (the accessOrder[]) specifying the order in which to fetch bytes from the pool. For each pool size, it generates a sequential list of accesses and a random list of accesses. It then fetches 100 million bytes from memory and times how long it takes. These results are from an Intel i7-2600 at 3.4GHz. Each line of output below shows the pool size, the sequential access rate, then the random access rate: bytes fetched per microsecond (megabytes per second) pool size sorted random 8B 1295 1276 16B 1461 1355 32B 1404 1396 64B 1319 1380 128B 1440 1375 256B 1449 1425 512B 1363 1394 1K 1431 1354 2K 1414 1404 4K 1363 1383 8K 1307 1399 16K 1440 1417 32K 1408 1324 64K 1408 1230 128K 1385 1108 256K 1406 822 512K 1360 583 1M 1476 516 2M 1354 444 4M 1346 262 8M 1343 165 16M 1392 120 32M 1357 109 64M 1337 94 128M 1348 78 256M 1324 68 You can see that while the pool is under ~128K, the sequential and random accesses both happen at ~1.3 GB/s. However, when the memory pool grows beyond L2 cache size the sequential access barely changes, while the random access falls all the way down to ~68MB/s. It's accentuating the worst case scenario, but we're talking about a 20x difference. 20% performance differences usually warrant some attention, so I'd say 20x difference is nothing to scoff at!! Don't pay attention to the speed at which it prints the output since most of the time is spent generating the random accessOrder array. To run with GC logging on to verify that it's not interfering, try these options: -XX:+PrintGCDetails -Xms1g -Xmx1g HBase is currently a powerful database management system for huge disk based workloads, but it has a (pretty) clean slate and a lot of potential to be a powerful system for memory resident workloads as well. Matt