Hi,

Over on http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html I saw this bit:

"The most important factor is that HBase is not restarted frequently and that it performs house keeping on a regular basis. These so called compactions rewrite files as new data is added over time. All files in HDFS once written are immutable (for all sorts of reasons). Because of that, data is written into new files and as their number grows HBase compacts them into another set of new, consolidated files. And here is the kicker: HDFS is smart enough to put the data where it is needed!"

... and I always wondered what this does to the OS cache. In some applications (non-HBase stuff, say full-text search), the OS cache plays a crucial role in how the system performs. If you have to hit the disk too much, you're in trouble, so one thing you avoid is making big changes to index files on disk, because that invalidates data the OS has nicely cached.

However, with HBase, and especially with major compactions, what happens to the OS cache? All gone, right? Do people find this problematic? Or does the OS cache not play such a significant role in systems running HBase, simply because the data HBase holds and needs to access is much bigger than the OS cache could ever be, so even with the OS cache full and hot, other data would still have to be read from disk anyway?

Thanks,
Otis
