Hi all,

Can anyone elaborate on the pitfalls or implications of running MapReduce using an HFileInputFormat that extends FileInputFormat?
I'm sure scanning goes through the RegionServer for good reasons (I'm guessing: handling splits, locking, RS monitoring, etc.), but can it ever be "safe" to run MR over HFiles directly? For example, during a region split, would the MR job just read stale data, or would _bad_things_happen_? For our use cases we could tolerate stale data and the occasional MR failure when a node drops out, and if we could detect a region split we could suspend MR jobs on the affected HFiles until the split finishes. We don't anticipate huge daily growth, but we do expect a lot of scanning and random access.

I knocked up a quick example porting the Scala version of HFIF [1] to Java [2], and full data scans appear to be an order of magnitude quicker (30 mins -> 3 mins), but I suspect this is *fraught* with dangers. If not, I'd like to try and take this further, possibly with Hive.

Thanks,
Tim

[1] https://gist.github.com/1120311
[2] http://pastebin.com/e5qeKgAd
