Hi Ryan,

>> 2. mapreduce.HFileInputFormat
>>
>> MR library to read data directly from HFiles. (Roughly 2.5 times faster than
>> TableInputFormat in my tests)
>>
>> Current status: Completed a proof-of-concept prototype and measured
>> performance.

> On Jan 23, 2011, Ryan Rawson wrote:
>> #2 is interesting, what is the benefit? How did you measure said benefit?

I have only performed simplified tests so far: a single test thread on a
single server. It was not even an MR job, just a simple program that scans
through all the rows in the table. I'll definitely need deeper tests in a
clustered environment to get more realistic results.

The related test programs can be found here (V1 is the one):
https://github.com/tatsuya6502/hbase-mr-pof

And the chart comparing throughput on RS, HFileInputFormat and HDFS
SequenceFile:
http://github.com/tatsuya6502/hbase-mr-pof/raw/master/docs/performance_comparison_0821_2010.pdf

Please note: The disk drive attached to the EC2 instance was slow, so for
this particular test I used a table small enough for the whole contents of
the files to fit in Linux's disk read cache, ran each test twice, and
recorded only the second result. (I restarted the RS between the first and
second runs to clear its block cache.)

One interesting thing I saw in the results was that HDFS SequenceFile didn't
scale well in my environment. SequenceFile needed more processor power than
HFile and was bottlenecked on the CPU. CPU utilization was about 100% for
SequenceFile and about 30% for HFile throughout the tests.

- Tatsuya

--
Tatsuya Kawano
Tokyo, Japan
