Hi Ryan, 

>> 2. mapreduce.HFileInputFormat
>> 
>> MR library to read data directly from HFiles. (Roughly 2.5 times faster than 
>> TableInputFormat in my tests)
>> 
>> Current status: Completed a proof-of-concept prototype and measured 
>> performance.


> On Jan 23, 2011, Ryan Rawson wrote:
>> #2 is interesting, what is the benefit? How did you measure said benefit?

So far I have only run simplified tests: a single test thread on a single 
server. It was not even an MR job, just a simple program that scans through all 
the rows in the table. I'll definitely need more thorough tests in a clustered 
environment to get more realistic results. 

The related test programs can be found here (V1 is the one):
https://github.com/tatsuya6502/hbase-mr-pof

And here is the chart comparing throughput via the RS (region server), 
HFileInputFormat and HDFS SequenceFile: 
http://github.com/tatsuya6502/hbase-mr-pof/raw/master/docs/performance_comparison_0821_2010.pdf


Please note: The disk drive attached to the EC2 instance was slow, so for this 
particular test I used a table small enough for the whole contents of the files 
to fit in Linux's disk read cache, ran each test twice, and recorded only the 
second result. (I restarted the RS between the first and second runs to clear 
its block cache.)
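
To make the "run twice, keep the second result" procedure concrete, here is a 
minimal sketch in plain Java. It is not the code from the repository above: the 
names WarmScanBenchmark, Measurement, timeFullScan and measureWarm are made up 
for illustration, and a generic Iterable stands in for whatever reader is being 
measured. Restarting the RS between the two passes is a manual step and is not 
shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class WarmScanBenchmark {

    /** Result of one timed pass: rows visited and elapsed wall-clock time. */
    public static final class Measurement {
        public final long rows;
        public final long elapsedNanos;

        Measurement(long rows, long elapsedNanos) {
            this.rows = rows;
            this.elapsedNanos = elapsedNanos;
        }

        /** Throughput of this pass in rows per second. */
        public double rowsPerSecond() {
            return rows * 1e9 / elapsedNanos;
        }
    }

    /** Opens a fresh source, iterates over every row once, and times the pass. */
    public static <T> Measurement timeFullScan(Supplier<? extends Iterable<T>> source) {
        long start = System.nanoTime();
        long rows = 0;
        for (T row : source.get()) {
            rows++; // a real test would also touch the row contents here
        }
        return new Measurement(rows, System.nanoTime() - start);
    }

    /** Runs the scan twice and returns only the second (warm-cache) pass. */
    public static <T> Measurement measureWarm(Supplier<? extends Iterable<T>> source) {
        timeFullScan(source);        // first pass warms the OS read cache; discarded
        return timeFullScan(source); // second pass is the one that is recorded
    }

    public static void main(String[] args) {
        // Stand-in data; in the real test this would be a scanner over the table.
        List<String> table = Arrays.asList("row1", "row2", "row3");
        Measurement m = WarmScanBenchmark.<String>measureWarm(() -> table);
        System.out.println(m.rows + " rows in " + m.elapsedNanos + " ns");
    }
}
```

The supplier is called once per pass so that each pass opens its own reader, 
which roughly mirrors running the test program twice.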

One interesting thing I saw in the results was that HDFS SequenceFile didn't 
scale well in my environment. SequenceFile needed more processor power than 
HFile and was bottlenecked by the CPU. CPU utilization was about 100% for 
SequenceFile and about 30% for HFile throughout the tests.


- Tatsuya

--
Tatsuya Kawano
Tokyo, Japan
