Indexes in MR tasks

Adrian Popescu Fri, 26 Feb 2010 11:27:18 -0800

Hello folks,

I am using a column index on top of a table with a single column family and 16 
columns per row, and I get about 40% better performance when using the 
IndexedScanner as compared with the normal Scanner. This happens when 
experimenting directly on HBase.


However, if I run the same experiment using map-reduce tasks, I do not get any 
benefit from the index (it is actually even 2x worse). In my implementation I 
extend TableInputFormat and customize the ResultScanner to be an 
IndexedScanner on my input table. Currently, I don't set "indexStartRow" 
and "indexStopRow" on the IndexedScanner, so I suspect this could be the reason 
for the poor performance. The issue is how to translate the table splits of 
the input table into splits of the index table without scanning through all 
the column values of the input table. Moreover, this translation won't result 
in a contiguous range. Or should I go the other way around and use the index 
table as the input?
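
To make the non-contiguity concern concrete, here is a toy sketch in plain Java. It does not use any HBase API: the TreeMaps merely stand in for the sorted base and index tables, and the index key layout (indexed value + "/" + base row key), the class name, and the method names are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexSplitSketch {

    // Builds a toy index sorted by indexed value.
    // Assumed key layout: indexedValue + "/" + baseRowKey -> baseRowKey.
    static TreeMap<String, String> buildIndex(Map<String, String> baseTable) {
        TreeMap<String, String> index = new TreeMap<>();
        for (Map.Entry<String, String> e : baseTable.entrySet()) {
            index.put(e.getValue() + "/" + e.getKey(), e.getKey());
        }
        return index;
    }

    // Returns the 0-based positions, in index order, of the index entries
    // whose base row falls inside the base-table split [startRow, stopRow).
    static List<Integer> indexPositionsForSplit(TreeMap<String, String> index,
                                                String startRow, String stopRow) {
        List<Integer> positions = new ArrayList<>();
        int pos = 0;
        for (String baseRow : index.values()) {
            if (baseRow.compareTo(startRow) >= 0 && baseRow.compareTo(stopRow) < 0) {
                positions.add(pos);
            }
            pos++;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Base table: row key -> value of the indexed column.
        TreeMap<String, String> base = new TreeMap<>();
        base.put("row1", "d");
        base.put("row2", "a");
        base.put("row3", "c");
        base.put("row4", "b");

        TreeMap<String, String> index = buildIndex(base);

        // The split [row1, row3) is contiguous in the base table, but its
        // rows sit at opposite ends of the index.
        System.out.println(indexPositionsForSplit(index, "row1", "row3")); // prints [0, 3]
    }
}
```

In this toy data the two rows of one base-table split end up as the first and last index entries, so a single pair of index start/stop rows cannot cover the split without also covering rows that belong to other splits.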

Thank you for your help,
Adrian
