Hello folks,
I am using a column index on top of a table with a single column family and 16
columns per row, and I get about 40% better performance when using the
IndexedScanner as compared with the normal Scanner. This happens when
experimenting directly on HBase.
However, if I run the same experiment using map-reduce tasks I do not get any
benefit from using indexes (it is actually even 2x worse). In my implementation
I am extending TableInputFormat, where I customize the ResultScanner to be an
IndexedScanner on my input table. Currently, I don't set "indexStartRow" and
"indexStopRow" in the IndexedScanner, so I suspect that this could be the
reason for such performance. The issue here is how to translate between the
table splits of the input table and the splits of the index table if I don't
want to scan through all the column values of the input table. Moreover, this
translation won't result in a contiguous range. Or should I go the other way
around and have the index table as the input?
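
To illustrate the non-contiguity concern, here is a minimal sketch in plain
Java. It does not use any HBase API: the class IndexSplitSketch, its methods,
and the sample rows are all hypothetical and made up for illustration, and the
index layout (column value followed by the row key) is only an assumption
about how such an index could be keyed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexSplitSketch {
    // Hypothetical index: key = indexed column value + "/" + base row key,
    // value = base row key. Sorted by column value, like an index table.
    static TreeMap<String, String> buildIndex(Map<String, String> baseTable) {
        TreeMap<String, String> index = new TreeMap<>();
        for (Map.Entry<String, String> e : baseTable.entrySet()) {
            // Appending the row key keeps entries with equal values distinct.
            index.put(e.getValue() + "/" + e.getKey(), e.getKey());
        }
        return index;
    }

    // Base row keys whose column value falls in [startValue, stopValue).
    static List<String> baseRowsFor(TreeMap<String, String> index,
                                    String startValue, String stopValue) {
        return new ArrayList<>(
            index.subMap(startValue, true, stopValue, false).values());
    }

    public static void main(String[] args) {
        // Made-up base table: row key -> value of the indexed column.
        TreeMap<String, String> base = new TreeMap<>();
        base.put("row1", "m");
        base.put("row2", "a");
        base.put("row3", "z");
        base.put("row4", "b");
        base.put("row5", "q");

        TreeMap<String, String> index = buildIndex(base);
        // A contiguous index range maps to scattered base rows.
        System.out.println(baseRowsFor(index, "a", "c")); // prints [row2, row4]
    }
}
```

A contiguous range in the index ("a" up to "c") comes back as row2 and row4,
which are not adjacent in the input table, so a split of the input table does
not correspond to a single start/stop range in the index.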
Thank you for your help,
Adrian