Hello all, I am currently testing an MR job that performs a naive self-join, either on an HBase table or on a plain Hadoop file.
My finding is that the HBase table has a huge overhead compared to the Hadoop file when doing this join: as much as 50 times slower. The map function uses the column I wish to join on as the key and the remaining columns as the value; the reducer just combines all of the values for a key into a single output row. I used 1 million rows for the join, with 28 mappers and 21 reducers. All in all I have 7 nodes. A rough sketch of the mapper and reducer is below.
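To make the setup concrete, this is roughly what the job looks like. The column family and qualifier names here ("cf", "join_col", "data") are just placeholders for the sketch; my real schema is more involved.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits the join column as the key and the remaining data as the value.
public class JoinMapper extends TableMapper<Text, Text> {
    // Placeholder family/qualifier names for illustration only.
    private static final byte[] FAMILY   = Bytes.toBytes("cf");
    private static final byte[] JOIN_COL = Bytes.toBytes("join_col");
    private static final byte[] DATA_COL = Bytes.toBytes("data");

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
            throws IOException, InterruptedException {
        byte[] joinValue = result.getValue(FAMILY, JOIN_COL);
        byte[] rest      = result.getValue(FAMILY, DATA_COL);
        if (joinValue != null && rest != null) {
            context.write(new Text(joinValue), new Text(rest));
        }
    }
}

// Reducer: concatenates all values sharing a join key into a single row.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder combined = new StringBuilder();
        for (Text v : values) {
            if (combined.length() > 0) combined.append('\t');
            combined.append(v.toString());
        }
        context.write(key, new Text(combined.toString()));
    }
}

The HBase variant of the job is wired up with TableMapReduceUtil.initTableMapperJob over a full-table Scan; the file variant is the usual TextInputFormat job with an equivalent mapper.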
My question is this: should there be such a big overhead (a 50x multiplier) when using HBase instead of a plain Hadoop file? Thanks, Eran.