Hello, I have a MapReduce application reading from an existing HBase table. The map function searches for certain values in the table and the reduce function averages them.
My question is simple.

**Method 1:** I initially wrote the program passing the map function an input key of type ImmutableBytesWritable and an input value of type RowResult. Of course I called setInputFormat(TableInputFormat.class) and set the COLUMN_LIST as well. I added a debug user counter to check how often my table was being read, and discovered (with your help as well) that the table was read N times, where N is the number of rows in the table, which was of course not acceptable. This was due to the fact that I was passing the RowResult as input to the map function: the map was invoked once per row, and since (as in both methods) each invocation created its own scanner over the table, the whole table was scanned once per row.

**Method 2:** I decided not to pass the RowResult as the input to the map. Instead I passed a Text, which in fact I do not use at all in the map function; I used it only to pass something so that Hadoop does not give me an error :). Then, as in the first method, I created a scanner on the HBase table inside the map function and started reading the rows. With this solution, once I no longer passed the RowResult as a parameter to the mapper, the job was much faster and the table was read only once! Perfect!

**Questions:**
- Are there any hidden performance issues or complications behind my Method 2?
- It is true that I have reached a working solution, but I am wondering if I can do it in a cleaner way: could I somehow skip passing an input key and input value to the map at all? If yes, how?

Regards,
CJ
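P.S. In case it helps, here is a stripped-down skeleton of what I mean by the Method 1 wiring, targeting the old (pre-0.20) org.apache.hadoop.hbase.mapred API. The table name `mytable`, the column `cf:value`, the class names `AverageJob`/`MyMap`/`AvgReduce`, the output path, and the single "avg" key are all placeholders for illustration, not my real code:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageJob {

  public static class MyMap extends MapReduceBase
      implements Mapper<ImmutableBytesWritable, RowResult, Text, DoubleWritable> {

    private static final byte[] COLUMN = Bytes.toBytes("cf:value");

    // Called once per row of the table; the framework already hands us the
    // row as a RowResult. (In my actual Method 1 the map body also opened
    // its own scanner here, which is why the table was scanned once per row.)
    public void map(ImmutableBytesWritable key, RowResult row,
        OutputCollector<Text, DoubleWritable> output, Reporter reporter)
        throws IOException {
      Cell cell = row.get(COLUMN);
      if (cell != null) {
        double v = Double.parseDouble(Bytes.toString(cell.getValue()));
        output.collect(new Text("avg"), new DoubleWritable(v));
      }
    }
  }

  public static class AvgReduce extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    // Averages all values collected under the same key.
    public void reduce(Text key, Iterator<DoubleWritable> values,
        OutputCollector<Text, DoubleWritable> output, Reporter reporter)
        throws IOException {
      double sum = 0;
      long count = 0;
      while (values.hasNext()) {
        sum += values.next().get();
        count++;
      }
      output.collect(key, new DoubleWritable(count == 0 ? 0 : sum / count));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(AverageJob.class);
    job.setInputFormat(TableInputFormat.class);
    job.set(TableInputFormat.COLUMN_LIST, "cf:value");
    // With TableInputFormat the table name is passed in as the input path.
    FileInputFormat.addInputPaths(job, "mytable");
    job.setMapperClass(MyMap.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(DoubleWritable.class);
    job.setReducerClass(AvgReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/avg-out"));
    JobClient.runJob(job);
  }
}
```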
