By default, TableInputFormat schedules one mapper for every table region, not one per region server.

In addition to what others have said already: if your reducer is doing little more than storing data back into HBase (via TableOutputFormat), consider writing results back to HBase directly from the mapper. That avoids the overhead of the sort/shuffle/merge that happens within the Hadoop job framework as map outputs are fed into reducers. For that type of use case (using the Hadoop MapReduce subsystem as essentially a grid scheduler), something like job.setNumReduceTasks(0) will do the trick.

Best regards,

- Andy

________________________________
From: john smith <js1987.sm...@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 12:42:36 AM
Subject: Doubt in HBase

Hi all,

I have one small doubt. Kindly answer it even if it sounds silly.

I am using MapReduce with HBase in distributed mode. I have a table which spans 5 region servers. I am using TableInputFormat to read the data from the table in the map phase. When I run the program, how many map tasks are created by default? Is it one per region server, or more?

Also, after the map phase is over, the reduce phase takes a bit more time. Is that due to moving the map output across the region servers, i.e., moving the values of the same key to a particular reducer before it can start? Is there any way I can optimize the code (e.g. by storing the data for the same reducer nearby)?

Thanks :)
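
For illustration, the map-only approach suggested in the reply could look roughly like the sketch below. It is not code from the thread: it assumes a recent HBase client (the org.apache.hadoop.hbase.mapreduce API with Put.addColumn), and releases from the time of this message differ in places. The class name, the column family "cf", and the qualifiers "in" and "out" are hypothetical, and the mapper simply copies one cell as a stand-in for real per-row work.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyHBaseJob {

    // Hypothetical schema, for illustration only.
    static final byte[] FAMILY = Bytes.toBytes("cf");
    static final byte[] IN_QUALIFIER = Bytes.toBytes("in");
    static final byte[] OUT_QUALIFIER = Bytes.toBytes("out");

    // One mapper instance runs per input split, i.e. per table region.
    public static class DirectWriteMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] in = value.getValue(FAMILY, IN_QUALIFIER);
            if (in == null) {
                return; // nothing to write for this row
            }
            // Placeholder "work": copy the input cell to the output qualifier.
            Put put = new Put(value.getRow());
            put.addColumn(FAMILY, OUT_QUALIFIER, in);
            // With zero reducers, this goes straight to TableOutputFormat.
            context.write(row, put);
        }
    }

    public static Job createJob(Configuration conf, String sourceTable, String targetTable)
            throws IOException {
        Job job = Job.getInstance(conf, "map-only HBase write");
        job.setJarByClass(MapOnlyHBaseJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // fetch rows in batches during the scan
        scan.setCacheBlocks(false); // avoid churning the block cache on a full scan

        // Read from sourceTable through TableInputFormat.
        TableMapReduceUtil.initTableMapperJob(
                sourceTable, scan, DirectWriteMapper.class, null, null, job);
        // Configure TableOutputFormat for targetTable; no reducer class is set.
        TableMapReduceUtil.initTableReducerJob(targetTable, null, job);
        // Zero reduce tasks: map output skips sort/shuffle/merge entirely.
        job.setNumReduceTasks(0);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Job job = createJob(HBaseConfiguration.create(), args[0], args[1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This only makes sense when the reducer would not need to see all values for a key together; if the job really aggregates across rows, the shuffle is doing necessary work and the reducer has to stay.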