We overrode getSplits so that it calls super.getSplits and then, using a configuration variable (splitsPerMap), outputs another set of splits that basically merges the original splits array (start/stop row manipulation).
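For illustration, the merge step could look roughly like the sketch below. This is not our actual code: RowRange is a hypothetical stand-in for HBase's TableSplit so the example runs without HBase on the classpath, and SplitMerger and merge are made-up names. It assumes the original splits come back sorted by start row, and that taking the first region's location for the merged split is acceptable. Neither assumption is stated above.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitMerger {

    /** Hypothetical stand-in for a TableSplit: a row range plus a preferred host. */
    public static final class RowRange {
        public final String startRow;
        public final String stopRow;
        public final String location;

        public RowRange(String startRow, String stopRow, String location) {
            this.startRow = startRow;
            this.stopRow = stopRow;
            this.location = location;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + stopRow + ") @ " + location;
        }
    }

    /**
     * Collapses each run of splitsPerMap consecutive ranges into a single range
     * that starts at the first range's start row and stops at the last range's
     * stop row. Assumes the input is sorted by start row and contiguous.
     */
    public static List<RowRange> merge(List<RowRange> original, int splitsPerMap) {
        if (splitsPerMap < 1) {
            throw new IllegalArgumentException("splitsPerMap must be >= 1");
        }
        List<RowRange> merged = new ArrayList<>();
        for (int i = 0; i < original.size(); i += splitsPerMap) {
            // Index of the last range in this group (the final group may be shorter).
            int last = Math.min(i + splitsPerMap, original.size()) - 1;
            RowRange first = original.get(i);
            // Assumption: keep the first region's host as the locality hint.
            merged.add(new RowRange(first.startRow, original.get(last).stopRow, first.location));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Five made-up regions; an empty string marks the table's first/last key.
        List<RowRange> regions = List.of(
                new RowRange("", "d", "rs1"),
                new RowRange("d", "k", "rs1"),
                new RowRange("k", "r", "rs2"),
                new RowRange("r", "t", "rs2"),
                new RowRange("t", "", "rs3"));
        for (RowRange r : merge(regions, 2)) {
            System.out.println(r);
        }
    }
}
```

In a real job the same loop would sit inside the overridden getSplits, operating on the list returned by super.getSplits, with splitsPerMap read from the job configuration.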
This can be easily modified to get the number of desired maps instead of regions per map (just a matter of taste here :))

Cosmin

On Jun 21, 2011, at 4:18 AM, Ma, Ming wrote:

> TableInputFormat creates one split/mapper task per region. In the case of
> lots of small regions, the overhead of map reduce framework becomes overhead.
> There are some related work items that could address this issue.
>
> 1. Reduce the number of small regions.
> https://issues.apache.org/jira/browse/HBASE-420
>
> 2. Improvement in map reduce framework to handle small jobs.
> https://issues.apache.org/jira/browse/MAPREDUCE-1220
>
> Another quick way to solve this is to just improve TableInputFormat so that
> it can pack a configurable number of regions from a given region server into
> one mapper task. I tested this approach and was able to achieve 40%
> improvement on map job latency.
>
> Any feedback?
>
> Ming
