Re: TableInputFormat improvement to handle lots of small regions

Cosmin Lehene Wed, 22 Jun 2011 05:09:25 -0700

We overrode getSplits so that it calls super.getSplits and then, using a 
configuration variable (splitsPerMap), outputs another set of splits that 
basically merges the original splits array by manipulating start/stop rows.


This can easily be modified to take the desired number of maps instead of 
the number of regions per map (just a matter of taste here :))
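
A minimal sketch of this kind of merge is below. It is an illustration, not 
the code from the override described above. RowRange and SplitMerger are 
hypothetical names, not HBase classes; a real override would operate on the 
split list returned by super.getSplits. The sketch assumes the per-region 
splits are sorted by start row and contiguous, and it ignores split location.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitMerger {

    /** Hypothetical stand-in for a table split: the row range [startRow, endRow). */
    static final class RowRange {
        final String startRow; // inclusive; "" stands for the start of the table
        final String endRow;   // exclusive; "" stands for the end of the table

        RowRange(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /**
     * Merges every splitsPerMap consecutive ranges into a single range that
     * runs from the first range's start row to the last range's end row.
     * Assumes the input is sorted by start row and contiguous.
     */
    static List<RowRange> merge(List<RowRange> splits, int splitsPerMap) {
        if (splitsPerMap < 1) {
            throw new IllegalArgumentException("splitsPerMap must be >= 1");
        }
        List<RowRange> merged = new ArrayList<>();
        for (int i = 0; i < splits.size(); i += splitsPerMap) {
            // Index of the last original split that falls into this group.
            int last = Math.min(i + splitsPerMap, splits.size()) - 1;
            merged.add(new RowRange(splits.get(i).startRow, splits.get(last).endRow));
        }
        return merged;
    }

    /**
     * Variant input: turns a desired number of maps into a splitsPerMap value
     * using ceiling division, so merge() yields at most desiredMaps ranges.
     */
    static int splitsPerMapFor(int splitCount, int desiredMaps) {
        if (desiredMaps < 1) {
            throw new IllegalArgumentException("desiredMaps must be >= 1");
        }
        return Math.max(1, (splitCount + desiredMaps - 1) / desiredMaps);
    }
}
```

With five regions and splitsPerMap set to 2, merge produces three ranges, the 
last covering a single region. Because splitsPerMapFor rounds up, the number 
of resulting maps can be lower than the requested number, never higher.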

Cosmin

On Jun 21, 2011, at 4:18 AM, Ma, Ming wrote:

> TableInputFormat creates one split/mapper task per region. In the case of 
> lots of small regions, the per-task overhead of the MapReduce framework 
> becomes significant. There are some related work items that could address 
> this issue.
> 
> 
> 1. Reduce the number of small regions. 
> https://issues.apache.org/jira/browse/HBASE-420
> 
> 2. Improvement in map reduce framework to handle small jobs. 
> https://issues.apache.org/jira/browse/MAPREDUCE-1220
> 
> Another quick way to solve this is to improve TableInputFormat so that it 
> can pack a configurable number of regions from a given region server into 
> one mapper task. I tested this approach and was able to achieve a 40% 
> improvement in map job latency.
> 
> Any feedback?
> 
> Ming
