[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250659#comment-14250659
 ] 

Jonathan Hsieh commented on HBASE-12590:
----------------------------------------

FYI, while working in other code I found this, which handles the uniform region 
split case.  It might make sense to fold the ASCII splitter into that form 
and use this existing, long-tested code path.

https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L1032

> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job 
> (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job 
> (Version3).pdf, HBASE-12590-v3.patch, HBASE-12590-v4.patch, 
> HBase-12590-v1.patch, HBase-12590-v2.patch
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table may 
> contain many small regions and several large regions. Small regions 
> waste computing resources: a job that scans a table with 3000 small 
> regions needs 3000 mappers. Large regions stall the job: if one region in 
> a 100-region table is far larger than the other 99, a job that takes the 
> table as input finishes 99 mappers very quickly and then waits a long 
> time for the last mapper.
> 2, Configuration
> Add three new configuration properties:
> hbase.mapreduce.input.autobalance = true enables “auto balance” in 
> HBase-MapReduce jobs. The default value is false. 
> hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a 
> region is larger than 3x the average region size, it is treated as 
> “proportionately too large”.
> hbase.table.row.textkey = true means the row key is text; false means the 
> row key is binary. It is used to find the middle row key of a large region. 
> The default value is true. 
> If (region size >= average size*ratio): cut the region into two MR input 
> splits.
> If (average size <= region size < average size*ratio): use the region as one 
> MR input split.
> If (total size of several consecutive regions < average size): combine these 
> regions into one MR input split.
> Example:
> See the attachments.
> Reviews are welcome on Review Board:
> https://reviews.apache.org/r/28494/diff/#
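
For readers skimming the quoted description, the three sizing rules can be 
sketched roughly as below. This is only an illustration under stated 
assumptions, not the code in the attached patches: the class name 
AutoBalanceSketch, the method plan, and the string labels are hypothetical, 
region sizes are plain numbers, and small regions are grouped greedily (a 
group ends when adding the next region would reach the average). Finding the 
actual middle row key of a large region, which is what 
hbase.table.row.textkey is for, is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the sizing rules; not the HBASE-12590 patch itself.
final class AutoBalanceSketch {

    /** Returns one label per planned MR input split. */
    static List<String> plan(String[] names, long[] sizes, double maxSkewRatio) {
        List<String> splits = new ArrayList<>();
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        if (total == 0) {
            // No size information (or no regions): keep one split per region.
            for (String name : names) {
                splits.add(name);
            }
            return splits;
        }
        double average = (double) total / sizes.length;

        int i = 0;
        while (i < sizes.length) {
            if (sizes[i] >= average * maxSkewRatio) {
                // Rule 1: a proportionately too large region becomes two splits.
                splits.add(names[i] + "[first half]");
                splits.add(names[i] + "[second half]");
                i++;
            } else if (sizes[i] >= average) {
                // Rule 2: a region of ordinary size stays one split.
                splits.add(names[i]);
                i++;
            } else {
                // Rule 3 (assumed greedy): merge consecutive small regions
                // while their combined size stays below the average.
                StringBuilder group = new StringBuilder(names[i]);
                long sum = sizes[i];
                i++;
                while (i < sizes.length && sum + sizes[i] < average) {
                    group.append('+').append(names[i]);
                    sum += sizes[i];
                    i++;
                }
                splits.add(group.toString());
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] names = {"r1", "r2", "r3", "r4", "r5", "r6"};
        long[] sizes = {1, 1, 1, 20, 5, 2}; // average is 5, skew threshold is 15
        System.out.println(plan(names, sizes, 3.0));
    }
}
```

With these made-up sizes the six regions yield five splits: r1, r2 and r3 
are merged, r4 is cut in two, and r5 and r6 each stay whole.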



