[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258224#comment-14258224 ]

Jonathan Hsieh commented on HBASE-12590:
----------------------------------------

Thanks [~yeweichen]!

> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>            Assignee: Weichen Ye
>             Fix For: 2.0.0
>
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job
> (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job
> (Version3).pdf, HBASE-12590-v3.patch, HBASE-12590-v4.patch,
> HBase-12590-v1.patch, HBase-12590-v2.patch
>
>
> 1, Motivation
> In a production environment, data skew is very common. An HBase table may
> contain many small regions and a few large regions. Small regions waste
> computing resources: a job that scans a table with 3000 small regions needs
> 3000 mappers. Large regions stall the job: if one region in a 100-region
> table is far larger than the other 99, then 99 mappers finish very quickly
> and the job waits a long time for the last mapper.
> 2, Configuration
> Add three new configuration properties:
> hbase.mapreduce.input.autobalance = true enables “auto balance” in
> HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a
> region is larger than 3x the average region size, the region is treated as
> “proportionately too large”.
> hbase.table.row.textkey = true means the row key is text; false means the
> row key is binary. It is used to find the middle row key in a large region.
> The default value is true.
> If (region size >= average size * ratio): cut the region into two MR input
> splits.
> If (average size <= region size < average size * ratio): use the region as
> one MR input split.
> If (sum of the sizes of several contiguous regions < average size): combine
> these regions into one MR input split.
> Example:
> In attachment
> Reviews are welcome on the Review Board:
> https://reviews.apache.org/r/28494/diff/#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
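
For readers following the three split rules in the quoted description, here is a minimal Java sketch of how they might be applied to a list of region sizes. It is an illustration only, not the code in the attached patches: the class name SkewSplitSketch, the method plan, and the string labels are hypothetical, regions are identified by their index in row-key order, and the sketch only marks a large region as "first half" / "second half" without computing the middle row key (which is what hbase.table.row.textkey is described as controlling). The greedy left-to-right merging of small regions is also an assumption; the description does not say how runs of small regions are chosen.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the split rules; not taken from the HBASE-12590 patches. */
class SkewSplitSketch {

    /**
     * Plans MR input splits from region sizes (any consistent unit, e.g. MB).
     * maxSkewRatio plays the role of hbase.mapreduce.input.autobalance.maxskewratio.
     */
    static List<String> plan(long[] regionSizes, double maxSkewRatio) {
        List<String> splits = new ArrayList<>();
        if (regionSizes.length == 0) {
            return splits;
        }

        long total = 0;
        for (long size : regionSizes) {
            total += size;
        }
        double average = (double) total / regionSizes.length;
        double threshold = average * maxSkewRatio;

        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (average > 0 && size >= threshold) {
                // Rule 1: a "proportionately too large" region becomes two splits.
                // (The average > 0 guard avoids halving regions of an empty table.)
                splits.add("region " + i + " [first half]");
                splits.add("region " + i + " [second half]");
                i++;
            } else if (size >= average) {
                // Rule 2: a region of ordinary size maps to exactly one split.
                splits.add("region " + i);
                i++;
            } else {
                // Rule 3: greedily merge following regions while the combined
                // size stays below the average (assumed merging strategy).
                int start = i;
                long combined = size;
                while (i + 1 < regionSizes.length
                        && combined + regionSizes[i + 1] < average) {
                    combined += regionSizes[i + 1];
                    i++;
                }
                splits.add(start == i ? "region " + i : "regions " + start + "-" + i);
                i++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up sizes: three small regions, two ordinary ones, one very large one.
        long[] sizes = {10, 10, 10, 100, 40, 200};
        for (String split : plan(sizes, 3.0)) {
            System.out.println(split);
        }
    }
}
```

With the made-up sizes in main, the three leading small regions are merged into one split and only the last region is large enough to be halved, so six regions yield five splits instead of six mappers of very uneven duration.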