[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195123#comment-16195123 ]

Hudson commented on HBASE-12590:

Results for branch HBASE-18467, done in 4 hr 24 min and counting
[build #136 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18467/136/]: FAILURE

details (if available):

(x) *{color:red}-1 overall{color}*
Committer, please check your recent inclusion of a patch for this issue.

(x) {color:red}-1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18467/136//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 checks{color}
-- For more information [see jdk8 report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18467/136//JDK8_Nightly_Build_Report/]

(x) {color:red}-1 source release artifact{color}
-- See build output for details.

> A solution for data skew in HBase-Mapreduce Job
> ---
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Assignee: Weichen Ye
> Fix For: 2.0.0
>
> Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf, HBase-12590-v1.patch, HBase-12590-v2.patch, HBASE-12590-v3.patch, HBASE-12590-v4.patch
>
>
> 1, Motivation
> In a production environment, data skew is very common. An HBase table may
> contain many small regions and several large regions. Small regions
> waste a lot of computing resources: if we use a job to scan a table with 3000
> small regions, we need a job with 3000 mappers. Large regions always block
> the job: if one region in a 100-region table is far larger than the other 99
> regions, then when we run a job with the table as input, 99 mappers
> complete very quickly and we have to wait a long time for the last mapper.
> 2, Configuration
> Add three new configuration options:
> hbase.mapreduce.input.autobalance = true enables “auto balance” in
> HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a region's
> size is larger than 3x the average region size, the region is treated as
> “proportionately too large”.
> hbase.table.row.textkey = true means the row key is text; false means a binary
> row key. It is used to find the mid row key in a large region. The default
> value is true.
> If (region size >= average size*ratio): cut the region into two MR input
> splits.
> If (average size <= region size < average size*ratio): use one region as one MR
> input split.
> If (sum of the sizes of several consecutive regions < average size): combine these
> regions into one MR input split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
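The three sizing rules in the quoted issue description can be sketched in a few lines of plain Java. This is only an illustration of the rules as worded in the description, not the code from the attached patches: the class name `AutoBalanceSketch`, the method `plan`, and the string labels `halve`, `keep` and `merge` are hypothetical, region sizes are given as bare numbers in any consistent unit, and the merge step assumes that a run of small regions is extended only while its running sum stays below the average.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the auto-balance rules; not the HBase implementation.
class AutoBalanceSketch {

    /**
     * Plans input splits from region sizes listed in table order.
     * Each entry names an action and the region index (or index range) it covers.
     */
    static List<String> plan(long[] regionSizes, double maxSkewRatio) {
        List<String> plan = new ArrayList<>();
        if (regionSizes.length == 0) {
            return plan;
        }
        long total = 0;
        for (long size : regionSizes) {
            total += size;
        }
        double average = (double) total / regionSizes.length;
        double skewThreshold = average * maxSkewRatio;

        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (size >= skewThreshold) {
                // Rule 1: oversized region, cut into two input splits.
                plan.add("halve " + i);
                i++;
            } else if (size >= average) {
                // Rule 2: average-sized region, one region becomes one input split.
                plan.add("keep " + i);
                i++;
            } else {
                // Rule 3: small region, absorb the following regions while the
                // running sum stays below the average (assumed stopping condition).
                long sum = size;
                int last = i;
                while (last + 1 < regionSizes.length
                        && sum + regionSizes[last + 1] < average) {
                    sum += regionSizes[last + 1];
                    last++;
                }
                plan.add(last == i ? "keep " + i : "merge " + i + ".." + last);
                i = last + 1;
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 10, 10, 100, 400, 10, 10};
        // Average is about 78.6, so the skew threshold at ratio 3 is about 235.7.
        System.out.println(plan(sizes, 3.0));
        // prints [merge 0..2, keep 3, halve 4, merge 5..6]
    }
}
```

With a ratio of 3, the 400-unit region is the only one cut in two, the 100-unit region maps to a single split, and the runs of 10-unit regions on either side collapse into one split each, so seven regions yield five mappers instead of seven. Choosing the actual mid row key for the halved region depends on the `hbase.table.row.textkey` setting and is left out of the sketch.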
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190787#comment-16190787 ]

Hudson commented on HBASE-12590:

SUCCESS: Integrated in Jenkins build HBase-Trunk_matrix #3823 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3823/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 (apurtell: rev 16d483f9003ddee71404f37ce7694003d1a18ac4)
* (edit) hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190768#comment-16190768 ]

Hudson commented on HBASE-12590:

SUCCESS: Integrated in Jenkins build HBase-1.4 #940 (See [https://builds.apache.org/job/HBase-1.4/940/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 (apurtell: rev cbbcb2db2f0a94382cb33fef826cbf1a00b5de6e)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/namespace/TestNamespaceAuditor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190744#comment-16190744 ]

Hudson commented on HBASE-12590:

FAILURE: Integrated in Jenkins build HBase-1.5 #84 (See [https://builds.apache.org/job/HBase-1.5/84/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 (apurtell: rev fc783ef04505eab7e58c6abc3ac1f7d7ecce465b)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/namespace/TestNamespaceAuditor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190661#comment-16190661 ]

Hudson commented on HBASE-12590:

FAILURE: Integrated in Jenkins build HBase-2.0 #622 (See [https://builds.apache.org/job/HBase-2.0/622/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 (apurtell: rev 4475ba88c15886bd15c113f2dbd5214600686cfe)
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356415#comment-14356415 ]

Hudson commented on HBASE-12590:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #847 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/847/])
HBASE-13168 Backport HBASE-12590 "A solution for data skew in HBase-Mapreduce Job" (tedyu: rev 1b4f8afaec8cd4dfef46154bdceb31ce7ddf5982)
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356365#comment-14356365 ]

Hudson commented on HBASE-12590:

FAILURE: Integrated in HBase-0.98 #890 (See [https://builds.apache.org/job/HBase-0.98/890/])
HBASE-13168 Backport HBASE-12590 "A solution for data skew in HBase-Mapreduce Job" (tedyu: rev 1b4f8afaec8cd4dfef46154bdceb31ce7ddf5982)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356251#comment-14356251 ]

Hudson commented on HBASE-12590:

SUCCESS: Integrated in HBase-1.1 #276 (See [https://builds.apache.org/job/HBase-1.1/276/])
HBASE-13168 Backport HBASE-12590 "A solution for data skew in HBase-Mapreduce Job" (tedyu: rev 05aef46d942a0196c6c655ab19a160cd7dc56789)
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356236#comment-14356236 ]

Hudson commented on HBASE-12590:

FAILURE: Integrated in HBase-1.0 #795 (See [https://builds.apache.org/job/HBase-1.0/795/])
HBASE-13168 Backport HBASE-12590 "A solution for data skew in HBase-Mapreduce Job" (tedyu: rev 89112e84957558f31c161256aa2d7054f165ca02)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258610#comment-14258610 ]

Weichen Ye commented on HBASE-12590:

Thank you [~jmhsieh] for your help and comments!
I'll continue working on HBASE-12716.
And Merry Christmas:)
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258282#comment-14258282 ]

Hudson commented on HBASE-12590:

SUCCESS: Integrated in HBase-TRUNK #5965 (See [https://builds.apache.org/job/HBase-TRUNK/5965/])
HBASE-12590 A solution for data skew in HBase-Mapreduce jobs (Weichen Ye) (jmhsieh: rev a912a56b38fca6aada68dab5ef73613c073cbc6a)
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258224#comment-14258224 ]

Jonathan Hsieh commented on HBASE-12590:

Thanks [~yeweichen]!

> A solution for data skew in HBase-Mapreduce Job
> ---
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Assignee: Weichen Ye
> Fix For: 2.0.0
>
> Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf, HBASE-12590-v3.patch, HBASE-12590-v4.patch, HBase-12590-v1.patch, HBase-12590-v2.patch
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table may contain a lot of small regions and several large regions. Small regions waste a lot of computing resources: if we use a job to scan a table with 3000 small regions, we need a job with 3000 mappers. Large regions block the job: if one region in a 100-region table is far larger than the other 99, then when we run a job with the table as input, 99 mappers complete very quickly and we then wait a long time for the last mapper.
> 2, Configuration
> Add three new configurations:
> hbase.mapreduce.input.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a region's size is larger than 3x the average region size, treat the region as “proportionately too large”.
> hbase.table.row.textkey = true means the row key is text; false means a binary row key. It is used to find the middle row key in a large region. The default value is true.
> If (region size >= average size*ratio): cut the region into two MR input splits.
> If (average size <= region size < average size*ratio): one region as one MR input split.
> If (sum of several contiguous regions' sizes < average size): combine these regions into one MR input split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
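The three sizing rules in the quoted description can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in the attached patches: the class `AutoBalancePlanner`, the method `plan`, and the string labels are hypothetical, region sizes are given as a plain array in key order, and a small region that cannot be merged with its neighbour is assumed to become its own split (the description does not say).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of HBase: plans MR input splits from region sizes. */
class AutoBalancePlanner {

    /**
     * Applies the three rules from the description to regions listed in key order.
     * Labels: "R4/lower" and "R4/upper" for a halved region, "R0..R1" for a merged run.
     */
    static List<String> plan(long[] regionSizes, double maxSkewRatio) {
        List<String> splits = new ArrayList<>();
        if (regionSizes.length == 0) {
            return splits;
        }
        long total = 0;
        for (long size : regionSizes) {
            total += size;
        }
        double average = (double) total / regionSizes.length;

        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (size >= average * maxSkewRatio) {
                // Rule 1: oversized region, cut into two input splits.
                splits.add("R" + i + "/lower");
                splits.add("R" + i + "/upper");
                i++;
            } else if (size >= average) {
                // Rule 2: ordinary region, one input split.
                splits.add("R" + i);
                i++;
            } else {
                // Rule 3: extend a run of neighbours while the combined size stays below average.
                int first = i;
                long sum = size;
                i++;
                while (i < regionSizes.length && sum + regionSizes[i] < average) {
                    sum += regionSizes[i];
                    i++;
                }
                int last = i - 1;
                splits.add(first == last ? "R" + first : "R" + first + "..R" + last);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(plan(new long[] {1, 1, 1, 1, 10, 3, 1, 1}, 3.0));
    }
}
```

With the sample sizes in `main` the average is 2.375, so region 4 (size 10) is halved, region 5 (size 3) stays whole, and the small regions are merged in pairs. Finding the actual row key at which to halve a region is a separate problem and is not covered here.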
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258216#comment-14258216 ]

Jonathan Hsieh commented on HBASE-12590:

Nice catches. It would be nice to port a correct algorithm over into these places.
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251686#comment-14251686 ]

Weichen Ye commented on HBASE-12590:

[~j...@cloudera.com] Hi, I tried the algorithm in RegionSplitter, but I found a small bug in it. If the start key is the same length as the end key and their last bytes are adjacent in alphabetical order, the algorithm does not calculate a split point with an additional byte. This split algorithm is not closely related to the data skew in HBase-MapReduce jobs, so I created two new issues for it:
https://issues.apache.org/jira/browse/HBASE-12716
https://issues.apache.org/jira/browse/HBASE-12717
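The edge case described in the comment above (equal-length keys whose last bytes are adjacent) can be illustrated with a small, self-contained sketch. It is not the RegionSplitter code; the class `RowKeyMidpoint` and its `extraBytes` parameter are hypothetical, and keys are treated as unsigned big-endian numbers padded with zero bytes. With no extra byte the computed midpoint collapses onto the start key; with one extra byte a usable split point appears.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical illustration, not HBase code: midpoint of two row keys as unsigned numbers. */
class RowKeyMidpoint {

    /** Pads both keys with zero bytes to a common length plus extraBytes, then averages them. */
    static byte[] midpoint(byte[] start, byte[] end, int extraBytes) {
        int length = Math.max(start.length, end.length) + extraBytes;
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, length));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, length));
        return toFixed(lo.add(hi).shiftRight(1), length);
    }

    /** Writes a non-negative value into exactly `length` bytes, big-endian. */
    private static byte[] toFixed(BigInteger value, int length) {
        byte[] raw = value.toByteArray(); // may carry a leading zero sign byte or be shorter
        byte[] out = new byte[length];
        int n = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - n, out, length - n, n);
        return out;
    }

    public static void main(String[] args) {
        byte[] start = {'a', 'a', 'a'};
        byte[] end = {'a', 'a', 'b'};
        System.out.println(Arrays.toString(midpoint(start, end, 0))); // same as the start key
        System.out.println(Arrays.toString(midpoint(start, end, 1))); // one byte longer
    }
}
```

The result is a binary key. For text row keys (hbase.table.row.textkey = true) the extra byte would have to be mapped into the printable range, and an empty end key (the last region of a table) needs separate handling; both are left out of this sketch.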
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250659#comment-14250659 ]

Jonathan Hsieh commented on HBASE-12590:

FYI, while working in other code I found this, which handles the uniform region split case. It might make sense to fold the ASCII splitter into that form and use this existing, long-tested code path.
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L1032
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249751#comment-14249751 ] Hadoop QA commented on HBASE-12590: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12687706/HBASE-12590-v4.patch against master branch at commit 99a11390b4758c211af04af2ca0696ac6e3e0aeb. ATTACHMENT ID: 12687706 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 2086 checkstyle errors (more than the master's current 2084 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/12107//console This message is automatically generated. 
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249605#comment-14249605 ] Hadoop QA commented on HBASE-12590: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12687680/HBASE-12590-v4.patch against master branch at commit 99a11390b4758c211af04af2ca0696ac6e3e0aeb. ATTACHMENT ID: 12687680 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 2086 checkstyle errors (more than the master's current 2084 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. 
The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestPerColumnFamilyFlush Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/12105//console This message is 
automatically generated.
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248276#comment-14248276 ] Hadoop QA commented on HBASE-12590: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12687480/HBASE-12590-v3.patch against master branch at commit 96c6b9815ddbc9f2589655df4ad2381af04ac9f8. ATTACHMENT ID: 12687480 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 2091 checkstyle errors (more than the master's current 2089 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/12096//console This message is automatically generated. 
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248199#comment-14248199 ]

Weichen Ye commented on HBASE-12590:

Latest diff on review board: https://reviews.apache.org/r/28494/diff/

> A solution for data skew in HBase-Mapreduce Job
> ---
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf, HBASE-12590-v3.patch, HBase-12590-v1.patch, HBase-12590-v2.patch
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table always contains a lot of small regions and several large regions. Small regions waste a lot of computing resources: if we use a job to scan a table with 3000 small regions, we need a job with 3000 mappers. Large regions always block the job: if one region in a 100-region table is far larger than the other 99, then when we run a job with the table as input, 99 mappers complete very quickly and we wait a long time for the last mapper.
> 2, Configuration
> Add two new configurations:
> hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of MapReduce splits.
> If a region's size is larger than the target size, cut the region into two splits. If the sum of the sizes of several small contiguous regions is less than the target size, combine these regions into one split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233780#comment-14233780 ]

Weichen Ye commented on HBASE-12590:

[~jmhsieh] Thank you for your review! It really helps me a lot. I'll continue improving the patch based on your comments.
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233278#comment-14233278 ]

Jonathan Hsieh commented on HBASE-12590:

{quote}
2) It is a difficult issue in this patch. It is hard (~for me) to split a large region into several small "MR input splits" with target size ( we have only "start rowkey", "end rowkey" and the Region size). So my point is just find a "mid rowkey" between "start rowkey" and "end rowkey". Do you have any ideas about this? For instance if we split a 5GB region into five 1GB MR input splits, how to find the split point(rowkey) to make the size of these MR input splits equal to 1GB?
{quote}

Internally, the split operation tries to read the cell closest to the midpoint of the HFiles and doesn't make rowkey distribution assumptions [1,2,3]. These values, however, are not exposed for the MR format to use.

In v1 and v2, the patch calculates a split point assuming an ASCII-centric, uniform distribution of rowkeys in the input split. You should at least note that in the docs. Since you are generating the split point based on the uniform distribution assumption, you can probably calculate more split points relatively easily.

Thanks for posting on review board; I've added more comments there.

[1] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6023
[2] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.java#L67
[3] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java#L670
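The remark above, that a uniform-distribution assumption makes it easy to calculate more than one split point, can be sketched as follows. This is an assumption-laden illustration rather than the patch's code: the class `UniformSplitPoints` and its method are hypothetical, keys are treated as unsigned big-endian numbers padded with zero bytes, and row keys are assumed to be spread evenly between the start and end key, which real tables often violate.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration, not HBase code: evenly spaced split points between two keys. */
class UniformSplitPoints {

    /** Returns parts - 1 keys that cut [start, end) into `parts` equal-width key ranges. */
    static List<byte[]> splitPoints(byte[] start, byte[] end, int parts) {
        if (parts < 2) {
            throw new IllegalArgumentException("parts must be at least 2");
        }
        // One extra byte of precision so that close keys still yield distinct points.
        int length = Math.max(start.length, end.length) + 1;
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, length));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, length));
        BigInteger range = hi.subtract(lo);
        if (range.signum() <= 0) {
            throw new IllegalArgumentException("start key must sort before end key");
        }
        List<byte[]> points = new ArrayList<>();
        for (int k = 1; k < parts; k++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(k)).divide(BigInteger.valueOf(parts));
            points.add(toFixed(lo.add(offset), length));
        }
        return points;
    }

    /** Writes a non-negative value into exactly `length` bytes, big-endian. */
    private static byte[] toFixed(BigInteger value, int length) {
        byte[] raw = value.toByteArray(); // may carry a leading zero sign byte or be shorter
        byte[] out = new byte[length];
        int n = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - n, out, length - n, n);
        return out;
    }

    public static void main(String[] args) {
        // Five-way cut, as in the 5GB-into-1GB example from the quoted question.
        for (byte[] point : splitPoints(new byte[] {0x00}, new byte[] {0x0A}, 5)) {
            System.out.println(Arrays.toString(point));
        }
    }
}
```

Equal-width key ranges only produce equal-size input splits when the data really is uniform over the key space, which is why the comment asks for the assumption to be documented.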
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232972#comment-14232972 ]

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12684872/HBase-12590-v2.patch
against master branch at commit 13a1eaec09a467153adc1ee0b46df9f457da6115.
ATTACHMENT ID: 12684872

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at org.apache.hadoop.hbase.master.balancer.TestDefaultLoadBalancer.testBalanceCluster(TestDefaultLoadBalancer.java:119)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11927//console

This message is automatically generated.

> A solution for data skew in HBase-Mapreduce Job
> ---
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf, HBase-12590-v1.patch, HBase-12590-v2.patch
>
>
> 1, Motivation
> In a production environment, data skew is very common. An HBase table always contains a lot of small regions and several large regions. Small regions waste a lot of computing resources: if we use a job to scan a table with 3000 small regions, we need a job with 3000 mappers. Large regions always block the job: if one region in a 100-region table is far larger than the other 99 regions, then when we run a job with the table as input, 99 mappers complete very quickly and we wait a long time for the last mapper.
> 2, Configuration
> Add two new configuration properties.
> hbase.mapreduce.split.autobalance = true means enabling the “auto balance” in HBase-MapReduce jobs. The de
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230866#comment-14230866 ]

Weichen Ye commented on HBASE-12590:

[~jmhsieh] Thank you for your review and your advice!

1) The word "split" may be confusing or misleading here. I'll change the code and the docs to make this clear.
2) This is a difficult issue in this patch. It is hard (at least for me) to split a large region into several small "MR input splits" of a target size, because we have only the "start rowkey", the "end rowkey" and the region size. So my approach is just to find a "mid rowkey" between the "start rowkey" and the "end rowkey". Do you have any ideas about this? For instance, if we split a 5GB region into five 1GB MR input splits, how do we find the split points (rowkeys) that make each of these MR input splits equal to 1GB?
3) You gave me a great idea! I totally agree with setting a ratio rather than a constant size in the configuration. I'll make a new patch this way this week.

> A solution for data skew in HBase-Mapreduce Job
> ---
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
>
>
> 1, Motivation
> In a production environment, data skew is very common. An HBase table always contains a lot of small regions and several large regions. Small regions waste a lot of computing resources: if we use a job to scan a table with 3000 small regions, we need a job with 3000 mappers. Large regions always block the job: if one region in a 100-region table is far larger than the other 99 regions, then when we run a job with the table as input, 99 mappers complete very quickly and we wait a long time for the last mapper.
> 2, Configuration
> Add two new configuration properties.
> hbase.mapreduce.split.autobalance = true means enabling the “auto balance” in HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of MapReduce splits.
> If a region is larger than the target size, cut the region into two splits. If the total size of several small contiguous regions is less than the target size, combine these regions into one split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
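One way to picture the "mid rowkey" question in point 2 is to interpolate evenly in key space between the start and end row keys. The sketch below is only an illustration under stated assumptions: `RowKeyInterpolation` and `splitPoints` are hypothetical names, not HBase API, and both boundary keys are assumed to be non-empty (the empty keys of a table's first and last regions are not handled). Keys are treated as unsigned big-endian numbers after right-padding with zero bytes.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helper, not part of HBase: interpolates row keys between two boundaries. */
final class RowKeyInterpolation {

    /**
     * Returns pieces - 1 keys that divide the key range between start and end
     * into parts of equal width in key space (not necessarily equal data size).
     */
    static byte[][] splitPoints(byte[] start, byte[] end, int pieces) {
        if (pieces < 2) {
            throw new IllegalArgumentException("pieces must be at least 2");
        }
        // One extra byte of width keeps adjacent keys such as "a" and "b" divisible.
        int width = Math.max(start.length, end.length) + 1;
        // Arrays.copyOf right-pads with zero bytes, which preserves byte-wise ordering.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger span = hi.subtract(lo);
        byte[][] points = new byte[pieces - 1][];
        for (int i = 1; i < pieces; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(pieces));
            points[i - 1] = toFixedWidth(lo.add(offset), width);
        }
        return points;
    }

    /** Converts a non-negative value to exactly width bytes, big-endian. */
    private static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be short
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

Calling `splitPoints(start, end, 5)` would give four candidate boundaries for the 5GB example. Equal key ranges only yield equal data sizes when rows are spread uniformly over the key space, which is exactly the difficulty raised above: start key, end key and region size alone do not say where the data sits.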
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230024#comment-14230024 ]

Jonathan Hsieh commented on HBASE-12590:

Nice description of the problem in the slide deck. I did a quick scan of the docs and the code and had a few questions.

1) The word "split" is ambiguous. We need to make it clear in the javadoc that this is only an "MR input split" and not an "hbase region split" operation, which would trigger a lot of IO.
2) Why do we only split by 2? Why not split further so that we have n MR input splits that are 1GB each (in your example) instead of 2x 3GB, 2x 2.5GB and 2x 1GB "artificial" MR input splits?
3) To make this easier for users, do you think it might make sense to use something other than a constant size (which assumes the user knows the server-side region size property)? Can we look at all of the region sizes (we have the info already with the RegionSizeCalculator), and just add new MR input splits for the regions that are proportionately too large? Maybe we have the setting be a ratio (maybe 5x-10x) larger than the median region size? That way the job won't have to change if the server-side setting changes.
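To make point 3 concrete, here is a minimal sketch of flagging regions that are "proportionately too large" relative to the median region size, with a piece count in the spirit of point 2. Everything in it is an assumption for illustration: `SkewDetector`, `oversized` and `piecesFor` are hypothetical names, region sizes are passed in as a plain array rather than read from RegionSizeCalculator, and cutting a large region into roughly median-sized pieces is just one possible policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of HBase: finds regions that are large relative to the median. */
final class SkewDetector {

    /** Median of the region sizes (mean of the two middle values for an even count). */
    static long median(long[] sizes) {
        if (sizes.length == 0) {
            throw new IllegalArgumentException("no regions");
        }
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n % 2 == 1) {
            return sorted[n / 2];
        }
        long a = sorted[n / 2 - 1];
        long b = sorted[n / 2];
        return a + (b - a) / 2; // avoids overflow of a + b
    }

    /** Indexes of regions whose size exceeds ratio times the median size. */
    static List<Integer> oversized(long[] sizes, double ratio) {
        double threshold = ratio * median(sizes);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] > threshold) {
                result.add(i);
            }
        }
        return result;
    }

    /** Number of input splits for one region: ceil(size / median), at least 1. */
    static int piecesFor(long size, long median) {
        if (median <= 0) {
            return 1;
        }
        return (int) Math.max(1L, (size + median - 1) / median);
    }
}
```

Because the threshold is derived from the table's own region sizes, a job configured with a ratio such as 5 would not need to change when the server-side region size setting changes, which is the property asked for in point 3.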
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227259#comment-14227259 ]

Hadoop QA commented on HBASE-12590:
---

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12683988/HBase-12590-v1.patch
against master branch at commit f0d95e7f11403d67b4fc3f1fd4ef048047b6842a.
ATTACHMENT ID: 12683988

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11849//console

This message is automatically generated.
[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227150#comment-14227150 ]

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12683981/HBase-12590-v1.patch
against master branch at commit f0d95e7f11403d67b4fc3f1fd4ef048047b6842a.
ATTACHMENT ID: 12683981

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests.
{color:red}-1 javac{color}. The patch appears to cause mvn compile goal to fail.

Compilation errors resume:
[ERROR] COMPILATION ERROR :
[ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java:[45,48] cannot find symbol
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project hbase-server: Compilation failure
[ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java:[45,48] cannot find symbol
[ERROR] symbol: class HLog
[ERROR] location: package org.apache.hadoop.hbase.regionserver.wal
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :hbase-server

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11848//console

This message is automatically generated.

> A solution for data skew in HBase-Mapreduce Job
>
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 2.0.0
> Reporter: Weichen Ye
> Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
>
>
> 1, Motivation
> In a production environment, data skew is very common. An HBase table always contains a lot of small regions and several large regions. Small regions waste a lot of computing resources: if we use a job to scan a table with 3000 small regions, we need a job with 3000 mappers. Large regions always block the job: if one region in a 100-region table is far larger than the other 99 regions, then when we run a job with the table as input, 99 mappers complete very quickly and we wait a long time for the last mapper.
> 2, Configuration
> Add two new configuration properties.
> hbase.mapreduce.split.autobalance = true means enabling the “auto balance” in HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of MapReduce splits.
> If a region is larger than the target size, cut the region into two splits. If the total size of several small contiguous regions is less than the target size, combine these regions into one split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
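The cut-or-combine rule from the v1 description can be sketched as a single pass over the regions in key order. This is a rough illustration only, not the code in the attached patch: `SplitPlanner` and `plan` are hypothetical names, regions are identified by their index, and each planned MR input split is described by a plain string instead of a real input split object.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner, not the patch itself: applies the v1 target-size rule. */
final class SplitPlanner {

    /**
     * Walks regions in key order. A region larger than targetSize becomes two
     * input splits; neighbouring small regions are grouped while their total
     * size stays below targetSize.
     */
    static List<String> plan(long[] regionSizes, long targetSize) {
        List<String> splits = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // small regions waiting to be combined
        long pendingSize = 0;
        for (int i = 0; i < regionSizes.length; i++) {
            long size = regionSizes[i];
            if (size > targetSize) {
                // Close the current group first so only contiguous regions are merged.
                flush(pending, splits);
                pendingSize = 0;
                splits.add("r" + i + " (1/2)");
                splits.add("r" + i + " (2/2)");
            } else {
                if (!pending.isEmpty() && pendingSize + size >= targetSize) {
                    flush(pending, splits);
                    pendingSize = 0;
                }
                pending.add("r" + i);
                pendingSize += size;
            }
        }
        flush(pending, splits);
        return splits;
    }

    /** Emits the pending group as one combined input split and clears it. */
    private static void flush(List<String> pending, List<String> splits) {
        if (!pending.isEmpty()) {
            splits.add(String.join("+", pending));
            pending.clear();
        }
    }
}
```

With the v1 property names, the target would come from `hbase.mapreduce.split.targetsize` and the whole pass would be skipped unless `hbase.mapreduce.split.autobalance` is true. Where exactly a large region is cut (the "mid rowkey") is left out of this sketch.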