[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated HBASE-12590:
-----------------------------------
    Assignee: Weichen Ye

> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Assignee: Weichen Ye
> Fix For: 2.0.0
>
> Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf,
> A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf,
> HBASE-12590-v3.patch, HBASE-12590-v4.patch, HBase-12590-v1.patch,
> HBase-12590-v2.patch
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table may
> contain many small regions and a few large regions. Small regions waste
> computing resources: a job that scans a table with 3000 small regions needs
> 3000 mappers. Large regions hold up the job: if one region in a 100-region
> table is far larger than the other 99, then when we run a job with the table
> as input, 99 mappers complete very quickly and we wait a long time for the
> last mapper.
> 2, Configuration
> Add three new configuration properties:
> hbase.mapreduce.input.autobalance = true enables “auto balance” in
> HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a
> region's size is larger than 3x the average region size, the region is
> treated as “proportionately too large”.
> hbase.table.row.textkey = true means the row key is text; false means a
> binary row key. It is used to find the middle row key in a large region.
> The default value is true.
> If (region size >= average size*ratio): cut the region into two MR input
> splits.
> If (average size <= region size < average size*ratio): one region becomes one
> MR input split.
> If (sum of the sizes of several contiguous regions < average size): combine
> these regions into one MR input split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
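The three split rules in the description can be sketched as follows. This is only an illustration of the rules as stated in the issue text, not the code from the attached patches. The class name AutoBalanceSketch, the method plan, and the string labels are hypothetical. The sketch also assumes that the average is taken over all regions of the table and that small regions are combined greedily from left to right. Where exactly a large region is cut is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the three split rules; not the committed HBase code. */
final class AutoBalanceSketch {

    /**
     * Plans input splits for regions given in key order.
     * Returns labels such as "r4[1/2]" (half of region 4), "r3" (whole region 3)
     * or "r0+r1+r2" (three contiguous regions combined into one split).
     */
    static List<String> plan(long[] regionSizes, double maxSkewRatio) {
        List<String> splits = new ArrayList<>();
        if (regionSizes.length == 0) {
            return splits;
        }
        long total = 0;
        for (long size : regionSizes) {
            total += size;
        }
        // Assumption: the average is computed over every region of the table.
        double average = (double) total / regionSizes.length;

        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (size >= average * maxSkewRatio) {
                // Rule 1: region is too large, cut it into two input splits.
                splits.add("r" + i + "[1/2]");
                splits.add("r" + i + "[2/2]");
                i++;
            } else if (size >= average) {
                // Rule 2: one region becomes one input split.
                splits.add("r" + i);
                i++;
            } else {
                // Rule 3: keep adding following regions while the combined
                // size stays below the average (greedy choice is an assumption).
                StringBuilder label = new StringBuilder("r" + i);
                long sum = size;
                i++;
                while (i < regionSizes.length && sum + regionSizes[i] < average) {
                    sum += regionSizes[i];
                    label.append("+r").append(i);
                    i++;
                }
                splits.add(label.toString());
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 10, 10, 100, 500, 10, 10};
        // prints [r0+r1+r2, r3, r4[1/2], r4[2/2], r5+r6]
        System.out.println(plan(sizes, 3.0));
    }
}
```

With these seven sizes the average is about 92.9, so the 500-unit region exceeds 3x the average and is cut in two, the 100-unit region stays a single split, and the small regions on either side are merged. Seven regions become five input splits.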
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Jonathan Hsieh updated HBASE-12590:
-----------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0
    Hadoop Flags: Reviewed
    Status: Resolved (was: Patch Available)

I've committed to master after making a cosmetic change so that it passes checkstyle.
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: (was: HBASE-12590-v4.patch)
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: HBASE-12590-v4.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: HBASE-12590-v4.patch

[~j...@cloudera.com] Thank you for your review yesterday! Please take a look at this new patch. I have made some changes based on the review comments.

Review board: https://reviews.apache.org/r/28494/diff/#
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Description:
1, Motivation
In production environments, data skew is very common. An HBase table may contain many small regions and a few large regions. Small regions waste computing resources: a job that scans a table with 3000 small regions needs 3000 mappers. Large regions hold up the job: if one region in a 100-region table is far larger than the other 99, then when we run a job with the table as input, 99 mappers complete very quickly and we wait a long time for the last mapper.
2, Configuration
Add three new configuration properties:
hbase.mapreduce.input.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a region's size is larger than 3x the average region size, the region is treated as “proportionately too large”.
hbase.table.row.textkey = true means the row key is text; false means a binary row key. It is used to find the middle row key in a large region. The default value is true.
If (region size >= average size*ratio): cut the region into two MR input splits.
If (average size <= region size < average size*ratio): one region becomes one MR input split.
If (sum of the sizes of several contiguous regions < average size): combine these regions into one MR input split.
Example:
In attachment
Welcome to the Review Board.
https://reviews.apache.org/r/28494/diff/#

  was:
1, Motivation
In production environments, data skew is very common. An HBase table always contains many small regions and a few large regions. Small regions waste computing resources: a job that scans a table with 3000 small regions needs 3000 mappers. Large regions hold up the job: if one region in a 100-region table is far larger than the other 99, then when we run a job with the table as input, 99 mappers complete very quickly and we wait a long time for the last mapper.
2, Configuration
Add two new configuration properties:
hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of mapreduce splits.
If a region's size is larger than the target size, cut the region into two splits. If the sum of the sizes of several small contiguous regions is less than the target size, combine these regions into one split.
Example:
In attachment
Welcome to the Review Board.
https://reviews.apache.org/r/28494/diff/#
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: HBASE-12590-v3.patch

[~j...@cloudera.com] Hi, would you please take a look at this new patch?

In the new patch:
1, re-designed the function for getting the split point in a large region
2, added a new mode for binary keys. The default mode is for text keys. Users can switch by setting a new configuration property: hbase.table.row.textkey
3, added new tests for both text keys and binary keys
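The v3 patch note above mentions a function that finds the split point inside a large region, with separate modes for text and binary keys. The thread does not show that function, so the following is only one plausible way to pick a middle key for binary row keys. It treats the start and end keys as unsigned big-endian numbers, pads the shorter one with zero bytes on the right, and takes the arithmetic midpoint. The class name MidKeySketch and the method midKey are hypothetical, and the sketch assumes both keys are given (it does not handle the open-ended first or last region of a table).

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical midpoint calculation for binary row keys; not the patch's code. */
final class MidKeySketch {

    /** Returns a key roughly halfway between start and end in byte order. */
    static byte[] midKey(byte[] start, byte[] end) {
        int len = Math.max(start.length, end.length);
        // Right-pad with 0x00 so both keys have the same number of bytes.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));
        BigInteger mid = lo.add(hi).shiftRight(1);

        // toByteArray() may add a leading sign byte or drop leading zeros,
        // so copy the low-order bytes into a fixed-width result.
        byte[] raw = mid.toByteArray();
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        byte[] mid = midKey(new byte[] {0x00}, new byte[] {0x10});
        // prints [8]
        System.out.println(Arrays.toString(mid));
    }
}
```

A numeric midpoint only balances the two halves when keys are spread evenly over the byte range. Text keys usually occupy a narrow band of byte values, which may be why the patch keeps a separate text mode behind hbase.table.row.textkey.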
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: HBase-12590-v2.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: (was: HBase-12590-v2.patch)
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: (was: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf)
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: HBase-12590-v2.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: (was: A Solution for Data Skew in HBase-MapReduce Job.pdf)
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated HBASE-12590:
---
    Summary: A solution for data skew in HBase-Mapreduce Job  (was: im)

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Affects Version/s: (was: 2.0.0)

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Attachment: HBase-12590-v1.patch

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Attachment: (was: HBase-12590-v1.patch)

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Description:
1, Motivation
In production environments, data skew is very common. An HBase table often contains many small regions and a few large regions. Small regions waste computing resources: a job that scans a table with 3000 small regions needs 3000 mappers. Large regions block the job: if one region in a 100-region table is far larger than the other 99, a job that takes the table as input finishes 99 mappers very quickly and then waits a long time for the last one.
2, Configuration
Add two new configuration properties.
hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) sets the target size of MapReduce splits.
If a region is larger than the target size, cut the region into two splits. If the combined size of several small contiguous regions is less than the target size, combine these regions into one split.
Example:
See the attachment.
Reviews are welcome on the Review Board:
https://reviews.apache.org/r/28494/diff/#

  was:
1, Motivation
In production environments, data skew is very common. An HBase table often contains many small regions and a few large regions. Small regions waste computing resources: a job that scans a table with 3000 small regions needs 3000 mappers. Large regions block the job: if one region in a 100-region table is far larger than the other 99, a job that takes the table as input finishes 99 mappers very quickly and then waits a long time for the last one.
2, Configuration
Add two new configuration properties.
hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) sets the target size of MapReduce splits.
If a region is larger than the target size, cut the region into two splits. If the combined size of several small contiguous regions is less than the target size, combine these regions into one split.
Example:
See the attachment.

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Status: Patch Available  (was: Open)

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Attachment: HBase-12590-v1.patch

Reviews of the code are welcome:
https://reviews.apache.org/r/28494/diff/#

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, HBase-12590-v1.patch
[jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
[ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
---
    Attachment: A Solution for Data Skew in HBase-MapReduce Job.pdf

> A solution for data skew in HBase-Mapreduce Job
> ---
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf