[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190743#comment-16190743 ]
Hudson commented on HBASE-16894: -------------------------------- FAILURE: Integrated in Jenkins build HBase-1.5 #84 (See [https://builds.apache.org/job/HBase-1.5/84/]) HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 (apurtell: rev fc783ef04505eab7e58c6abc3ac1f7d7ecce465b) * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/namespace/TestNamespaceAuditor.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java > Create more than 1 split per region, generalize HBASE-12590 > ----------------------------------------------------------- > > Key: HBASE-16894 > URL: https://issues.apache.org/jira/browse/HBASE-16894 > Project: HBase > Issue Type: Improvement > Affects Versions: 3.0.0, 2.0.0-alpha-2 > Reporter: Enis Soztutar > Assignee: Yi Liang > Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0 > > Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, > HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, > ImplementaionAndSomeQuestion.docx > > > A common request from users is to be able to better control how many map > tasks are created per region. Right now, it is always 1 region = 1 input > split = 1 map task. Same goes for Spark since it uses the TIF. With region > sizes as large as 50 GBs, it is desirable to be able to create more than 1 > split per region. > HBASE-12590 adds a config property for MR jobs to be able to handle skew in > region sizes. The algorithm is roughly: > {code} > If (region size >= average size*ratio) : cut the region into two MR input > splits > If (average size <= region size < average size*ratio) : one region as one MR > input split > If (sum of several continuous regions size < average size * ratio): combine > these regions into one MR input split. > {code} > Although we can set data skew ratio to be 0.5 or something to abuse > HBASE-12590 into creating more than 1 split task per region, it is not ideal. > But there is no way to create more with the patch as it is. For example we > cannot create more than 2 tasks per region. > If we want to fix this properly, we should extend the approach in > HBASE-12590, and make it so that the client can specify the desired num of > mappers, or desired split size, and the TIF generates the splits based on the > current region sizes very similar to the algorithm in HBASE-12590, but a more > generic way. This also would eliminate the hand tuning of data skew ratio. > We also can think about the guidepost approach that Phoenix has in the stats > table which is used for exactly this purpose. Right now, the region can be > split into powers of two assuming uniform distribution within the region. -- This message was sent by Atlassian JIRA (v6.4.14#64029)