[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589527#comment-13589527 ]

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-Mapreduce-trunk #1358 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1358/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java

> CombineFileInputFormat node input split can be skewed on small clusters
> ---
>
> Key: MAPREDUCE-4892
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 2.0.4-beta
>
> Attachments: MAPREDUCE-4892.1.alt.patch, MAPREDUCE-4892.1.alt.patch, MAPREDUCE-4892.1.patch
>
> The CombineFileInputFormat split generation logic tries to group blocks by
> node in order to create splits. It iterates through the nodes and creates
> splits on them until there aren't enough blocks left on a node that can be
> grouped into a valid split. If the first few nodes have a lot of blocks on
> them then they can end up getting a disproportionately large share of the
> total number of splits created. This can result in poor locality of maps.
> This problem is likely to happen on small clusters where it's easier to create
> a skew in the distribution of blocks on nodes.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
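To make the skew described in the issue concrete, here is a minimal sketch of node-by-node grouping in which each node creates as many splits as it can before the next node is visited. It is a simplified model and not the CombineFileInputFormat code: the class `GreedySplitSketch`, the method `greedySplits`, and the fixed `blocksPerSplit` count (standing in for the byte-based split size limits) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of node-by-node split grouping.
// This is NOT the CombineFileInputFormat code; the names and the fixed
// blocks-per-split count are assumptions made for illustration.
class GreedySplitSketch {

    // nodeToBlocks: node -> ids of blocks that have a replica on that node.
    // The map's iteration order is the order in which nodes are visited.
    // Returns node -> number of splits created on that node.
    static Map<String, Integer> greedySplits(
            Map<String, List<Integer>> nodeToBlocks, int blocksPerSplit) {
        if (blocksPerSplit < 1) {
            throw new IllegalArgumentException("blocksPerSplit must be >= 1");
        }
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splitsPerNode = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : nodeToBlocks.entrySet()) {
            List<Integer> pending = new ArrayList<>();
            int splits = 0;
            // Keep creating splits on this node until its blocks run out.
            for (Integer block : entry.getValue()) {
                if (assigned.contains(block)) {
                    continue; // already placed in a split on an earlier node
                }
                pending.add(block);
                if (pending.size() == blocksPerSplit) {
                    assigned.addAll(pending);
                    pending.clear();
                    splits++;
                }
            }
            // Leftover blocks in 'pending' stay unassigned for later nodes.
            splitsPerNode.put(entry.getKey(), splits);
        }
        return splitsPerNode;
    }

    public static void main(String[] args) {
        // Three nodes, six blocks, every block replicated on every node.
        Map<String, List<Integer>> nodeToBlocks = new LinkedHashMap<>();
        List<Integer> allBlocks = Arrays.asList(0, 1, 2, 3, 4, 5);
        nodeToBlocks.put("node1", allBlocks);
        nodeToBlocks.put("node2", allBlocks);
        nodeToBlocks.put("node3", allBlocks);
        System.out.println(greedySplits(nodeToBlocks, 2));
        // prints {node1=3, node2=0, node3=0}
    }
}
```

In this toy input, which resembles a three-node cluster with three-way replication, the first node visited claims every split even though all three nodes hold the same blocks.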
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589503#comment-13589503 ]

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-Hdfs-trunk #1330 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1330/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589428#comment-13589428 ]

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-Yarn-trunk #141 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/141/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588631#comment-13588631 ]

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-trunk-Commit #3391 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3391/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588579#comment-13588579 ] Hadoop QA commented on MAPREDUCE-4892: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571219/MAPREDUCE-4892.1.alt.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 tests included appear to have a timeout.{color} {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3369//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3369//console This message is automatically generated. 
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566218#comment-13566218 ] Hadoop QA commented on MAPREDUCE-4892: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12567108/MAPREDUCE-4892.1.alt.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3293//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3293//console This message is automatically generated. 
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566182#comment-13566182 ]

Bikas Saha commented on MAPREDUCE-4892:
---

I meant unnecessary computations for nodes that are not going to produce more splits in every iteration. The code will try to group any remaining blocks and fail to produce a meaningful split, and it will do this again and again.

In the example, we could end up generating 5, 1, 3 or 4, 1, 4 splits by trying to be completely node-local. Or we could go 3, 1, 3 and 2 rack-local. Upon scheduling, the rack-local option would start 3, 3, 3 if there are 3 slots per node. The extra node-local ones would have to wait for nodes 1 and 3 to finish the first wave before they can run, or wait for delay scheduling to kick in and get scheduled on node 2. Which of these will be faster is not obvious.

In any case, we can debate this endlessly. I have attached an alternative patch which iterates multiple times over the nodes until all blocks have been grouped or no node is left with more than minsize of blocks, while trying to keep the number of iterations low.
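The multi-pass idea in the comment above can be sketched as follows: each pass gives every node at most one split, and passes repeat until no node can form another split. This is a simplified model and not the attached patch: the class `RoundRobinSplitSketch`, the method `roundRobinSplits`, and the fixed `blocksPerSplit` count (standing in for the byte-based minsize/maxsize checks) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of multi-pass split grouping.
// This is NOT the patch attached to the issue; the names and the fixed
// blocks-per-split count are assumptions made for illustration.
class RoundRobinSplitSketch {

    // nodeToBlocks: node -> ids of blocks that have a replica on that node.
    // Returns node -> number of splits created on that node.
    static Map<String, Integer> roundRobinSplits(
            Map<String, List<Integer>> nodeToBlocks, int blocksPerSplit) {
        if (blocksPerSplit < 1) {
            throw new IllegalArgumentException("blocksPerSplit must be >= 1");
        }
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splitsPerNode = new LinkedHashMap<>();
        for (String node : nodeToBlocks.keySet()) {
            splitsPerNode.put(node, 0);
        }
        boolean progress = true;
        // Repeat passes over the nodes until a full pass creates no split.
        while (progress) {
            progress = false;
            for (Map.Entry<String, List<Integer>> entry : nodeToBlocks.entrySet()) {
                // Collect at most one split's worth of unassigned blocks.
                List<Integer> pending = new ArrayList<>();
                for (Integer block : entry.getValue()) {
                    if (!assigned.contains(block)) {
                        pending.add(block);
                    }
                    if (pending.size() == blocksPerSplit) {
                        break;
                    }
                }
                if (pending.size() == blocksPerSplit) {
                    assigned.addAll(pending);
                    splitsPerNode.merge(entry.getKey(), 1, Integer::sum);
                    progress = true;
                }
                // Otherwise the blocks stay unassigned for other nodes.
            }
        }
        return splitsPerNode;
    }

    public static void main(String[] args) {
        // Three nodes, six blocks, every block replicated on every node.
        Map<String, List<Integer>> nodeToBlocks = new LinkedHashMap<>();
        List<Integer> allBlocks = Arrays.asList(0, 1, 2, 3, 4, 5);
        nodeToBlocks.put("node1", allBlocks);
        nodeToBlocks.put("node2", allBlocks);
        nodeToBlocks.put("node3", allBlocks);
        System.out.println(roundRobinSplits(nodeToBlocks, 2));
        // prints {node1=1, node2=1, node3=1}
    }
}
```

Note that this sketch rescans nodes that can no longer form a split on every pass, which is the repeated work the comment is concerned about. Dropping exhausted nodes from later passes would be one way to limit it, though whether the patch does so is not stated here.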
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565792#comment-13565792 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4892:
---

In my example, the maxMapperSize is one split (the default).

> In general, and especially in skewed distributions, we will be running the inner for loop (that tries to group blocks) multiple times with no useful outcome.

I don't see how there will be iterations with no outcome. We will always try to allocate to some node, or break if we cannot allocate any. OTOH, we can improve on allocating one at a time by instead trying to allocate as much as possible, but not more than the minimum across all nodes.

> Secondly, for large jobs with multi-wave mappers, the effects of non-local reads or delay scheduling might be small compared to the overall cost (including multiple runs). So this change mainly boils down to effects on single-mapper-wave jobs or small jobs.

That's not true; locality affects large jobs too, unless your definition of large means read less and process for a long time.

> In these cases, it is arguable that having a few rack-local mappers (which significantly increases the on-time start of the mapper) might be more beneficial than having the mapper wait on a machine because it was node-local to it. The likelihood of waiting is high since the machine has many mappers already assigned to it.

That's a general argument against node locality itself, one that is taken care of by other mechanisms like locality-wait/delay.
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551474#comment-13551474 ]

Bikas Saha commented on MAPREDUCE-4892:
---

I did consider the alternate implementation. Arguments against it:

In general, and especially in skewed distributions, we will be running the inner for loop (that tries to group blocks) multiple times with no useful outcome. So for, say, 10K machines, that might end up slowing down the grouping calculation.

Secondly, for large jobs with multi-wave mappers, the effects of non-local reads or delay scheduling might be small compared to the overall cost (including multiple runs). So this change mainly boils down to effects on single-mapper-wave jobs or small jobs. In these cases, it is arguable that having a few rack-local mappers (which significantly increases the on-time start of the mapper) might be more beneficial than having the mapper wait on a machine because it was node-local to it. The likelihood of waiting is high since the machine has many mappers already assigned to it.

That being said, I did not fully get your example. What is the maxMapperSize? If it's 3 blocks, then how can there be 9 splits created? If it's not 3 blocks, then why would the avg per node be = 3?
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543377#comment-13543377 ]

Chris Nauroth commented on MAPREDUCE-4892:
---

+1 for the patch

I applied the patch locally and ran {{TestCombineFileInputFormat}}. Code looks good. I can't think of any other edge cases that this patch doesn't handle.
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537523#comment-13537523 ] Hadoop QA commented on MAPREDUCE-4892: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562000/MAPREDUCE-4892.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3152//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3152//console This message is automatically generated. 