[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589428#comment-13589428 ]
Hudson commented on MAPREDUCE-4892:
-----------------------------------

Integrated in Hadoop-Yarn-trunk #141 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/141/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java

> CombineFileInputFormat node input split can be skewed on small clusters
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4892
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>             Fix For: 2.0.4-beta
>
>         Attachments: MAPREDUCE-4892.1.alt.patch, MAPREDUCE-4892.1.alt.patch,
> MAPREDUCE-4892.1.patch
>
>
> The CombineFileInputFormat split generation logic tries to group blocks by
> node in order to create splits. It iterates through the nodes and creates
> splits on them until there aren't enough blocks left on a node that can be
> grouped into a valid split. If the first few nodes have a lot of blocks on
> them then they can end up getting a disproportionately large share of the
> total number of splits created. This can result in poor locality of maps.
> This problem is likely to happen on small clusters where it's easier to create
> a skew in the distribution of blocks on nodes.

--
This message is automatically generated by JIRA.
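The skew described in the issue above can be reproduced with a small, self-contained model. This is only a sketch under stated assumptions, not the code from CombineFileInputFormat or from the attached patches: the class SplitSkewSketch, its methods, the node names, and the "fixed number of blocks per split" rule are all hypothetical. The greedy method drains each node before moving on, as the description outlines; onePerPass is one possible remedy shown for contrast, creating at most one split per node on each pass over the nodes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: a block is an int id, a node maps to the ids it hosts,
// and a split is "valid" once it holds exactly blocksPerSplit blocks.
class SplitSkewSketch {

    // Small cluster where every node holds a replica of every block.
    static Map<String, List<Integer>> fullyReplicated(int nodes, int blocks) {
        Map<String, List<Integer>> cluster = new LinkedHashMap<>();
        for (int n = 1; n <= nodes; n++) {
            List<Integer> ids = new ArrayList<>();
            for (int b = 0; b < blocks; b++) {
                ids.add(b);
            }
            cluster.put("node" + n, ids);
        }
        return cluster;
    }

    // Greedy: keep creating splits on a node until it cannot fill another one,
    // then move to the next node. Returns the number of splits per node.
    static Map<String, Integer> greedy(Map<String, List<Integer>> cluster, int blocksPerSplit) {
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splits = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : cluster.entrySet()) {
            int count = 0;
            List<Integer> pending = new ArrayList<>();
            for (int b : e.getValue()) {
                if (assigned.contains(b)) {
                    continue;
                }
                pending.add(b);
                if (pending.size() == blocksPerSplit) {
                    assigned.addAll(pending);
                    pending.clear();
                    count++;
                }
            }
            // Blocks left in pending stay unassigned in this model.
            splits.put(e.getKey(), count);
        }
        return splits;
    }

    // Contrast: at most one split per node per pass, repeating passes
    // until no node can fill a split any more.
    static Map<String, Integer> onePerPass(Map<String, List<Integer>> cluster, int blocksPerSplit) {
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splits = new LinkedHashMap<>();
        for (String node : cluster.keySet()) {
            splits.put(node, 0);
        }
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Map.Entry<String, List<Integer>> e : cluster.entrySet()) {
                List<Integer> pending = new ArrayList<>();
                for (int b : e.getValue()) {
                    if (!assigned.contains(b)) {
                        pending.add(b);
                        if (pending.size() == blocksPerSplit) {
                            break;
                        }
                    }
                }
                if (pending.size() == blocksPerSplit) {
                    assigned.addAll(pending);
                    splits.merge(e.getKey(), 1, Integer::sum);
                    progress = true;
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> cluster = fullyReplicated(3, 12);
        System.out.println("greedy:     " + greedy(cluster, 2));
        System.out.println("onePerPass: " + onePerPass(cluster, 2));
    }
}
```

With three nodes that each host all twelve blocks and two blocks per split, the greedy variant places all six splits on node1, while the one-per-pass variant gives each node two. Real clusters have partial replication and rack-level fallback, which this model leaves out.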