[ https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698271#comment-14698271 ]
Hudson commented on MAPREDUCE-5817:
-----------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2234 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2234/])
MAPREDUCE-5817. Mappers get rescheduled on node transition even after all reducers are completed. (Sangjin Lee via kasha) (kasha: rev 27d24f96ab8d17e839a1ef0d7076efc78d28724a)
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java

> Mappers get rescheduled on node transition even after all reducers are completed
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5817
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.3.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>             Fix For: 2.8.0
>
>         Attachments: MAPREDUCE-5817.001.patch, MAPREDUCE-5817.002.patch, mapreduce-5817.patch
>
>
> We're seeing a behavior where a job keeps running long after all of its reducers have finished. We found that the job was rescheduling and rerunning a number of mappers beyond the point of reducer completion. In one case, the job ran for some 9 more hours after all reducers had completed!
> This happens because whenever a node transition (to an unusable state) comes into the app master, the app master unconditionally reschedules all mappers that already ran on that node.
> Therefore, any node transition has the potential to extend the job's running time. Once this window opens, another node transition can prolong it, and in theory this can continue indefinitely.
> If there is some instability in the node pool (unhealthy nodes, etc.) for a period of time, any big job is severely vulnerable to this problem.
> If all reducers have completed, JobImpl.actOnUnusableNode() should not reschedule mapper tasks: the mapper outputs are no longer needed at that point, so rescheduled mappers would produce output that is never consumed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
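For illustration, below is a minimal, self-contained sketch of the guard described in the issue. It is not the actual JobImpl.actOnUnusableNode() implementation or the committed patch; the class, field, and method names (UnusableNodeHandlerSketch, succeededMapTasks, completedReduces, etc.) are simplified placeholders. It only captures the idea that succeeded map tasks on a newly unusable node need to be rerun while reducers that could still fetch their output remain incomplete, and not afterwards.

```java
// Illustrative sketch only -- not the real MR app-master code.
// All types and names here are simplified stand-ins for JobImpl internals.
import java.util.ArrayList;
import java.util.List;

public class UnusableNodeHandlerSketch {

  /** Minimal stand-in for a map task attempt that succeeded on some node. */
  static final class MapTask {
    final String id;
    final String nodeId;
    MapTask(String id, String nodeId) {
      this.id = id;
      this.nodeId = nodeId;
    }
  }

  private final List<MapTask> succeededMapTasks = new ArrayList<>();
  private final int totalReduces;
  private int completedReduces;

  UnusableNodeHandlerSketch(int totalReduces) {
    this.totalReduces = totalReduces;
  }

  void mapSucceeded(String taskId, String nodeId) {
    succeededMapTasks.add(new MapTask(taskId, nodeId));
  }

  void reduceCompleted() {
    completedReduces++;
  }

  /**
   * Called when a node transitions to an unusable state. Returns the map
   * tasks whose output lived on that node and should be rerun. The point
   * from MAPREDUCE-5817: once every reducer has completed, map output on
   * the lost node can no longer be consumed, so nothing needs rerunning.
   */
  List<MapTask> actOnUnusableNode(String nodeId) {
    List<MapTask> toReschedule = new ArrayList<>();
    boolean allReducersDone = totalReduces > 0 && completedReduces >= totalReduces;
    if (allReducersDone) {
      // No reducer will ever fetch map output again; skip rescheduling.
      return toReschedule;
    }
    for (MapTask task : succeededMapTasks) {
      if (task.nodeId.equals(nodeId)) {
        toReschedule.add(task); // output lost and still needed -> rerun
      }
    }
    return toReschedule;
  }

  public static void main(String[] args) {
    UnusableNodeHandlerSketch job = new UnusableNodeHandlerSketch(2);
    job.mapSucceeded("m_000001", "node-1");
    job.mapSucceeded("m_000002", "node-2");
    job.reduceCompleted();
    // One reducer still running: map output on node-1 must be regenerated.
    System.out.println("reschedule: " + job.actOnUnusableNode("node-1").size()); // 1
    job.reduceCompleted();
    // All reducers done: losing node-2 no longer matters.
    System.out.println("reschedule: " + job.actOnUnusableNode("node-2").size()); // 0
  }
}
```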