[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584174#comment-13584174 ]

Hudson commented on MAPREDUCE-4951:
---

Integrated in Hadoop-Yarn-trunk #135 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/135/])
MAPREDUCE-4951. Container preemption interpreted as task failure. Contributed by Sandy Ryza. (Revision 1448615)

Result = SUCCESS
tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448615
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

Container preemption interpreted as task failure

Key: MAPREDUCE-4951
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4951
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: applicationmaster, mr-am, mrv2
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Fix For: 2.0.4-beta
Attachments: MAPREDUCE-4951-1.patch, MAPREDUCE-4951-2.patch, MAPREDUCE-4951.patch

When YARN reports a completed container to the MR AM, it always interprets it as a failure. This can lead to a job failing because too many of its tasks failed, when in fact they only failed because the scheduler preempted them. MR needs to recognize the special exit code value of -100 and interpret it as a container being killed instead of a container failure.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584238#comment-13584238 ]

Hudson commented on MAPREDUCE-4951:
---

Integrated in Hadoop-Hdfs-trunk #1324 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1324/])
MAPREDUCE-4951. Container preemption interpreted as task failure. Contributed by Sandy Ryza. (Revision 1448615)

Result = FAILURE
tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448615
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583138#comment-13583138 ]

Hudson commented on MAPREDUCE-4951:
---

Integrated in Hadoop-trunk-Commit #3372 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3372/])
MAPREDUCE-4951. Container preemption interpreted as task failure. Contributed by Sandy Ryza. (Revision 1448615)

Result = SUCCESS
tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448615
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583177#comment-13583177 ]

Hudson commented on MAPREDUCE-4951:
---

Integrated in Hadoop-Mapreduce-trunk #1351 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1351/])
MAPREDUCE-4951. Container preemption interpreted as task failure. Contributed by Sandy Ryza. (Revision 1448615)

Result = FAILURE
tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448615
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562607#comment-13562607 ]

Tom White commented on MAPREDUCE-4951:
---

+1 on the latest patch.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562142#comment-13562142 ]

Jason Lowe commented on MAPREDUCE-4951:
---

Agree that solving MAPREDUCE-4955 is separate, sorry for the extra noise. I just wanted to point out that even with this patch there will still be spurious failures if the task notifies the AM before the AM sees the container status from the RM.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560736#comment-13560736 ]

Jason Lowe commented on MAPREDUCE-4951:
---

bq. having the RM ask the AM to kill the container in case of preemption would likely not work as the AM cannot be trusted.

Agreed, I was thinking of exactly the alternative you propose, where preemption potentially has two phases: a "please, AM, preempt that container you have" request, backed by a watchdog timer so that the RM kills the container forcefully if the AM does not comply in a reasonable amount of time. This eliminates the race where the container can fail because of the preemption, and it gives the AM a chance to checkpoint the state of the container for faster recovery. However, it does mean the mean latency for container availability would be higher, since the AM will have a grace period before relinquishing the resources.
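The two-phase scheme discussed above can be sketched roughly as follows. This is a minimal illustration of the watchdog decision only, not YARN code: the class name `TwoPhasePreemption`, the `Action` enum, and the millisecond grace period are all hypothetical names introduced for this example.

```java
// Hypothetical sketch of the RM-side watchdog in a two-phase preemption scheme.
// None of these names exist in YARN; they are assumptions for illustration.
final class TwoPhasePreemption {
    enum Action { WAIT, NOTHING_TO_DO, FORCE_KILL }

    private final long requestedAtMillis; // when the RM asked the AM to release the container
    private final long graceMillis;       // how long the AM is given to comply

    TwoPhasePreemption(long requestedAtMillis, long graceMillis) {
        this.requestedAtMillis = requestedAtMillis;
        this.graceMillis = graceMillis;
    }

    // Decides what the watchdog should do when it looks at the container at nowMillis.
    Action check(long nowMillis, boolean containerStillRunning) {
        if (!containerStillRunning) {
            return Action.NOTHING_TO_DO; // the AM complied (or the container ended on its own)
        }
        if (nowMillis - requestedAtMillis >= graceMillis) {
            return Action.FORCE_KILL;    // grace period expired; the RM does not trust the AM further
        }
        return Action.WAIT;              // still inside the grace period
    }
}
```

The grace period is exactly the extra latency mentioned in the comment: in the worst case the resources become available only `graceMillis` after the request, instead of immediately.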
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560903#comment-13560903 ]

Bikas Saha commented on MAPREDUCE-4951:
---

We might be digressing from this jira here, but I really don't think the 2-step approach is worth its complexity. The main scenario where it makes sense is when the task has the ability to checkpoint its work before getting preempted. I haven't seen this capability outside of basic research prototypes. It's much simpler to have preemption be an RM-only action. We do need to fix the action and information loop so that AMs can get correct information about the infrastructure's actions.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561449#comment-13561449 ]

Sandy Ryza commented on MAPREDUCE-4951:
---

It doesn't seem to me that either approach would conflict with this patch at the moment. While this code might get rewritten in the future, under the current preemption mechanism, when MR is explicitly told that a container was preempted, it should not count it as failed. Does anybody disagree?
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559593#comment-13559593 ]

Tom White commented on MAPREDUCE-4951:
---

The change looks good to me, but shouldn't the other exit codes be covered too, or are they already being treated as task killed? The ones mentioned above plus -1000 (INVALID_CONTAINER_EXIT_STATUS) and -101 (DISKS_FAILED).

Also, it looks like you added testTaskPreemption without any code.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559849#comment-13559849 ]

Bikas Saha commented on MAPREDUCE-4951:
---

From what I see in the code, both killing due to exceeding memory and killing under command from the RM (preemption) eventually end up sending a ContainerKillEvent, and the two cases are differentiated only by a String diagnostic message. That event ends up causing a signal to be sent to the actual running container. Based on that, I am not very sure that the exit codes are being explicitly used by the NM to differentiate between RM killings, memory killings, etc.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559978#comment-13559978 ]

Jason Lowe commented on MAPREDUCE-4951:
---

As the comment in FairScheduler.preemptResources states, I too am unsure whether the preemption is translated into a kill command to the NM by the RM directly, or whether the scheduler is relying on the AM to see the finished container status from the RM and issue the kill to the NM. If it's the latter, then the container will be killed after the AM has already determined the container status correctly.

If the RM really is cleaning up the container and turning that into a kill command for the NM, then we've got problems. The task itself could fail as the JVM tears down from a kill command and report that failure to the AM via the task umbilical *before* the AM discovers via the heartbeat to the RM that the container was preempted. A similar race occurs now when an NM kills a container for being over limits, see MAPREDUCE-4955.
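The race described above can be made concrete with a small model. This is an illustrative sketch only: `FirstEventWinsAttempt` is a hypothetical stand-in for an AM-side task attempt in which the first terminal event decides the outcome, which is the behaviour the comment worries about. It is not the real TaskAttemptImpl state machine.

```java
// Hypothetical model of an attempt whose outcome is fixed by whichever
// terminal event reaches the AM first. All names are assumptions.
final class FirstEventWinsAttempt {
    enum State { RUNNING, FAILED, KILLED }

    // Exit status quoted in this thread for a container killed by the RM.
    private static final int ABORTED = -100;

    private State state = State.RUNNING;

    // The task reports its own failure over the umbilical.
    void onTaskReportedFailure() {
        if (state == State.RUNNING) {
            state = State.FAILED;
        }
    }

    // The RM reports the completed container on the AM heartbeat.
    void onContainerCompleted(int exitStatus) {
        if (state == State.RUNNING) {
            state = (exitStatus == ABORTED) ? State.KILLED : State.FAILED;
        }
    }

    State state() {
        return state;
    }
}
```

If the container status arrives first, the attempt ends up KILLED. If the dying JVM manages to report a failure first, the same preemption ends up as FAILED, and the later -100 status is ignored.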
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560031#comment-13560031 ]

Bikas Saha commented on MAPREDUCE-4951:
---

I think it's the RM killing the container via the NM. The RM kill command ends up sending a list of containers to clean up in the NM heartbeat. The NM kills the containers in that list by sending a container_kill event to each container.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560090#comment-13560090 ]

Sandy Ryza commented on MAPREDUCE-4951:
---

Tom,
Regarding the other special exit codes, my opinion is that they don't merit the same treatment. In general, and if I understand correctly how things worked in MR1, failed tasks should be considered guilty until proven innocent, with innocent meaning killed explicitly by the RM, and guilty meaning anything else.

Bikas,
That's correct that a ContainerKillEvent is issued in both cases. However, if I understand correctly, when a container is explicitly killed by the RM, the special value of -100 is reported to the AM instead of any exit code reported by the NM. You can look for references to YarnConfiguration.ABORTED_CONTAINER_EXIT_STATUS to see when/how this works.
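The "guilty until proven innocent" rule above can be sketched as a small classifier. This is an illustration under stated assumptions, not the code from the attached patches: the class `ContainerExitClassifier` and its methods are hypothetical, and the constants are redefined locally from the values quoted in this thread (-100, -101, -1000) so the example compiles without Hadoop on the classpath.

```java
// Illustrative sketch; not the code from the MAPREDUCE-4951 patch.
final class ContainerExitClassifier {
    // Values quoted in this thread. The thread refers to -100 as
    // YarnConfiguration.ABORTED_CONTAINER_EXIT_STATUS.
    static final int ABORTED = -100;
    static final int DISKS_FAILED = -101;
    static final int INVALID = -1000;

    enum Outcome { KILLED, FAILED }

    // Only an explicit RM kill (-100) is "innocent"; everything else is a failure.
    static Outcome classify(int exitStatus) {
        return exitStatus == ABORTED ? Outcome.KILLED : Outcome.FAILED;
    }

    // Counts only the completions that should go against a failure limit.
    static int countFailures(int[] exitStatuses) {
        int failures = 0;
        for (int status : exitStatuses) {
            if (classify(status) == Outcome.FAILED) {
                failures++;
            }
        }
        return failures;
    }
}
```

For example, `countFailures(new int[] { ABORTED, ABORTED, 1, DISKS_FAILED })` counts two failures: the two preempted containers are excluded, while the ordinary non-zero exit and the disk failure still count, in line with the position taken in the comment.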
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560107#comment-13560107 ]

Jason Lowe commented on MAPREDUCE-4951:
---

bq. That's correct that a ContainerKillEvent is issued in both cases. However, if I understand correctly, when a container is explicitly killed by the RM, the special value of -100 is reported to the AM instead of any exit code reported by the NM.

If the RM is indeed telling the NM to kill the container then we would have a race with tasks failing due to the kill-shutdown notifying the AM before the AM sees the container status from the RM.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560111#comment-13560111 ]

Sandy Ryza commented on MAPREDUCE-4951:
---

bq. If the RM is indeed telling the NM to kill the container then we would have a race with tasks failing due to the kill-shutdown notifying the AM before the AM sees the container status from the RM.

Oh, I didn't realize that. Should I file a YARN JIRA for that? Or is it something that MR should be handling?
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560116#comment-13560116 ]

Jason Lowe commented on MAPREDUCE-4951:
---

Arguably it's yet another instance of the race already covered by MAPREDUCE-4955 as I mentioned above.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560123#comment-13560123 ]

Jason Lowe commented on MAPREDUCE-4951:
---

Note that I'm not sure whether the fix belongs in YARN or should be left to the AM to sort out. YARN could implement preemption by asking the AM to kill the container on the scheduler's behalf (so the AM definitely knows why the container is being killed, since it's the one giving the final order to the NM), or the AM could work around the race by waiting for the final container status even though the task reported failure. There are some issues to work out wrt. failure modes, e.g. the AM loses connectivity to the NM, etc.
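The second option above, where the AM waits for the final container status before deciding, could look roughly like this. It is a sketch under assumptions, not a proposed patch: `DeferredVerdictAttempt`, its methods, and the wait timeout are hypothetical. The timeout is one possible answer to the failure modes mentioned (for example, the status never arriving); the thread does not settle on one.

```java
// Hypothetical sketch: defer the failed/killed decision until the RM's
// container status arrives, or until a wait limit expires.
final class DeferredVerdictAttempt {
    enum Verdict { PENDING, FAILED, KILLED }

    // Exit status quoted in this thread for a container killed by the RM.
    private static final int ABORTED = -100;

    private final long waitMillis;         // assumed limit on how long to wait for the RM
    private boolean taskReportedFailure;
    private long failureReportedAtMillis;
    private Integer containerExitStatus;   // null until the RM reports the container

    DeferredVerdictAttempt(long waitMillis) {
        this.waitMillis = waitMillis;
    }

    // Failure notification from the task itself (umbilical); recorded, not acted on yet.
    void onTaskReportedFailure(long nowMillis) {
        if (!taskReportedFailure) {
            taskReportedFailure = true;
            failureReportedAtMillis = nowMillis;
        }
    }

    // Final container status from the RM; the first report is kept.
    void onContainerCompleted(int exitStatus) {
        if (containerExitStatus == null) {
            containerExitStatus = exitStatus;
        }
    }

    Verdict verdict(long nowMillis) {
        if (containerExitStatus != null) {
            // The RM's view is authoritative: -100 means killed, not failed.
            return containerExitStatus == ABORTED ? Verdict.KILLED : Verdict.FAILED;
        }
        if (taskReportedFailure && nowMillis - failureReportedAtMillis >= waitMillis) {
            return Verdict.FAILED; // gave up waiting; fall back to the task's own report
        }
        return Verdict.PENDING;
    }
}
```

The trade-off is that a genuine task failure is acknowledged later, by up to `waitMillis`, in exchange for not charging preempted attempts against the job.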
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560174#comment-13560174 ]

Hitesh Shah commented on MAPREDUCE-4951:
---

@Jason, having the RM ask the AM to kill the container in case of preemption would likely not work as the AM cannot be trusted. Obviously, there could be a different approach where the RM informs the AM that a particular container will be preempted soon but the RM eventually would need to trigger a kill for that container after a certain delay if it is still up.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560265#comment-13560265 ]
Sandy Ryza commented on MAPREDUCE-4951:
Uploaded a patch that removes the vestigial testTaskPreemption.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560281#comment-13560281 ]
Hadoop QA commented on MAPREDUCE-4951:
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566059/MAPREDUCE-4951-2.patch against trunk revision .
    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 1 new or modified test files.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.
    +1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3264//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3264//console
This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558987#comment-13558987 ]
Bikas Saha commented on MAPREDUCE-4951:
Will that differentiate between preemption killing and resource (e.g. out of memory) killing?
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559023#comment-13559023 ]
Sandy Ryza commented on MAPREDUCE-4951:
I believe in that case the exit code will be FORCE_KILLED(137) or TERMINATED(143) (from ContainerExecutor.java).
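For illustration, the distinction discussed in the two comments above could be expressed along these lines. This is a sketch only, not the patch: the class and method names are hypothetical, the constants are local stand-ins for whatever Hadoop defines, and the treatment of 137 and 143 as resource kills rests on the belief stated above rather than on verified behavior. (137 and 143 are the conventional 128 + signal number codes for SIGKILL and SIGTERM.)

```java
// Hypothetical classifier for container exit statuses reported to the AM.
final class ExitStatusClassifier {
    enum Cause { SUCCEEDED, ABORTED, KILLED_BY_SIGNAL, FAILED }

    // Local stand-ins for the values named in this thread.
    static final int ABORTED_EXIT_STATUS = -100; // preempted by the scheduler
    static final int FORCE_KILLED = 137;         // 128 + 9 (SIGKILL)
    static final int TERMINATED = 143;           // 128 + 15 (SIGTERM)

    static Cause classify(int exitStatus) {
        switch (exitStatus) {
            case 0:
                return Cause.SUCCEEDED;
            case ABORTED_EXIT_STATUS:
                return Cause.ABORTED;
            case FORCE_KILLED:
            case TERMINATED:
                return Cause.KILLED_BY_SIGNAL;
            default:
                return Cause.FAILED;
        }
    }

    // Assumption: only preemption (-100) is exempt from the job's failure limit;
    // signal kills such as out-of-memory enforcement still count as failures.
    static boolean countsTowardFailureLimit(int exitStatus) {
        Cause cause = classify(exitStatus);
        return cause == Cause.KILLED_BY_SIGNAL || cause == Cause.FAILED;
    }
}
```

Under these assumptions a preempted container and an out-of-memory kill land in different buckets, which is the differentiation asked about above.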
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559160#comment-13559160 ]
Sandy Ryza commented on MAPREDUCE-4951:
The new patch includes a test and uses the constant from YarnConfiguration instead of a hardcoded -100.
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559170#comment-13559170 ]
Hadoop QA commented on MAPREDUCE-4951:
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12565866/MAPREDUCE-4951-1.patch against trunk revision .
    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 1 new or modified test files.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.
    +1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3260//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3260//console
[jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558506#comment-13558506 ]
Hadoop QA commented on MAPREDUCE-4951:
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12565727/MAPREDUCE-4951.patch against trunk revision .
    +1 @author. The patch does not contain any @author tags.
    -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.
    +1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3256//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3256//console