[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4833:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.23.6
                   2.0.3-alpha
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Thanks, Robert. I committed this to trunk, branch-2, and branch-0.23.

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> --------------------------------------------
>
>                 Key: MAPREDUCE-4833
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Parker
>            Priority: Critical
>             Fix For: 2.0.3-alpha, 0.23.6
>
>         Attachments: MAPREDUCE4833-1.patch, MAPREDUCE4833-2.patch,
> MAPREDUCE4833.patch
>
>
> If an NM goes down and the AM still tries to launch a container on it the
> ContainerLauncherImpl can get stuck in an RPC timeout. At the same time the
> RM may notice that the NM has gone away and inform the AM of this, this
> triggers a TA_FAILMSG. If the TA_FAILMSG arrives at the TaskAttemptImpl
> before the TA_CONTAINER_LAUNCH_FAILED message then the task attempt will try
> to kill the container, but the ContainerLauncherImpl will not send back a
> TA_CONTAINER_CLEANED event causing the attempt to be stuck.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Attachment: MAPREDUCE4833-2.patch

Actually saving the file before creating the patch this time.
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Attachment:     (was: MAPREDUCE4833-23.patch)
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Attachment: MAPREDUCE4833-1.patch

Fixed compiler warnings.
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Attachment: MAPREDUCE4833.patch

Added a test case. The test fails without the fix and passes with the fix.
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Release Note:   (was: Previously the Container did not send an event on kill if it was DONE, and returned (essentially a no-op). This patch will send a TA_CONTAINER_CLEANED event in all cases.)
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Attachment: MAPREDUCE4833-23.patch
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Release Note: Previously the Container did not send an event on kill if it was DONE, and returned (essentially a no-op). This patch will send a TA_CONTAINER_CLEANED event in all cases.
          Status: Patch Available  (was: Open)
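
The race in the issue description and the behaviour change in the release note can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Hadoop code or the committed patch: the class ContainerKillSketch, the Container and Attempt types, the simulate method, the alwaysSendCleaned flag, and the ASSIGNED, LAUNCHING and CLEANUP_DONE states are hypothetical names introduced for illustration. Only the event names, FAIL_CONTAINER_CLEANUP, and DONE come from the messages above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the race; none of this is taken from the Hadoop source tree.
final class ContainerKillSketch {
    // DONE is named in the release note; LAUNCHING is an assumed placeholder.
    enum ContainerState { LAUNCHING, DONE }

    // FAIL_CONTAINER_CLEANUP is from the issue title; the others are hypothetical.
    enum AttemptState { ASSIGNED, FAIL_CONTAINER_CLEANUP, CLEANUP_DONE }

    // Event names as they appear in the issue description.
    enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    /** Stand-in for the launcher's per-container bookkeeping. */
    static final class Container {
        ContainerState state = ContainerState.LAUNCHING;
        final boolean alwaysSendCleaned; // hypothetical switch: false = old, true = patched

        Container(boolean alwaysSendCleaned) {
            this.alwaysSendCleaned = alwaysSendCleaned;
        }

        /** The launch RPC to the dead NM finally times out. */
        void launchFailed(Queue<Event> toAttempt) {
            state = ContainerState.DONE;
            toAttempt.add(Event.TA_CONTAINER_LAUNCH_FAILED);
        }

        /** Handles a kill request from the task attempt. */
        void kill(Queue<Event> toAttempt) {
            if (state == ContainerState.DONE && !alwaysSendCleaned) {
                return; // old behaviour: silently do nothing, no event is sent
            }
            state = ContainerState.DONE;
            toAttempt.add(Event.TA_CONTAINER_CLEANED);
        }
    }

    /** Stand-in for the task attempt's state machine. */
    static final class Attempt {
        AttemptState state = AttemptState.ASSIGNED;

        /** Returns true when the attempt asks the launcher to kill the container. */
        boolean handle(Event event) {
            if (state == AttemptState.ASSIGNED && event == Event.TA_FAILMSG) {
                state = AttemptState.FAIL_CONTAINER_CLEANUP;
                return true;
            }
            if (state == AttemptState.FAIL_CONTAINER_CLEANUP
                    && event == Event.TA_CONTAINER_CLEANED) {
                state = AttemptState.CLEANUP_DONE;
            }
            return false; // every other event is ignored in this model
        }
    }

    /** Replays the ordering from the issue description and returns the final attempt state. */
    static AttemptState simulate(boolean alwaysSendCleaned) {
        Queue<Event> toAttempt = new ArrayDeque<>();
        Container container = new Container(alwaysSendCleaned);
        Attempt attempt = new Attempt();

        // 1. The RM reports the lost NM, so TA_FAILMSG reaches the attempt first.
        boolean killRequested = attempt.handle(Event.TA_FAILMSG);
        // 2. Only now does the launch RPC time out.
        container.launchFailed(toAttempt);
        // 3. The launcher gets around to the kill request for the finished container.
        if (killRequested) {
            container.kill(toAttempt);
        }
        // 4. The attempt drains whatever the launcher sent back.
        while (!toAttempt.isEmpty()) {
            attempt.handle(toAttempt.remove());
        }
        return attempt.state;
    }

    public static void main(String[] args) {
        System.out.println("kill is a no-op when DONE: " + simulate(false));
        System.out.println("kill always sends cleaned: " + simulate(true));
    }
}
```

In this model the attempt only leaves FAIL_CONTAINER_CLEANUP when it receives TA_CONTAINER_CLEANED, so a kill that returns without sending anything leaves it waiting indefinitely, while sending the event unconditionally lets it move on.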
[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP
[ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

    Assignee: Robert Parker