[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-21 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4833:
----------------------------------

   Resolution: Fixed
Fix Version/s: 0.23.6
               2.0.3-alpha
 Hadoop Flags: Reviewed
       Status: Resolved  (was: Patch Available)

Thanks, Robert.  I committed this to trunk, branch-2, and branch-0.23.

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Fix For: 2.0.3-alpha, 0.23.6
>
> Attachments: MAPREDUCE4833-1.patch, MAPREDUCE4833-2.patch, 
> MAPREDUCE4833.patch
>
>
> If an NM goes down and the AM still tries to launch a container on it, the 
> ContainerLauncherImpl can get stuck in an RPC timeout.  At the same time, the 
> RM may notice that the NM has gone away and inform the AM, which 
> triggers a TA_FAILMSG.  If the TA_FAILMSG arrives at the TaskAttemptImpl 
> before the TA_CONTAINER_LAUNCH_FAILED message, the task attempt will try 
> to kill the container, but the ContainerLauncherImpl will not send back a 
> TA_CONTAINER_CLEANED event, leaving the attempt stuck.
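
The event ordering described above can be illustrated with a small toy
state machine. This is only a sketch under stated assumptions: the class
StuckAttemptSketch, its next() and run() helpers, and the reduced set of
states and transitions are hypothetical simplifications, not the actual
TaskAttemptImpl code. Only the event and state names are taken from the
issue text.

```java
// Hypothetical, simplified model of a task attempt; NOT the real Hadoop classes.
class StuckAttemptSketch {
    enum State { ASSIGNED, FAIL_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    // Assumed transition table (a reduced version of the behavior in the report).
    static State next(State s, Event e) {
        switch (s) {
            case ASSIGNED:
                // A launch failure ends the attempt directly.
                if (e == Event.TA_CONTAINER_LAUNCH_FAILED) return State.FAILED;
                // A failure message asks for a container kill and waits for cleanup.
                if (e == Event.TA_FAILMSG) return State.FAIL_CONTAINER_CLEANUP;
                return s;
            case FAIL_CONTAINER_CLEANUP:
                // Only the cleanup confirmation lets the attempt leave this state.
                if (e == Event.TA_CONTAINER_CLEANED) return State.FAILED;
                return s;
            default:
                return s;
        }
    }

    // Feed events in order to a fresh attempt and return its final state.
    static State run(Event... events) {
        State s = State.ASSIGNED;
        for (Event e : events) {
            s = next(s, e);
        }
        return s;
    }

    public static void main(String[] args) {
        // Launch failure arrives first: the attempt finishes.
        System.out.println(run(Event.TA_CONTAINER_LAUNCH_FAILED));
        // TA_FAILMSG wins the race and no cleanup event ever follows: stuck.
        System.out.println(run(Event.TA_FAILMSG, Event.TA_CONTAINER_LAUNCH_FAILED));
        // Same race, but the launcher confirms cleanup: the attempt finishes.
        System.out.println(run(Event.TA_FAILMSG,
                               Event.TA_CONTAINER_LAUNCH_FAILED,
                               Event.TA_CONTAINER_CLEANED));
    }
}
```

In this model the second sequence never leaves FAIL_CONTAINER_CLEANUP,
which corresponds to the hang reported in this issue.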

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-21 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Attachment: MAPREDUCE4833-2.patch

Actually saving the file before creating the patch this time.

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE4833-1.patch, MAPREDUCE4833-2.patch, 
> MAPREDUCE4833.patch



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-21 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Attachment: (was: MAPREDUCE4833-23.patch)

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE4833-1.patch, MAPREDUCE4833.patch



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-21 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Attachment: MAPREDUCE4833-1.patch

Fixed compiler warnings.

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE4833-1.patch, MAPREDUCE4833.patch



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-21 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Attachment: MAPREDUCE4833.patch

Added a test case.  The test fails without the fix and passes with it.

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE4833-23.patch, MAPREDUCE4833.patch



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-18 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Release Note:   (was: Previously the Container did not send an event on 
kill if it was DONE, and returned (essentially a no-op). This patch will send a 
TA_CONTAINER_CLEANED event in all cases.)

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE4833-23.patch



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-18 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Attachment: MAPREDUCE4833-23.patch

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE4833-23.patch



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-18 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Release Note: Previously, the Container did not send an event on kill if it 
was DONE; it simply returned (essentially a no-op). This patch sends a 
TA_CONTAINER_CLEANED event in all cases.
      Status: Patch Available  (was: Open)
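
The behavior change in the release note can be sketched as follows. This
is a minimal illustration, not the committed patch: LauncherKillSketch,
its Container class, the notifyWhenDone flag, the launchFailed() helper,
and the string event log are all hypothetical stand-ins chosen for the
example. Only the DONE state and the TA_CONTAINER_CLEANED and
TA_CONTAINER_LAUNCH_FAILED event names come from the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the launcher-side container bookkeeping.
class LauncherKillSketch {
    enum ContainerState { PREP, RUNNING, DONE }

    static class Container {
        ContainerState state = ContainerState.PREP;
        // false models the old behavior, true models the patched behavior.
        final boolean notifyWhenDone;
        // Events "sent" back to the task attempt, recorded for inspection.
        final List<String> sentEvents = new ArrayList<>();

        Container(boolean notifyWhenDone) {
            this.notifyWhenDone = notifyWhenDone;
        }

        // Assumed outcome of a launch that fails (for example on an RPC timeout).
        void launchFailed() {
            state = ContainerState.DONE;
            sentEvents.add("TA_CONTAINER_LAUNCH_FAILED");
        }

        void kill() {
            if (state == ContainerState.DONE && !notifyWhenDone) {
                return; // old behavior: silent no-op, no event is sent
            }
            if (state != ContainerState.DONE) {
                state = ContainerState.DONE; // stand-in for stopping the container
            }
            // Patched behavior: always confirm cleanup to the task attempt.
            sentEvents.add("TA_CONTAINER_CLEANED");
        }
    }

    public static void main(String[] args) {
        Container before = new Container(false);
        before.launchFailed();
        before.kill();
        System.out.println("before fix: " + before.sentEvents);

        Container after = new Container(true);
        after.launchFailed();
        after.kill();
        System.out.println("after fix:  " + after.sentEvents);
    }
}
```

With notifyWhenDone set to false, a kill that arrives after the failed
launch produces no TA_CONTAINER_CLEANED event, so a task attempt waiting
in FAIL_CONTAINER_CLEANUP would have nothing to react to.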

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical



[jira] [Updated] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

2012-12-12 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4833:
-------------------------------------

Assignee: Robert Parker

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> 
>
> Key: MAPREDUCE-4833
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Robert Parker
> Priority: Critical
