[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-12 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4992:
--

Target Version/s: trunk, 0.23.7, 2.0.4-beta  (was: trunk)
  Status: Patch Available  (was: Open)

> AM hung in RecoveryService
> --
>
> Key: MAPREDUCE-4992
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4992
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 2.0.2-alpha, trunk, 0.23.6
> Reporter: Robert Parker
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE-4992v1.patch, MAPREDUCE-4992v2.patch
>
>
> A job hung in the Recovery Service on an AM restart. Four map task events
> were not processed, which prevented the completed task count from reaching
> zero, the condition on which the recovery service exits. All four tasks
> were speculative.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-11 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4992:
-

Attachment: MAPREDUCE-4992v2.patch

Thanks for the review, Jason.  I switched to using an iterator to remove the 
entries, which is a much better approach.




[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-11 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4992:
--

Status: Open  (was: Patch Available)

This approach looks OK for the short term.  I'm not thrilled about the idea of 
explicitly losing information on task attempts that happened to not complete, 
as it will be odd in the history of the recovered AM to see a map task with a 
single attempt that ends in _1 or _2 instead of _0.  If this goes in, we should 
file a follow-up JIRA to fix recovery so attempts that were "in-flight" when 
the AM crashed are at least documented in some way on the subsequent AM (e.g., 
we mark them as KILLED or something, so that the user can at least see what 
nodes they ran on and what time they were launched).

There is one thing I'd like to see fixed in the patch.  When we're iterating 
over the taskAttempts in the {{taskInfo}} and filtering out attempts that didn't 
complete, we should walk and remove entries using an iterator rather than 
reaching around and calling {{remove}} on the map.
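
For illustration, the iterator-based removal suggested above might look roughly like the sketch below. It is not taken from the patch: the class name, the State enum, the attempt IDs, and the choice to treat only RUNNING as "didn't complete" are all assumptions made for the example, and the map is a plain java.util map rather than the real taskInfo structure.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; this is not the real Hadoop recovery code.
final class AttemptFilterSketch {

    /** Illustrative attempt states; the real history parser has its own types. */
    enum State { SUCCEEDED, FAILED, KILLED, RUNNING }

    /** Removes, in place, every attempt that never reached a terminal state. */
    static void dropIncompleteAttempts(Map<String, State> attempts) {
        Iterator<Map.Entry<String, State>> it = attempts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, State> entry = it.next();
            // Assumption: RUNNING is the only "incomplete" state in this sketch.
            if (entry.getValue() == State.RUNNING) {
                // Iterator.remove() is safe during iteration; calling
                // attempts.remove(key) here instead could trigger a
                // ConcurrentModificationException on a later it.next().
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        Map<String, State> attempts = new LinkedHashMap<>();
        attempts.put("attempt_m_000001_0", State.RUNNING);   // made-up ID
        attempts.put("attempt_m_000001_1", State.SUCCEEDED); // made-up ID
        dropIncompleteAttempts(attempts);
        System.out.println(attempts.keySet());
    }
}
```

Removing through the iterator keeps the traversal and the deletion on the same view of the map, which is the point of the review comment.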




[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-11 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4992:
-

Target Version/s: trunk
  Status: Patch Available  (was: Open)

Added code to identify all the completed task attempts in the JobHistory parser, 
and then to remove incomplete task attempts from the completed tasks identified 
in the recovery service. 
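
As a rough sketch of the two steps described above (collect the completed attempts, then prune everything else from the completed tasks), something like the following could work. All names here are hypothetical: the event representation, the "FINISHED" marker, and both method names are assumptions for illustration and do not reflect the actual JobHistory parser or RecoveryService APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the approach; not the code in the attached patch.
final class CompletedAttemptSketch {

    /** Step 1: collect the IDs of attempts for which a completion record exists. */
    static Set<String> completedAttemptIds(List<String[]> historyEvents) {
        Set<String> done = new HashSet<>();
        for (String[] event : historyEvents) {
            // event[0] = attempt ID, event[1] = event kind (illustrative strings)
            if ("FINISHED".equals(event[1])) {
                done.add(event[0]);
            }
        }
        return done;
    }

    /** Step 2: within each completed task, keep only the completed attempts. */
    static void pruneIncomplete(Map<String, Set<String>> attemptsByTask,
                                Set<String> done) {
        for (Set<String> attempts : attemptsByTask.values()) {
            attempts.retainAll(done);
        }
    }

    public static void main(String[] args) {
        List<String[]> events = new ArrayList<>();
        events.add(new String[] {"attempt_0", "STARTED"});
        events.add(new String[] {"attempt_1", "STARTED"});
        events.add(new String[] {"attempt_1", "FINISHED"});

        Map<String, Set<String>> attemptsByTask = new HashMap<>();
        attemptsByTask.put("task_a",
            new HashSet<>(Arrays.asList("attempt_0", "attempt_1")));

        pruneIncomplete(attemptsByTask, completedAttemptIds(events));
        System.out.println(attemptsByTask);
    }
}
```

In this model a speculative attempt that was still in flight when the AM went down has a start record but no completion record, so it is dropped before recovery tries to replay it.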




[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-11 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4992:
-

Attachment: MAPREDUCE-4992v1.patch




[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-11 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4992:
-

Affects Version/s: trunk, 2.0.2-alpha




[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-08 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4992:
-

Assignee: Robert Parker




[jira] [Updated] (MAPREDUCE-4992) AM hung in RecoveryService

2013-02-08 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4992:
--

Priority: Critical  (was: Major)

