[jira] [Updated] (MAPREDUCE-4890) Invalid TaskImpl state transitions when task fails while speculating
[ https://issues.apache.org/jira/browse/MAPREDUCE-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4890:
--
    Resolution: Fixed
    Fix Version/s: 0.23.6
                   2.0.3-alpha
    Hadoop Flags: Reviewed
    Status: Resolved (was: Patch Available)

Thanks for the review, Tom. I committed this to trunk, branch-2, and branch-0.23.

Invalid TaskImpl state transitions when task fails while speculating

    Key: MAPREDUCE-4890
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-4890
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: mr-am
    Affects Versions: 2.0.2-alpha, 0.23.5
    Reporter: Jason Lowe
    Assignee: Jason Lowe
    Priority: Critical
    Fix For: 2.0.3-alpha, 0.23.6
    Attachments: MAPREDUCE-4890.patch

There are a couple of issues when a task fails while speculating (i.e.: multiple attempts are active):
# The other active attempts are not killed.
# TaskImpl's FAILED state does not handle the T_ATTEMPT_* set of events which can be sent from the other active attempts. These all need to be handled since they can be sent asynchronously from the other active task attempts.

Failure to handle this properly means jobs that are configured to normally tolerate failures via mapreduce.map.failures.maxpercent or mapreduce.reduce.failures.maxpercent and also speculate can easily end up failing due to invalid state transitions rather than complete successfully with a few explicitly allowed task failures.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4890) Invalid TaskImpl state transitions when task fails while speculating
[ https://issues.apache.org/jira/browse/MAPREDUCE-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4890:
--
    Attachment: MAPREDUCE-4890.patch

Patch to kill active attempts when a task transitions to the FAILED state and also ignore all T_ATTEMPT_* events while in the FAILED state.
[jira] [Updated] (MAPREDUCE-4890) Invalid TaskImpl state transitions when task fails while speculating
[ https://issues.apache.org/jira/browse/MAPREDUCE-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4890:
--
    Assignee: Jason Lowe
    Target Version/s: 2.0.3-alpha, 0.23.6
    Status: Patch Available (was: Open)
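
For readers following this thread, the two behaviors the patch is described as adding (kill the other active attempts when the task transitions to FAILED, and ignore T_ATTEMPT_* events once in FAILED) can be illustrated with a small standalone model. This is only a sketch and not the actual TaskImpl code or the contents of MAPREDUCE-4890.patch: the class TaskFailureSketch, its enums, the maxAttempts parameter, and the attempt ids are hypothetical names chosen for illustration, and the real MR AppMaster uses its own state machine and event dispatcher rather than direct method calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a task with speculative attempts. NOT Hadoop's TaskImpl. */
class TaskFailureSketch {
    enum TaskState { RUNNING, FAILED }
    enum AttemptState { RUNNING, FAILED, KILLED }
    // Illustrative subset of T_ATTEMPT_* events (names are assumptions).
    enum TaskEventType {
        T_ATTEMPT_LAUNCHED, T_ATTEMPT_SUCCEEDED, T_ATTEMPT_FAILED, T_ATTEMPT_KILLED
    }

    private final int maxAttempts;            // hypothetical failure limit
    private int failedAttempts = 0;
    private TaskState state = TaskState.RUNNING;
    private final Map<String, AttemptState> attempts = new LinkedHashMap<>();
    private final List<String> killRequests = new ArrayList<>();

    TaskFailureSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void addAttempt(String attemptId) {
        attempts.put(attemptId, AttemptState.RUNNING);
    }

    /** Handle one event sent (possibly asynchronously) by an attempt. */
    void handle(TaskEventType type, String attemptId) {
        if (state == TaskState.FAILED) {
            // Behavior 2: late T_ATTEMPT_* events are ignored in FAILED
            // instead of being treated as invalid transitions.
            return;
        }
        switch (type) {
            case T_ATTEMPT_FAILED:
                attempts.put(attemptId, AttemptState.FAILED);
                failedAttempts++;
                if (failedAttempts >= maxAttempts) {
                    state = TaskState.FAILED;
                    // Behavior 1: kill whatever is still running.
                    killOtherActiveAttempts();
                }
                break;
            case T_ATTEMPT_KILLED:
                attempts.put(attemptId, AttemptState.KILLED);
                break;
            default:
                // Launch/success handling is omitted from this sketch.
                break;
        }
    }

    private void killOtherActiveAttempts() {
        for (Map.Entry<String, AttemptState> e : attempts.entrySet()) {
            if (e.getValue() == AttemptState.RUNNING) {
                e.setValue(AttemptState.KILLED);
                killRequests.add(e.getKey());
            }
        }
    }

    TaskState getState() { return state; }
    AttemptState getAttemptState(String attemptId) { return attempts.get(attemptId); }
    List<String> getKillRequests() { return killRequests; }
}
```

In this model, a task with two speculative attempts and a limit of one failure moves to FAILED when the first attempt fails, records a kill request for the second attempt, and then silently drops any event that second attempt sends afterwards.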