[jira] [Updated] (MAPREDUCE-6263) Configurable timeout between YARNRunner terminate the application and forcefully kill.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated MAPREDUCE-6263:

    Fix Version/s: 2.7.0

Configurable timeout between YARNRunner terminate the application and forcefully kill.
--

    Key: MAPREDUCE-6263
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-6263
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: client
    Affects Versions: 2.6.0
    Reporter: Jason Lowe
    Assignee: Eric Payne
    Fix For: 2.7.0
    Attachments: MAPREDUCE-6263.v1.txt, MAPREDUCE-6263.v2.txt

YARNRunner connects to the AM to send the kill job command then waits a hardcoded 10 seconds for the job to enter a terminal state. If the job fails to enter a terminal state in that time then YARNRunner will tell YARN to kill the application forcefully. The latter type of kill usually results in no job history, since the AM process is killed forcefully.

Ten seconds can be too short for large jobs in a large cluster, as it takes time to connect to all the nodemanagers, process the state machine events, and copy a large jhist file. The timeout should be more lenient or configurable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6263) Configurable timeout between YARNRunner terminate the application and forcefully kill.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated MAPREDUCE-6263:

    Summary: Configurable timeout between YARNRunner terminate the application and forcefully kill.  (was: Large jobs can lose history when killed due to brief client timeout)
[jira] [Updated] (MAPREDUCE-6263) Configurable timeout between YARNRunner terminate the application and forcefully kill.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated MAPREDUCE-6263:

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

I have committed the v2 patch to trunk, branch-2 and branch-2.7. Thanks [~eepayne] for the contribution!
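
For readers following the thread, the kill sequence in the issue description (ask the AM to kill the job, wait for a terminal state, then fall back to a forceful kill) might be sketched roughly as below. This is not the committed patch: the `JobHandle` interface, the `KillWithTimeout` class, and the `pollIntervalMs` parameter are hypothetical names invented for illustration, and the timeout is passed as a plain argument rather than read from job configuration.

```java
import java.util.concurrent.TimeUnit;

final class KillWithTimeout {

    /** Hypothetical stand-in for the client's view of a job; not a Hadoop API. */
    interface JobHandle {
        void requestGracefulKill(); // ask the AM to kill the job
        boolean isTerminal();       // has the job reached a terminal state?
        void forceKill();           // ask YARN to kill the application
    }

    /**
     * Requests a graceful kill, then polls until the job is terminal or the
     * timeout expires. Returns true if the job stopped on its own, false if
     * the forceful kill had to be used.
     */
    static boolean kill(JobHandle job, long timeoutMs, long pollIntervalMs) {
        job.requestGracefulKill();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() - deadline < 0) {
            if (job.isTerminal()) {
                return true; // graceful path: the AM had time to finish up
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop waiting early.
                Thread.currentThread().interrupt();
                break;
            }
        }
        if (job.isTerminal()) {
            return true;
        }
        job.forceKill(); // timeout expired: the AM may not have written history
        return false;
    }
}
```

With the timeout as a parameter instead of a constant, a large job on a large cluster could be given more than ten seconds before the forceful path is taken, which is the behavior the issue asks for.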