[jira] [Updated] (MAPREDUCE-6263) Configurable timeout between YARNRunner terminate the application and forcefully kill.

2015-03-30 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated MAPREDUCE-6263:

Fix Version/s: 2.7.0

 Configurable timeout between YARNRunner terminate the application and 
 forcefully kill.
 --

 Key: MAPREDUCE-6263
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6263
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Eric Payne
 Fix For: 2.7.0

 Attachments: MAPREDUCE-6263.v1.txt, MAPREDUCE-6263.v2.txt


 YARNRunner connects to the AM to send the kill job command, then waits a 
 hardcoded 10 seconds for the job to enter a terminal state.  If the job fails 
 to enter a terminal state in that time, YARNRunner tells YARN to kill the 
 application forcefully.  A forceful kill usually results in no job history, 
 since the AM process itself is killed.
 Ten seconds can be too short for large jobs in a large cluster, as it takes 
 time to connect to all the nodemanagers, process the state machine events, 
 and copy a large jhist file.  The timeout should be more lenient or 
 configurable.
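
 The behaviour described above can be sketched roughly as follows. This is 
 an illustration only, not the code from the attached patches: the class 
 GracefulKill, its methods, and the property key are hypothetical names, and 
 java.util.Properties stands in for the job configuration. The only value 
 taken from the description is the 10-second default.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from the Hadoop source.
final class GracefulKill {
    // Placeholder key, not necessarily the one the patch introduces.
    static final String HARD_KILL_TIMEOUT_KEY = "example.hard-kill-timeout-ms";
    // Matches the previously hardcoded 10-second wait.
    static final long DEFAULT_HARD_KILL_TIMEOUT_MS = 10_000L;

    // Reads the timeout from configuration, falling back to the old default.
    static long timeoutFrom(Properties conf) {
        return Long.parseLong(conf.getProperty(
            HARD_KILL_TIMEOUT_KEY, Long.toString(DEFAULT_HARD_KILL_TIMEOUT_MS)));
    }

    // Polls until the job is terminal or the timeout elapses.
    // Returns true if the job reached a terminal state in time.
    static boolean waitForTerminalState(BooleanSupplier isTerminal,
                                        long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (isTerminal.getAsBoolean()) {
                return true;
            }
            long remainingMs =
                TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false;
            }
            try {
                Thread.sleep(Math.min(pollMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    // Asks the AM to stop; falls back to a forceful kill only after the timeout.
    static String killJob(Runnable sendKillToAm, BooleanSupplier isTerminal,
                          Runnable forceKill, long timeoutMs) {
        sendKillToAm.run();
        if (waitForTerminalState(isTerminal, timeoutMs, 50L)) {
            return "graceful";
        }
        forceKill.run();
        return "forced";
    }
}
```

 With a setup like this, a site running large jobs could raise the timeout in 
 its configuration so the AM has time to finish writing history before the 
 forceful kill is attempted.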



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6263) Configurable timeout between YARNRunner terminate the application and forcefully kill.

2015-03-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6263:
--
Summary: Configurable timeout between YARNRunner terminate the application 
and forcefully kill.  (was: Large jobs can lose history when killed due to 
brief client timeout)



[jira] [Updated] (MAPREDUCE-6263) Configurable timeout between YARNRunner terminate the application and forcefully kill.

2015-03-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6263:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

I have committed the v2 patch to trunk, branch-2 and branch-2.7. Thanks 
[~eepayne] for the contribution!
