Hitesh Sharma created TEZ-4011:
----------------------------------

             Summary: Don't consider task attempt failed if container fails to launch and thus times out
                 Key: TEZ-4011
                 URL: https://issues.apache.org/jira/browse/TEZ-4011
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Hitesh Sharma


If a container fails to start (it never heartbeats back to the AM during 
launch), the container is considered timed out and the task attempt assigned to 
it is failed. That failure counts towards the failure count for the task. In 
some environments this is undesirable because such events are highly probable: 
the task itself never got the chance to run, yet each timeout counts towards 
the max task attempts and can lead to a failure. Raising the container 
heartbeat timeout avoids this but slows the job down, since the job simply 
waits for the container to launch. An alternative is to kill the task attempt, 
rather than fail it, when the container times out during launch. Killed 
attempts are not counted towards task attempt failures, which allows a more 
aggressive launch timeout. This behavior would be off by default, and users 
could opt into it where it makes sense for their environment.
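
To make the difference in accounting concrete, here is a rough sketch of the 
proposed behavior. It is not Tez code: the class, the field names, the 
kill_on_launch_timeout flag, and the limit of 4 failed attempts are all 
hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    LAUNCH_TIMEOUT = "launch_timeout"  # container never heartbeated to the AM
    TASK_ERROR = "task_error"          # task actually ran and then failed


@dataclass
class Task:
    # Hypothetical names and defaults; not taken from the Tez code base.
    max_failed_attempts: int = 4
    kill_on_launch_timeout: bool = False  # proposed opt-in, off by default
    failed: int = 0
    killed: int = 0
    state: str = "RUNNING"

    def on_attempt_finished(self, outcome: Outcome) -> str:
        """Update the counters for one finished attempt and return the task state."""
        if self.state != "RUNNING":
            return self.state  # task already reached a terminal state
        if outcome is Outcome.SUCCEEDED:
            self.state = "SUCCEEDED"
        elif outcome is Outcome.LAUNCH_TIMEOUT and self.kill_on_launch_timeout:
            # Proposed behavior: the attempt is killed, so it is rescheduled
            # without counting against the failure limit.
            self.killed += 1
        else:
            # Current behavior for launch timeouts, and always for real errors.
            self.failed += 1
            if self.failed >= self.max_failed_attempts:
                self.state = "FAILED"
        return self.state


def run(outcomes, **kwargs) -> Task:
    """Feed a sequence of attempt outcomes to a fresh task."""
    task = Task(**kwargs)
    for outcome in outcomes:
        task.on_attempt_finished(outcome)
    return task
```

In this sketch, four launch timeouts in a row fail the task under the default 
setting, while with the flag enabled the same timeouts are recorded as kills 
and a later successful attempt still completes the task. Attempts that ran and 
failed on their own are counted as failures either way.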

 

Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
