Jagadish created SAMZA-835:
------------------------------

             Summary: Certain Errors in AM don't cause retry of failed AM 
containers
                 Key: SAMZA-835
                 URL: https://issues.apache.org/jira/browse/SAMZA-835
             Project: Samza
          Issue Type: Bug
            Reporter: Jagadish
            Assignee: Jagadish


Currently, a Samza job can fail for several reasons:
1. Successive container failures occurring within a certain time window, or 
containers exceeding resource requests (like memory over-utilization).
2. AM failures, such as the AM not being able to spawn a container because an 
NM was unreachable, a Yarn exception when the AM tries to execute a container 
on an NM, NM token expiration, etc.

When there are type (2) failures, Yarn does not restart the AM. Most of these 
failures can be solved by retrying the AM attempt on a different host.

Reason: Currently, we explicitly unregister the AM from the RM when the AM 
shuts down, irrespective of the final app status. This causes Yarn to assume 
that the AM finished successfully (removing the AM from the RM state transition 
monitoring). 

When a job starts, the state is UNDEFINED. We manipulate the state to be 
SUCCESS or FAILURE depending on events we receive from the RM. 

When we end the job (possibly because of (1) or (2)), the key is to *not* call 
unregister when the state is still UNDEFINED. This will ensure that the AM 
attempt is retried.
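
The proposed shutdown behaviour can be sketched roughly as follows. This is 
only an illustration and not the actual Samza code: the class 
AmShutdownSketch, the JobStatus enum, the ResourceManagerClient interface and 
the shutdown method are all hypothetical names, and the real AM would talk to 
the RM through the Yarn client rather than through this stand-in interface.

```java
import java.util.ArrayList;
import java.util.List;

public class AmShutdownSketch {
    // Hypothetical job states, named after the ones used in this issue.
    enum JobStatus { UNDEFINED, SUCCESS, FAILURE }

    // Hypothetical stand-in for the client the AM uses to talk to the RM.
    interface ResourceManagerClient {
        void unregister(JobStatus finalStatus);
    }

    // Unregisters from the RM only when a final status has been decided.
    // Returns true if unregister was called, false if it was skipped.
    static boolean shutdown(JobStatus status, ResourceManagerClient rm) {
        if (status == JobStatus.UNDEFINED) {
            // No final status was ever set, so the AM is going down
            // unexpectedly. Skip unregister so the attempt can be retried.
            return false;
        }
        rm.unregister(status);
        return true;
    }

    public static void main(String[] args) {
        // Record every unregister call instead of contacting a real RM.
        List<JobStatus> calls = new ArrayList<>();
        ResourceManagerClient rm = calls::add;

        System.out.println("UNDEFINED unregistered: " + shutdown(JobStatus.UNDEFINED, rm));
        System.out.println("SUCCESS unregistered: " + shutdown(JobStatus.SUCCESS, rm));
        System.out.println("unregister calls: " + calls);
    }
}
```

Returning a boolean from shutdown is just a convenience for this sketch, so 
that a caller or a test can tell whether unregister was skipped.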



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
