Daniel Nishimura created SAMZA-2266:
---------------------------------------

             Summary: Introduce a backoff when there are repeated failures for 
host-affinity allocations
                 Key: SAMZA-2266
                 URL: https://issues.apache.org/jira/browse/SAMZA-2266
             Project: Samza
          Issue Type: Bug
            Reporter: Daniel Nishimura
            Assignee: Daniel Nishimura


The issue is that we retry the allocation of a dead container, and retry again 
on each subsequent failure, within a very small window of time (<1 min).

NodeManagers (NMs) have been observed to take ~2 mins to report themselves as 
unhealthy to the ResourceManager (RM).

If a job has host-affinity enabled, the retries keep allocating containers on 
the same unhealthy host while the RM still considers it healthy, and the 
repeated failures eventually kill the application.

This ticket is to evaluate the feasibility of, and possibly implement, a fix 
that introduces a time backoff on retries of container allocation on the same 
host, so that we eventually get a different host once the unhealthy NM's 
status is updated.
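
As a rough illustration of the proposed backoff, here is a minimal sketch that 
tracks consecutive failures per host and doubles the wait before the next retry 
on that host, up to a cap. The class name HostRetryBackoff, its methods, and 
the delay values are assumptions made for this example and are not existing 
Samza APIs.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Samza. */
final class HostRetryBackoff {
  private final long initialDelayMs;
  private final long maxDelayMs;
  // Number of consecutive allocation failures seen on each host.
  private final Map<String, Integer> failuresByHost = new HashMap<>();

  HostRetryBackoff(Duration initialDelay, Duration maxDelay) {
    this.initialDelayMs = initialDelay.toMillis();
    this.maxDelayMs = maxDelay.toMillis();
  }

  /** Records a failure on the host and returns how long to wait before retrying there. */
  Duration recordFailure(String host) {
    int failures = failuresByHost.merge(host, 1, Integer::sum);
    long delayMs = initialDelayMs;
    // Double the delay once per additional consecutive failure, never exceeding the cap.
    for (int i = 1; i < failures && delayMs < maxDelayMs; i++) {
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
    return Duration.ofMillis(Math.min(delayMs, maxDelayMs));
  }

  /** Clears the failure count once a container starts successfully on the host. */
  void recordSuccess(String host) {
    failuresByHost.remove(host);
  }
}
```

With an assumed initial delay of 30 seconds, the third consecutive failure on a 
host would wait 2 minutes, which is roughly the time an NM was observed to need 
before reporting itself as unhealthy.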

We may also want to look into abandoning host-affinity on the 8th attempt to 
restart a container, so that we don't kill the entire job.
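
The fallback idea could look something like the sketch below: once the restart 
attempt reaches a threshold, the request stops naming the preferred host. The 
class HostAffinityFallback, the chooseHost method, and the ANY_HOST placeholder 
value are hypothetical names used only for this example; the threshold of 8 is 
taken from the description above.

```java
/** Hypothetical helper for illustration only; not part of Samza. */
final class HostAffinityFallback {
  // Placeholder meaning "no host preference"; the value is an assumption for this sketch.
  static final String ANY_HOST = "ANY_HOST";

  private HostAffinityFallback() {}

  /**
   * Returns the host to request for the given restart attempt (1-based).
   * Keeps the preferred host until fallbackAttempt is reached, then drops affinity.
   */
  static String chooseHost(String preferredHost, int restartAttempt, int fallbackAttempt) {
    if (restartAttempt >= fallbackAttempt) {
      return ANY_HOST;
    }
    return preferredHost;
  }
}
```

For example, chooseHost("host-a", 7, 8) keeps the preferred host, while 
chooseHost("host-a", 8, 8) returns the no-preference placeholder.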



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
