Anindya Sinha created MESOS-7087:
------------------------------------

             Summary: Consider improving exponential backoff algorithm.
                 Key: MESOS-7087
                 URL: https://issues.apache.org/jira/browse/MESOS-7087
             Project: Mesos
          Issue Type: Improvement
          Components: general
            Reporter: Anindya Sinha
            Assignee: Anindya Sinha


There are 3 types of backoff algorithms in use:
1) Exponential backoff with randomness, as in framework/agent registration.
2) Exponential backoff with no randomness, as in status updates.
3) Linear backoff with randomness, as in executor registration.

Consider framework registration. The nth retry attempt is made after a random 
interval in the range [0 .. backoff * 2^(n-1)], as long as the interval is 
less than 1 min. The default value for backoff is 2 secs.
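
A minimal sketch of the scheme described above, for illustration only. It is 
not the Mesos implementation: the function names are hypothetical, and 
clamping the upper bound to 60 secs is an assumption about how the 1 min limit 
is applied.

```python
import random


def upper_bound(n, backoff=2.0, max_interval=60.0):
    """Upper end of the retry window for the nth retry (n >= 1), in seconds.

    Assumption: the 1 min limit is applied by clamping the window.
    """
    return min(backoff * 2 ** (n - 1), max_interval)


def current_retry_delay(n, backoff=2.0, max_interval=60.0, rng=random):
    """Random delay in [0, upper_bound(n)] before the nth retry."""
    return rng.uniform(0.0, upper_bound(n, backoff, max_interval))
```

With the default backoff of 2 secs, the windows for retries 1 through 6 are 
[0 .. 2], [0 .. 4], [0 .. 8], [0 .. 16], [0 .. 32] and [0 .. 60] secs. Note that 
the lower end of every window is 0.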

Although the current approach combines exponential backoff with randomness, we 
have observed that in clusters with thousands of agents and/or frameworks, the 
actual (randomized) retry interval can end up being very short for a 
substantial number of agents and/or frameworks, because the allowed range 
[0 .. <n>] starts at 0. As a result, the master is bombarded with a large 
number of messages and becomes overloaded.
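
To put a rough number on this: if the interval is drawn uniformly from 
[0 .. W], the fraction of clients that retry within t secs is t / W. The helper 
below is a hypothetical illustration of that arithmetic, and the figure of 
10000 agents is an assumed example, not a measurement.

```python
def fraction_retrying_within(threshold, n, backoff=2.0, max_interval=60.0):
    """Expected fraction of clients whose nth retry fires within `threshold`
    seconds, assuming a uniform draw from [0, min(backoff * 2^(n-1), max)]."""
    window = min(backoff * 2 ** (n - 1), max_interval)
    return min(threshold / window, 1.0)


# Assumed example: 10000 agents on their 3rd retry (window of 8 secs).
agents = 10000
early = agents * fraction_retrying_within(1.0, 3)  # 1250.0 agents within 1 sec
```

So even on the 3rd retry, roughly one in eight clients is expected to retry 
within 1 sec of its previous attempt.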

So, the main issues seen (especially with a large number of frameworks and/or 
agents) suggest the following improvements:

1) Every subsequent retry should be spaced apart from the previous attempt by 
a minimum deterministic amount.
2) Every subsequent retry interval should be greater than or equal to the 
previous interval.
3) The maximum retry interval should be configurable, since it can be a 
function of the initial backoff factor.
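
One possible way to meet the three points above is sketched below. This is 
only an illustration under stated assumptions, not a proposed patch: the class 
name, the choice of half the window as the deterministic minimum, and the 
default values are all hypothetical.

```python
import random


class MonotonicBackoff:
    """Exponential backoff with jitter, a guaranteed minimum delay, and
    non-decreasing delays (hypothetical sketch)."""

    def __init__(self, backoff=2.0, max_interval=60.0, rng=None):
        self.max_interval = max_interval       # point 3: configurable maximum
        self._ceiling = min(backoff, max_interval)
        self._previous = 0.0
        self._rng = rng if rng is not None else random.Random()

    def next_delay(self):
        """Return the delay in seconds before the next retry."""
        ceiling = self._ceiling
        # Point 1: never wait less than half of the current window.
        # Point 2: never wait less than the previous delay.
        floor = max(ceiling / 2.0, self._previous)
        delay = self._rng.uniform(floor, ceiling)
        self._previous = delay
        # Double the window for the next attempt, up to the maximum.
        self._ceiling = min(ceiling * 2.0, self.max_interval)
        return delay
```

With backoff = 2 secs, the first delay falls in [1 .. 2], the second in 
[2 .. 4], the third in [4 .. 8], and so on. Once the window reaches the 
maximum, each delay is drawn between the previous delay and the maximum, so 
the delays of a single client drift towards the maximum while still differing 
from one client to another.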




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
