Anindya Sinha created MESOS-7087:
------------------------------------

             Summary: Consider improving exponential backoff algorithm.
                 Key: MESOS-7087
                 URL: https://issues.apache.org/jira/browse/MESOS-7087
             Project: Mesos
          Issue Type: Improvement
          Components: general
            Reporter: Anindya Sinha
            Assignee: Anindya Sinha

There are 3 types of backoff algorithms in use:
1) Exponential backoff with randomness, as in framework/agent registration.
2) Exponential backoff with no randomness, as in status updates.
3) Linear backoff with randomness, as in executor registration.

Consider framework registration. The nth retry attempt is made after a random interval in the range [0 .. backoff * 2^(n-1)], as long as each interval is less than 1 min. The default value for backoff is 2 secs.

Although the current approach provides exponential backoff with randomness, we have observed that in clusters with thousands of agents and/or frameworks, retries can end up being very frequent for a substantial number of agents and/or frameworks, because the randomized interval is drawn from the range [0 .. <n>], whose lower bound is 0. This bombards the master with tons of messages, thereby overloading it.

The main issues seen (esp. for a large number of frameworks and/or agents) are that the following properties do not hold:
1) Every subsequent retry should be spaced off by a minimum deterministic amount from the previous attempt.
2) Every subsequent retry interval should be greater than or equal to the previous one.
3) Maximum retry interval should be configurable, since it can be a function of the initial backoff factor.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
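
For illustration only, here is a minimal Python sketch of the registration backoff described above, next to one possible scheme that satisfies the three properties listed. It is not the Mesos implementation and not a proposed patch. The function names, the `max_interval` parameter, the reading of the 1 min limit as a cap on the upper bound, and the choice of a lower bound at half the upper bound are all assumptions made for this example.

```python
import random


def current_delay(attempt, backoff=2.0, max_interval=60.0, rng=random):
    """Delay (seconds) before the nth retry as described in the ticket:
    uniform in [0, backoff * 2^(n-1)].
    Assumption: the 1 min limit is modeled as a cap on the upper bound."""
    upper = min(backoff * 2 ** (attempt - 1), max_interval)
    return rng.uniform(0.0, upper)


def floored_delays(attempts, backoff=2.0, max_interval=60.0, rng=random):
    """Hypothetical alternative, one of many possible designs.

    1) Each delay has a deterministic floor (half of the current upper bound).
    2) Each delay is >= the previous delay.
    3) The maximum interval is a parameter instead of a fixed constant.
    """
    delays = []
    previous = 0.0
    for n in range(1, attempts + 1):
        upper = min(backoff * 2 ** (n - 1), max_interval)
        # Randomize only within the upper half of the range, so the
        # delay can never collapse towards zero.
        delay = rng.uniform(upper / 2.0, upper)
        # Never retry sooner than the previous interval.
        delay = max(delay, previous)
        delays.append(delay)
        previous = delay
    return delays


if __name__ == "__main__":
    print([round(current_delay(n), 2) for n in range(1, 8)])
    print([round(d, 2) for d in floored_delays(7)])
```

With the current scheme, any attempt can draw a delay close to 0 regardless of n, which is the behavior the ticket identifies as overloading the master. In the sketch, the floor removes that possibility while keeping some randomness to spread retries out.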