[ https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Pronin updated MESOS-7688:
-------------------------------
    Attachment: reregistration.svg
                reregistration.perf.gz

Attached a perf script ([^reregistration.perf.gz]) and a flamegraph ([^reregistration.svg]) for a 2-minute sample of agent reregistration after master failover.

> Improve master failover performance by reducing unnecessary agent retries.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-7688
>                 URL: https://issues.apache.org/jira/browse/MESOS-7688
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, master
>            Reporter: Benjamin Mahler
>              Labels: scalability
>         Attachments: 1.2.0.png, reregistration.perf.gz, reregistration.svg
>
>
> Currently, during a failover the agents will (re-)register with the master.
> While the master is recovering, the master may drop messages from the agents,
> and so the agents must retry registration using a backoff mechanism. For
> large clusters, there can be a lot of overhead in processing unnecessary
> retries from the agents, given that these messages must be deserialized and
> contain all of the task / executor information many times over.
> In order to reduce this overhead, the idea is to avoid the need for agents to
> blindly retry (re-)registration with the master. Two approaches for this are:
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit
> of an abuse of MasterInfo unfortunately, but the idea is for agents to only
> (re-)register when they see that the master reaches a recovered state. Once
> recovered, the master will not drop messages, and therefore agents only need
> to retry when the connection breaks.
> (2) Have the master reply with a retry message when it's in the recovering
> state, so that agents get a clear signal that their messages were dropped.
> The agents only retry when the connection breaks or they get a retry message.
> This one is less optimal, because the master may have to process a lot of
> messages and send retries, but once the master is recovered, the master will
> process only a single (re-)registration from each agent. The number of
> (re-)registrations that occur while the master is recovering can be reduced
> to 1 in this approach if the master sends the retry message only after the
> master completes recovery.
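
To make the message counts in the description concrete, here is a rough Python sketch comparing blind backoff retries with the deferred variant of approach (2), where the master holds its retry reply until recovery completes. It is a toy model, not Mesos code: the `Master` class, both `simulate_*` functions, and the backoff parameters are all hypothetical, the backoff has no jitter, and every agent is assumed to send its first message at time 0.

```python
class Master:
    """Toy master that counts every message it has to deserialize."""

    def __init__(self):
        self.recovered = False
        self.processed = 0       # messages received (each one is deserialized)
        self.pending = []        # agents owed a retry reply after recovery
        self.registered = set()

    def handle_reregister(self, agent_id):
        self.processed += 1
        if self.recovered:
            self.registered.add(agent_id)
        elif agent_id not in self.pending:
            # Hypothetical deferred reply: remember the agent instead of
            # dropping its message silently.
            self.pending.append(agent_id)

    def finish_recovery(self):
        """Mark recovery done and return the agents to send a retry message to."""
        self.recovered = True
        to_notify, self.pending = self.pending, []
        return to_notify


def simulate_deferred_retry(num_agents):
    """Approach (2), deferred variant: one message per agent during recovery,
    then exactly one more after the master's retry message arrives."""
    master = Master()
    for agent_id in range(num_agents):
        master.handle_reregister(agent_id)
    for agent_id in master.finish_recovery():
        master.handle_reregister(agent_id)
    return master.processed, len(master.registered)


def simulate_blind_backoff(num_agents, recovery_time,
                           initial_backoff=1.0, max_backoff=8.0):
    """Current behaviour in simplified form: each agent resends with capped
    exponential backoff until a send lands at or after recovery_time."""
    processed = 0
    for _ in range(num_agents):
        t, backoff = 0.0, initial_backoff
        processed += 1                       # first attempt at t = 0
        while t < recovery_time:
            t += backoff
            backoff = min(backoff * 2, max_backoff)
            processed += 1                   # one more message to deserialize
    return processed


if __name__ == "__main__":
    print("blind backoff:", simulate_blind_backoff(100, 30.0))
    print("deferred retry:", simulate_deferred_retry(100))
```

In this model, 100 agents and a 30-second recovery cost the master 700 messages under blind backoff and 200 under the deferred retry reply. The gap widens as recovery takes longer, since the deferred variant always costs two messages per agent while the blind count keeps growing.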