[ https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-7688:
-----------------------------------
    Description: 
Currently, during a failover the agents will (re-)register with the master. 
While the master is recovering, the master may drop messages from the agents, 
and so the agents must retry registration using a backoff mechanism. For large 
clusters, there can be a lot of overhead in processing unnecessary retries from 
the agents, given that these messages must be deserialized and contain all of 
the task / executor information many times over.
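
To make the overhead concrete, here is a rough sketch (not Mesos code) that counts how many registration attempts are wasted while the master recovers. The backoff parameters, the function names, and the assumption that every message sent during recovery is dropped are all hypothetical; the ticket does not state the actual values.

```python
def dropped_attempts(recovery_secs, initial_backoff=1.0, factor=2.0,
                     max_backoff=60.0):
    """Count registration attempts from one agent that arrive while the
    master is still recovering (assumed here to all be dropped).

    All parameters are illustrative, not Mesos defaults.
    """
    dropped = 0
    now = 0.0
    backoff = initial_backoff
    while now < recovery_secs:
        dropped += 1                      # sent at time `now`, master drops it
        now += backoff                    # agent waits before retrying
        backoff = min(backoff * factor, max_backoff)
    return dropped


def cluster_dropped(num_agents, recovery_secs):
    """Total wasted messages the master must still receive and deserialize."""
    return num_agents * dropped_attempts(recovery_secs)


if __name__ == "__main__":
    print(cluster_dropped(num_agents=1000, recovery_secs=10))
```

Each of those wasted messages carries the agent's full task / executor information, which is why the cost grows with both cluster size and recovery time.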

In order to reduce this overhead, the idea is to avoid the need for agents to 
blindly retry (re-)registration with the master. Two approaches for this are:

(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of 
an abuse of MasterInfo unfortunately, but the idea is for agents to only 
(re-)register when they see that the master reaches a recovered state. Once 
recovered, the master will not drop messages, and therefore agents only need to 
retry when the connection breaks.

(2) Have the master reply with a retry message when it's in the recovering 
state, so that agents get a clear signal that their messages were dropped. The 
agents only retry when the connection breaks or they get a retry message. This 
one is less optimal, because the master may have to process a lot of messages 
and send retries, but once the master is recovered, the master will process 
only a single (re-)registration from each agent. The number of 
(re-)registrations that occur while the master is recovering can be reduced to 
1 in this approach if the master sends the retry message only after the master 
completes recovery.
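
A minimal sketch of approach (2), in the variant where the master sends the retry message only after it completes recovery. This is a toy in-process model, not the Mesos implementation: the `Master` and `Agent` classes, their method names, and the idea of remembering pending agents in a dictionary are assumptions made for illustration.

```python
class Agent:
    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.sent = 0            # registration messages this agent has sent
        self.registered = False

    def register(self, master):
        self.sent += 1
        master.handle_register(self)

    def on_retry(self, master):
        # Hypothetical retry message: the only trigger for re-sending
        # (besides a broken connection, which is not modeled here).
        self.register(master)

    def on_registered(self):
        self.registered = True


class Master:
    def __init__(self):
        self.recovered = False
        self.received = 0        # registration messages received in total
        self.pending = {}        # agents to notify once recovery completes

    def handle_register(self, agent):
        self.received += 1
        if self.recovered:
            agent.on_registered()
        else:
            # Still recovering: do not process the message, just remember
            # who asked, so a single retry can be sent later.
            self.pending[agent.agent_id] = agent

    def finish_recovery(self):
        self.recovered = True
        pending, self.pending = self.pending, {}
        for agent in pending.values():
            agent.on_retry(self)


if __name__ == "__main__":
    master = Master()
    agents = [Agent(i) for i in range(3)]
    for a in agents:
        a.register(master)       # arrives while the master is recovering
    master.finish_recovery()     # master now tells each agent to retry
    print([a.sent for a in agents], master.received)
```

In this model each agent sends one registration during recovery and one after it, so the master never sees the repeated backoff-driven retries described above.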

  was:
Currently, during a failover the agents will (re-)register with the master. 
While the master is recovering, the master may drop messages from the agents, 
and so the agents must retry registration using a backoff mechanism. For large 
clusters, there can be a lot of overhead in processing unnecessary retries from 
the agents, given that these messages must be deserialized and contain all of 
the task / executor information many times over.

In order to reduce this overhead, the idea is to avoid the need for agents to 
blindly retry (re-)registration with the master. Two approaches for this are:

(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of 
an abuse of MasterInfo unfortunately, but the idea is for agents to only 
(re-)register when they see that the master reaches a recovered state. Once 
recovered, the master will not drop messages, and therefore agents only need to 
retry when the connection breaks.

(2) Have the master reply with a retry message when it's in the recovering 
state, so that agents get a clear signal that their messages were dropped. This 
one is less optimal, because the master may have to process a lot of messages 
and send retries, but once the master is recovered, the master will process 
only a single (re-)registration from each agent. Here, agents only retry when 
the connection breaks or they get a retry message.


> Improve master failover performance by reducing unnecessary agent retries.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-7688
>                 URL: https://issues.apache.org/jira/browse/MESOS-7688
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, master
>            Reporter: Benjamin Mahler
>              Labels: scalability
>
> Currently, during a failover the agents will (re-)register with the master. 
> While the master is recovering, the master may drop messages from the agents, 
> and so the agents must retry registration using a backoff mechanism. For 
> large clusters, there can be a lot of overhead in processing unnecessary 
> retries from the agents, given that these messages must be deserialized and 
> contain all of the task / executor information many times over.
>
> In order to reduce this overhead, the idea is to avoid the need for agents to 
> blindly retry (re-)registration with the master. Two approaches for this are:
>
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit 
> of an abuse of MasterInfo unfortunately, but the idea is for agents to only 
> (re-)register when they see that the master reaches a recovered state. Once 
> recovered, the master will not drop messages, and therefore agents only need 
> to retry when the connection breaks.
>
> (2) Have the master reply with a retry message when it's in the recovering 
> state, so that agents get a clear signal that their messages were dropped. 
> The agents only retry when the connection breaks or they get a retry message. 
> This one is less optimal, because the master may have to process a lot of 
> messages and send retries, but once the master is recovered, the master will 
> process only a single (re-)registration from each agent. The number of 
> (re-)registrations that occur while the master is recovering can be reduced 
> to 1 in this approach if the master sends the retry message only after the 
> master completes recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
