[ 
https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5330:
--------------------------
    Fix Version/s: 0.28.3
                   0.27.4

> Agent should backoff before connecting to the master
> ----------------------------------------------------
>
>                 Key: MESOS-5330
>                 URL: https://issues.apache.org/jira/browse/MESOS-5330
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: David Robinson
>            Assignee: David Robinson
>             Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> When an agent starts, it launches a background task (libprocess process?) to 
> detect the leading master. When the leading master is detected (or changes), 
> the [SocketManager's link() method is called and a TCP connection to the 
> master is 
> established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
> The agent _then_ backs off before sending a ReRegisterSlave message via the 
> newly established connection. The agent needs to back off _before_ attempting 
> to establish a TCP connection to the master, not before sending the first 
> message over the connection.
> During scale tests at Twitter we discovered that agents can SYN flood the 
> master upon leader changes. The problem described in MESOS-5200, where 
> ephemeral connections are used, can then occur and exacerbate the flood. 
> The end result is a lot of hosts setting up and tearing down TCP connections 
> every slave_ping_timeout seconds (15 by default), connections failing to be 
> established, and hosts being marked as unhealthy and shut down. We observed 
> ~800 passive TCP connections per second on the leading master during scale 
> tests.
> The problem can be somewhat mitigated by tuning the kernel to handle a 
> thundering herd of TCP connections, but ideally there would not be a 
> thundering herd to begin with.
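
The ordering requested in the description (back off first, connect second) can be sketched roughly as follows. This is an illustration only, not the Mesos implementation: the function names, the `max_backoff_secs` parameter, and the uniform random delay are all assumptions made for the example.

```python
import random
import socket
import time


def backoff_delay(max_backoff_secs, rng=random):
    """Pick a uniformly random delay in [0, max_backoff_secs).

    Hypothetical helper: spreading agents over a window is one way to
    avoid every agent opening a connection at the same instant.
    """
    return rng.random() * max_backoff_secs


def connect_after_backoff(host, port, max_backoff_secs,
                          sleep=time.sleep,
                          connect=socket.create_connection):
    """Wait for a random delay BEFORE opening the TCP connection.

    `sleep` and `connect` are injectable so the ordering can be checked
    without touching the network (an assumption of this sketch).
    """
    delay = backoff_delay(max_backoff_secs)
    sleep(delay)                    # back off first ...
    return connect((host, port))    # ... then establish the connection
```

With this ordering, a leader change causes the agents' connection attempts to be spread across the backoff window rather than arriving at the new master all at once.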



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
