David Robinson created MESOS-5330:
-------------------------------------

             Summary: Agent should backoff before connecting to the master
                 Key: MESOS-5330
                 URL: https://issues.apache.org/jira/browse/MESOS-5330
             Project: Mesos
          Issue Type: Bug
            Reporter: David Robinson


When an agent starts, it launches a background task (a libprocess process?) to 
detect the leading master. When the leading master is detected (or changes), the 
[SocketManager's link() method is called and a TCP connection to the master is 
established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
The agent _then_ backs off before sending a ReRegisterSlave message over the 
newly established connection. The agent needs to back off _before_ attempting to 
establish a TCP connection to the master, not before sending the first message 
over the connection.
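
The ordering asked for above can be sketched as follows. This is a minimal 
illustration in Python, not Mesos or libprocess code: the names backoff_delay, 
connect_after_backoff and max_backoff_secs are hypothetical, and the uniform 
random delay is an assumption about how the backoff could be chosen.

```python
import random
import socket
import time


def backoff_delay(max_backoff_secs, rng=random):
    """Pick a uniformly random delay in [0, max_backoff_secs)."""
    return rng.random() * max_backoff_secs


def connect_after_backoff(host, port, max_backoff_secs=2.0,
                          sleep=time.sleep,
                          connect=socket.create_connection):
    """Wait a random delay, then open the TCP connection.

    The delay comes before the connection attempt, so agents that detect a
    new leader at the same moment spread their SYNs over the backoff window
    instead of sending them all at once.
    """
    delay = backoff_delay(max_backoff_secs)
    sleep(delay)                    # back off first ...
    return connect((host, port))    # ... then establish the connection
```

The sleep and connect parameters exist only so the ordering can be checked 
without a real network; a caller would normally leave them at their defaults.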

During scale tests at Twitter we discovered that agents can SYN flood the 
master upon leader changes. The problem described in MESOS-5200, where ephemeral 
connections are used, can then occur and exacerbates the situation. The end 
result is a lot of hosts setting up and tearing down TCP connections every 
slave_ping_timeout seconds (15 by default), connections failing to be 
established, and hosts being marked as unhealthy and shut down. We observed 
~800 passive TCP connections per second on the leading master during scale 
tests.

The problem can be somewhat mitigated by tuning the kernel to handle a 
thundering herd of TCP connections, but ideally there would not be a thundering 
herd to begin with.
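
To illustrate why spreading connection attempts removes the herd, here is a 
rough simulation sketch. It is not a model of the scale tests above: the 
function name, the agent count and the window length are hypothetical, and it 
assumes each agent picks an independent, uniformly random delay.

```python
import random
from collections import Counter


def peak_connections_per_second(num_agents, window_secs, seed=0):
    """Return the busiest one-second bucket of connection attempts.

    Each agent connects at a uniformly random time in [0, window_secs].
    A window of 0 models every agent connecting immediately.
    """
    rng = random.Random(seed)
    # Bucket each agent's connection time by whole second.
    buckets = Counter(int(rng.uniform(0, window_secs))
                      for _ in range(num_agents))
    return max(buckets.values())


# Hypothetical fleet size, chosen only for illustration.
herd = peak_connections_per_second(30000, window_secs=0)
spread = peak_connections_per_second(30000, window_secs=60)
print(herd, spread)
```

With no backoff window every attempt lands in the same second, while a 
60-second window brings the peak down to roughly num_agents / 60 plus random 
variation.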



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
