[ https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Yu updated MESOS-5330:
--------------------------
    Fix Version/s: 0.28.3
                   0.27.4

> Agent should backoff before connecting to the master
> ----------------------------------------------------
>
>                 Key: MESOS-5330
>                 URL: https://issues.apache.org/jira/browse/MESOS-5330
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: David Robinson
>            Assignee: David Robinson
>             Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> When an agent is started it starts a background task (libprocess process?) to
> detect the leading master. When the leading master is detected (or changes)
> the [SocketManager's link() method is called and a TCP connection to the
> master is
> established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
> The agent _then_ backs off before sending a ReRegisterSlave message via the
> newly established connection. The agent needs to back off _before_ attempting
> to establish a TCP connection to the master, not before sending the first
> message over the connection.
>
> During scale tests at Twitter we discovered that agents can SYN flood the
> master upon leader changes. The problem described in MESOS-5200, where
> ephemeral connections are used, can then occur and exacerbates the situation.
> The end result is a lot of hosts setting up and tearing down TCP connections
> every slave_ping_timeout seconds (15 by default), connections failing to be
> established, and hosts being marked as unhealthy and shut down. We observed
> ~800 passive TCP connections per second on the leading master during scale
> tests.
>
> The problem can be somewhat mitigated by tuning the kernel to handle a
> thundering herd of TCP connections, but ideally there would not be a
> thundering herd to begin with.
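
To illustrate the ordering the issue asks for, here is a minimal Python sketch. It is not Mesos code and does not reflect the actual fix: the function names, the uniform random delay, and the max_backoff_secs parameter are all assumptions made for illustration. It waits a random delay before opening the connection, and includes a small simulation of how that spreads connection attempts after a leader change.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(max_backoff_secs: float, rng: random.Random) -> float:
    """Pick a random delay in [0, max_backoff_secs] (hypothetical helper)."""
    return rng.uniform(0.0, max_backoff_secs)


def connect_after_backoff(
    connect: Callable[[], None],
    max_backoff_secs: float,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Wait a random delay, then call connect(). Returns the delay used.

    `connect` stands in for whatever establishes the TCP connection to
    the newly detected master; it is a placeholder, not a Mesos API.
    """
    if rng is None:
        rng = random.Random()
    delay = backoff_delay(max_backoff_secs, rng)
    sleep(delay)   # back off first ...
    connect()      # ... and only then establish the connection
    return delay


def peak_connections_per_second(
    num_agents: int, max_backoff_secs: float, seed: int = 0
) -> int:
    """Simulate agents reacting to one leader change at t=0.

    Each agent connects after its own random delay. Returns the number
    of connection attempts in the busiest one-second bucket.
    """
    rng = random.Random(seed)
    buckets = {}
    for _ in range(num_agents):
        second = int(backoff_delay(max_backoff_secs, rng))
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values(), default=0)
```

In this simulation, max_backoff_secs=0 puts every agent's attempt into the same one-second bucket, which is the thundering herd described above. A larger window spreads the same number of attempts across many buckets, so the master sees a far lower peak rate.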