[ https://issues.apache.org/jira/browse/MESOS-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph Wu updated MESOS-5723:
-----------------------------
    Fix Version/s: 0.27.4
                   0.28.3

> SSL-enabled libprocess will leak incoming links to forks
> --------------------------------------------------------
>
>                 Key: MESOS-5723
>                 URL: https://issues.apache.org/jira/browse/MESOS-5723
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>            Priority: Blocker
>              Labels: libprocess, mesosphere, ssl
>             Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> Encountered two different buggy behaviors that can be tracked down to the
> same underlying problem.
> Repro #1 (non-crashy):
> (1) Start a master. Doesn't matter if SSL is enabled or not.
> (2) Start an agent, with SSL enabled. Downgrade support has the same
> problem. The master/agent {{link}} to one another.
> (3) Run a sleep task. Keep this alive. If you inspect FDs at this point,
> you'll notice the task has inherited the {{link}} FD (master -> agent).
> (4) Restart the agent. Due to (3), the master's {{link}} stays open.
> (5) Check the master's logs for the agent's re-registration message.
> (6) Check the agent's logs for re-registration. The message will not appear.
> The master is actually using the old {{link}}, which is not connected to the
> agent.
> ----
> Repro #2 (crashy):
> (1) Start a master. Doesn't matter if SSL is enabled or not.
> (2) Start an agent, with SSL enabled. Downgrade support has the same problem.
> (3) Run ~100 sleep tasks one after the other, and keep them all alive. Each
> task links back to the agent. Due to an FD leak, each task will inherit the
> incoming links from all other actors...
> (4) At some point, the agent will run out of FDs and kernel panic.
> ----
> It appears that the SSL socket {{accept}} call is missing {{os::nonblock}}
> and {{os::cloexec}} calls:
> https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L794-L806
> For reference, here's the {{poll}} socket's {{accept}}:
> https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/poll_socket.cpp#L53-L75



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
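
For illustration only, here is a minimal sketch of the pattern the description says is missing on the SSL accept path: mark each accepted descriptor non-blocking and close-on-exec before handing it on. This is not the libprocess code and not the actual patch. The helpers `setNonblock`, `setCloexec`, and `acceptPrepared` are hypothetical names, written with plain POSIX `fcntl` as stand-ins for what `os::nonblock` and `os::cloexec` are assumed to do.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical stand-in for `os::nonblock`: sets O_NONBLOCK on `fd`.
bool setNonblock(int fd)
{
  int flags = ::fcntl(fd, F_GETFL);
  if (flags == -1) {
    return false;
  }
  return ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Hypothetical stand-in for `os::cloexec`: sets FD_CLOEXEC on `fd`, so
// the descriptor is closed automatically when a child calls exec().
bool setCloexec(int fd)
{
  int flags = ::fcntl(fd, F_GETFD);
  if (flags == -1) {
    return false;
  }
  return ::fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}

// Accepts one connection on `listenFd` and prepares the new descriptor.
// Returns the accepted descriptor, or -1 on any failure.
int acceptPrepared(int listenFd)
{
  int fd = ::accept(listenFd, nullptr, nullptr);
  if (fd == -1) {
    return -1;
  }

  if (!setNonblock(fd) || !setCloexec(fd)) {
    ::close(fd); // Do not leak a half-configured descriptor.
    return -1;
  }

  return fd;
}
```

With FD_CLOEXEC set, a task launched via fork and exec would no longer carry the accepted socket, which is the inheritance both repros depend on. Note that a separate `accept` followed by `fcntl` leaves a short window in which a concurrent fork and exec can still inherit the descriptor; on Linux, `accept4` with `SOCK_CLOEXEC` closes that window.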