[
https://issues.apache.org/jira/browse/MESOS-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954913#comment-16954913
]
Benjamin Mahler commented on MESOS-10005:
-----------------------------------------
Attached a patch that can be applied on 1.9.x to see if parallel broadcasting
helps alleviate this.
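To make the idea concrete for anyone reading along, below is a rough, self-contained C++ sketch of what "send the per-agent updates in parallel" could look like in the abstract. It is illustrative only: it is not the attached patch, it does not use libprocess, and the function name and agent addresses are made up.
{noformat}
// Hypothetical illustration of parallel fan-out of per-agent sends
// (NOT the attached patch, and not how the Mesos master actually
// dispatches messages internally).

#include <future>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the work of encoding and dispatching one
// UpdateFrameworkMessage to one agent.
void sendUpdateFrameworkMessage(const std::string& agentPid)
{
  std::cout << "Sending UpdateFrameworkMessage to " << agentPid << std::endl;
}

int main()
{
  const std::vector<std::string> agents = {
    "100.82.103.99:5051",
    "100.85.122.190:5051",
    "100.85.84.187:5051",
  };

  // Dispatch each send asynchronously instead of in a serial loop.
  // A real implementation would bound the concurrency rather than
  // spawn one task per agent.
  std::vector<std::future<void>> sends;
  sends.reserve(agents.size());

  for (const std::string& agent : agents) {
    sends.push_back(
        std::async(std::launch::async, sendUpdateFrameworkMessage, agent));
  }

  // Wait for all sends to complete.
  for (std::future<void>& send : sends) {
    send.wait();
  }

  return 0;
}
{noformat}
The point is simply that the per-agent sends no longer happen one after another in a single loop, so one slow or unreachable agent does not delay the rest of the broadcast.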
> Only broadcast framework update to agents associated with framework
> -------------------------------------------------------------------
>
> Key: MESOS-10005
> URL: https://issues.apache.org/jira/browse/MESOS-10005
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Affects Versions: 1.9.0
> Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos
> 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
> Reporter: Terra Field
> Priority: Major
> Attachments: 0001-Send-framework-updates-in-parallel.patch,
> mesos-master.log.gz, mesos-master.stacks - 2 - 1.9.0.gz, mesos-master.stacks
> - 3 - 1.9.0.gz, mesos-master.stacks - 4 - framework update - 1.9.0.gz,
> mesos-master.stacks - 5 - new healthy master.gz
>
>
> We have at any given time ~100 frameworks connected to our Mesos Master, with
> agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been
> encountering a crash (which I'll document separately), and when that happens,
> the new Mesos Master will sometimes (but not always) struggle to catch up and
> eventually crash again. Usually the third or fourth crash ends with a stable
> master (not ideal, but at least we can get to a stable state).
> Looking over the logs, I'm seeing hundreds of attempts to contact dead agents
> each second (and presumably many more contacts with healthy agents that don't
> log an error):
> {noformat}
> W1003 21:39:39.299998 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
> W1003 21:39:39.300143 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
> W1003 21:39:39.300285 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
> W1003 21:39:39.302122 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused
> {noformat}
> I gave [~bmahler] a perf trace of the master on Slack at this point, and it
> looks like the master is spending a significant amount of time doing
> framework update broadcasting. I'll attach the perf dump to the ticket, as
> well as the log of what the master did while it was alive.
> It sounds like currently every framework update (100 total frameworks in our
> case) results in a broadcast to all 6,000-11,000 agents (depending on how busy
> the cluster is). Also, since our health checks currently rely on the UI, we
> usually end up killing the master because it fails health checks for long
> periods of time while it is overwhelmed by these broadcasts.
> Could optimizations be made to either throttle these broadcasts or to only
> target the agents which need those framework updates?
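For illustration, the "only target agents which need the update" option asked about above could look roughly like the following self-contained C++ sketch. The types and containers are hypothetical simplifications, not the actual Mesos master data structures; the only assumption is that the master can know which agents currently run a given framework's tasks or executors.
{noformat}
// Hypothetical, simplified sketch of a targeted broadcast: only agents
// associated with the updated framework receive UpdateFrameworkMessage.

#include <iostream>
#include <map>
#include <set>
#include <string>

using FrameworkID = std::string;
using AgentID = std::string;

int main()
{
  // All registered agents (6,000-11,000 in the reporter's cluster).
  const std::set<AgentID> registeredAgents = {
    "agent-1", "agent-2", "agent-3", "agent-4"};

  // Which agents currently run tasks/executors for each framework
  // (hypothetical bookkeeping; the real master tracks this differently).
  const std::map<FrameworkID, std::set<AgentID>> frameworkAgents = {
    {"spark-framework", {"agent-2", "agent-4"}},
  };

  const FrameworkID updated = "spark-framework";

  // Behavior described in the ticket: broadcast to every registered agent.
  std::cout << "Full broadcast would hit " << registeredAgents.size()
            << " agents." << std::endl;

  // Targeted alternative: send only to agents associated with the framework.
  const auto it = frameworkAgents.find(updated);
  if (it != frameworkAgents.end()) {
    for (const AgentID& agent : it->second) {
      std::cout << "Sending UpdateFrameworkMessage for '" << updated
                << "' to " << agent << std::endl;
    }
  }

  return 0;
}
{noformat}
With ~100 frameworks and 6,000-11,000 agents, this would reduce each round of framework updates from roughly frameworks-times-agents sends to only as many sends as there are framework/agent associations.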