[ https://issues.apache.org/jira/browse/MESOS-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954913#comment-16954913 ]
Benjamin Mahler commented on MESOS-10005:
-----------------------------------------

Attached a patch that can be applied on 1.9.x to see if parallel broadcasting helps alleviate this.

> Only broadcast framework update to agents associated with framework
> -------------------------------------------------------------------
>
>                 Key: MESOS-10005
>                 URL: https://issues.apache.org/jira/browse/MESOS-10005
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.9.0
>         Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
>            Reporter: Terra Field
>            Priority: Major
>         Attachments: 0001-Send-framework-updates-in-parallel.patch, mesos-master.log.gz, mesos-master.stacks - 2 - 1.9.0.gz, mesos-master.stacks - 3 - 1.9.0.gz, mesos-master.stacks - 4 - framework update - 1.9.0.gz, mesos-master.stacks - 5 - new healthy master.gz
>
> We have at any given time ~100 frameworks connected to our Mesos Master, with agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been encountering a crash (which I'll document separately), and when that happens, the new Mesos Master will sometimes (but not always) struggle to catch up and eventually crash again. Usually the third or fourth crash ends with a stable master (not ideal, but at least we can get to stable).
>
> Looking over the logs, I'm seeing hundreds of attempts to contact dead agents each second (and presumably many contacts with healthy agents that don't throw an error):
> {noformat}
> W1003 21:39:39.299998 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
> W1003 21:39:39.300143 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
> W1003 21:39:39.300285 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
> W1003 21:39:39.302122 8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused
> {noformat}
> I gave [~bmahler] a perf trace of the master on Slack at this point, and it looks like the master is spending a significant amount of time doing framework update broadcasting. I'll attach the perf dump to the ticket, as well as the log of what the master did while it was alive.
>
> It sounds like currently, every framework update (100 total frameworks in our case) results in a broadcast to all 6,000-11,000 agents (depending on how busy the cluster is). Also, since our health checks currently rely on the UI, we usually end up killing the master because it fails its health check for long periods while it is overwhelmed by these broadcasts.
>
> Could optimizations be made to either throttle these broadcasts or to only target the agents that need those framework updates?
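
For illustration, here is a minimal, self-contained C++ sketch of the targeted send the summary asks for, contrasted with the current broadcast-to-all behavior. This is not Mesos source: the Master/Agent types, the send() stand-in, and the per-agent framework set are all hypothetical simplifications.

{code:cpp}
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

using FrameworkID = std::string;
using AgentID = std::string;

struct Agent {
  AgentID id;
  // Frameworks that have tasks or executors on this agent.
  std::unordered_set<FrameworkID> frameworks;
};

struct Master {
  std::unordered_map<AgentID, Agent> agents;

  // Stand-in for the real message send; here it just logs.
  void send(const AgentID& agent, const FrameworkID& framework) {
    std::cout << "UpdateFrameworkMessage for " << framework
              << " -> " << agent << "\n";
  }

  // Current behavior: one message per registered agent on every
  // framework update, i.e. O(frameworks x agents) messages overall.
  void broadcastUpdate(const FrameworkID& framework) {
    for (const auto& [id, agent] : agents) {
      send(id, framework);
    }
  }

  // Proposed behavior: only contact agents actually associated with
  // the framework, so each update touches a subset of the cluster.
  void targetedUpdate(const FrameworkID& framework) {
    for (const auto& [id, agent] : agents) {
      if (agent.frameworks.count(framework) > 0) {
        send(id, framework);
      }
    }
  }
};

int main() {
  Master master;
  master.agents["agent-1"] = {"agent-1", {"spark"}};
  master.agents["agent-2"] = {"agent-2", {}};
  master.agents["agent-3"] = {"agent-3", {"spark", "marathon"}};

  std::cout << "--- broadcast ---\n";
  master.broadcastUpdate("spark");  // hits all 3 agents

  std::cout << "--- targeted ---\n";
  master.targetedUpdate("spark");   // hits only agent-1 and agent-3
}
{code}

In practice the master would presumably keep a per-framework index of agents rather than scanning the whole agent map as this sketch does, and the attached parallel-sends patch is complementary: it overlaps the connection attempts (including the ones that currently fail one by one against dead agents) rather than reducing their number.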