[ https://issues.apache.org/jira/browse/MESOS-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954913#comment-16954913 ]

Benjamin Mahler commented on MESOS-10005:
-----------------------------------------

Attached a patch that can be applied on 1.9.x to see if parallel broadcasting 
helps alleviate this.
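
For context, here is a rough sketch of the idea behind the patch (this is not the attached patch, and not actual Mesos/libprocess code): fan the per-agent sends out over a few worker threads instead of issuing them serially. The function name, thread count, and the use of raw std::thread are assumptions for illustration only.

{noformat}
// Hypothetical sketch of parallel broadcasting, NOT the attached patch:
// split the per-agent UpdateFrameworkMessage sends across a few worker
// threads instead of issuing them serially from one thread.
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the real per-agent send path (assumed name); here it
// only logs which agent would receive the update.
static void sendUpdateFrameworkMessage(const std::string& agentPid)
{
  std::printf("UpdateFrameworkMessage -> %s\n", agentPid.c_str());
}

// Fan the sends out over `numWorkers` threads; each worker takes a
// strided slice of the agent list so no locking is needed.
static void broadcastFrameworkUpdate(
    const std::vector<std::string>& agentPids,
    size_t numWorkers = 8)
{
  std::vector<std::thread> workers;

  for (size_t w = 0; w < numWorkers; ++w) {
    workers.emplace_back([&agentPids, numWorkers, w]() {
      for (size_t i = w; i < agentPids.size(); i += numWorkers) {
        sendUpdateFrameworkMessage(agentPids[i]);
      }
    });
  }

  for (std::thread& t : workers) {
    t.join();
  }
}

int main()
{
  broadcastFrameworkUpdate(
      {"100.82.103.99:5051", "100.85.122.190:5051", "100.85.84.187:5051"});
  return 0;
}
{noformat}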

> Only broadcast framework update to agents associated with framework
> -------------------------------------------------------------------
>
>                 Key: MESOS-10005
>                 URL: https://issues.apache.org/jira/browse/MESOS-10005
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.9.0
>         Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 
> 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
>            Reporter: Terra Field
>            Priority: Major
>         Attachments: 0001-Send-framework-updates-in-parallel.patch, 
> mesos-master.log.gz, mesos-master.stacks - 2 - 1.9.0.gz, mesos-master.stacks 
> - 3 - 1.9.0.gz, mesos-master.stacks - 4 - framework update - 1.9.0.gz, 
> mesos-master.stacks - 5 - new healthy master.gz
>
>
> We have at any given time ~100 frameworks connected to our Mesos Master with 
> agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been 
> encountering a crash (which I'll document separately), and when that happens, the 
> new Mesos Master will sometimes (but not always) struggle to catch up, and 
> eventually crash again. Usually the third or fourth crash will end with a 
> stable master (not ideal, but at least we can get to stable).
> Looking over the logs, I'm seeing hundreds of attempts to contact dead agents 
> each second (and presumably many contacts with healthy agents that don't 
> throw an error):
> {noformat}
> W1003 21:39:39.299998  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
> W1003 21:39:39.300143  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
> W1003 21:39:39.300285  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
> W1003 21:39:39.302122  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused
> {noformat}
> I gave [~bmahler] a perf trace of the master on Slack at this point, and it 
> looks like the master is spending a significant amount of time doing 
> framework update broadcasting. I'll attach the perf dump to the ticket, as 
> well as the log of what the master did while it was alive.
> It sounds like currently, every framework update (100 total frameworks in our 
> case) results in a broadcast to all 6000-11000 agents (depending on how busy 
> the cluster is). Also, since our health checks rely on the UI currently, we 
> usually end up killing the master because it fails a health check for long 
> periods of time while overwhelmed by doing these broadcasts.
> Could optimizations be made to either throttle these broadcasts or to only 
> target the agents that need those framework updates?
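
To illustrate the targeted-broadcast idea raised in the description above, here is a rough, hypothetical sketch (the types, names, and data structure are illustrative, not taken from the Mesos source): the master keeps a per-framework set of the agents that actually run that framework's tasks or executors, and on a framework update it sends UpdateFrameworkMessage only to that set instead of to every registered agent.

{noformat}
// Hypothetical sketch of targeted (rather than cluster-wide) framework
// update broadcasting; types and names are illustrative only.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <unordered_set>

using FrameworkID = std::string;
using AgentID = std::string;

struct Master
{
  // Agents known to be running tasks or executors for each framework.
  std::unordered_map<FrameworkID, std::unordered_set<AgentID>> frameworkAgents;

  // Stand-in for the real send path; here it only logs the target.
  void sendUpdateFrameworkMessage(
      const AgentID& agent, const FrameworkID& framework)
  {
    std::printf("UpdateFrameworkMessage (%s) -> %s\n",
                framework.c_str(), agent.c_str());
  }

  // Send the update only to the agents associated with `framework`,
  // instead of to every registered agent in the cluster.
  void broadcastFrameworkUpdate(const FrameworkID& framework)
  {
    auto it = frameworkAgents.find(framework);
    if (it == frameworkAgents.end()) {
      return; // No agents currently associated with this framework.
    }

    for (const AgentID& agent : it->second) {
      sendUpdateFrameworkMessage(agent, framework);
    }
  }
};

int main()
{
  Master master;
  master.frameworkAgents["spark-framework"] = {"100.82.103.99:5051"};
  master.broadcastFrameworkUpdate("spark-framework");
  return 0;
}
{noformat}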



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
