[ https://issues.apache.org/jira/browse/MESOS-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niklas Quarfot Nielsen updated MESOS-2110:
------------------------------------------
    Shepherd: Niklas Quarfot Nielsen

> Configurable Ping Timeouts
> --------------------------
>
>                 Key: MESOS-2110
>                 URL: https://issues.apache.org/jira/browse/MESOS-2110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Adam B
>            Assignee: Adam B
>              Labels: master, network, slave, timeout
>
> After a series of ping-failures, the master considers the slave lost and
> calls shutdownSlave, requiring such a slave that reconnects to kill its tasks
> and re-register as a new slaveId. On the other side, after a similar timeout,
> the slave will consider the master lost and try to detect a new master. These
> timeouts are currently hardcoded constants (5 * 15s), which may not be
> well-suited for all scenarios.
> - Some clusters may tolerate a longer slave process restart period, and
>   wouldn't want tasks to be killed upon reconnect.
> - Some clusters may have higher-latency networks (e.g. cross-datacenter, or
>   for volunteer computing efforts), and would like to tolerate longer periods
>   without communication.
> We should provide flags/mechanisms on the master to control its tolerance for
> non-communicative slaves, and (less importantly?) on the slave to tolerate
> missing masters.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
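
For illustration only, the tolerance described in the issue (5 * 15s, i.e. 75 seconds of missed pings) could be made configurable roughly as sketched below. This is not Mesos code: `PingPolicy`, `PingMonitor`, `ping_interval_secs` and `max_missed_pings` are hypothetical names chosen for the sketch, and it assumes that the "series of ping-failures" means consecutive failures, so that a successful reply resets the count.

```python
from dataclasses import dataclass


@dataclass
class PingPolicy:
    """Hypothetical tunables; the defaults mirror the hardcoded 5 * 15s."""
    ping_interval_secs: float = 15.0  # assumed: how long to wait for each ping reply
    max_missed_pings: int = 5         # assumed: consecutive misses tolerated

    def tolerance_secs(self) -> float:
        # Total silent period before the peer is considered lost.
        return self.ping_interval_secs * self.max_missed_pings


class PingMonitor:
    """Counts consecutive missed pings for one peer (master or slave)."""

    def __init__(self, policy: PingPolicy) -> None:
        self.policy = policy
        self.missed = 0

    def on_pong(self) -> None:
        # Assumption: any successful reply resets the failure count.
        self.missed = 0

    def on_timeout(self) -> bool:
        # Record one missed ping; return True once the peer should be
        # treated as lost under the configured policy.
        self.missed += 1
        return self.missed >= self.policy.max_missed_pings


if __name__ == "__main__":
    default = PingPolicy()
    print(default.tolerance_secs())  # 15.0 * 5

    # A higher-latency (e.g. cross-datacenter) cluster might choose a
    # longer tolerance; these values are arbitrary examples.
    relaxed = PingPolicy(ping_interval_secs=30.0, max_missed_pings=10)
    print(relaxed.tolerance_secs())
```

Keeping the interval and the miss count as separate values, rather than a single total, lets an operator trade detection latency against sensitivity to brief network interruptions; whether Mesos should expose one knob or two is left open by the issue.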