[ 
https://issues.apache.org/jira/browse/MESOS-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-2679.
------------------------------------
    Resolution: Not A Problem

The slave needs to be run under a supervisor process that restarts it if it 
terminates. Mesos does not currently provide such a watchdog process. 
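
As a rough illustration of what such a watchdog does, here is a minimal 
Python sketch. It is not part of Mesos. The function name supervise and its 
parameters are hypothetical, and the mesos-slave command line in the comment 
is only an example based on the master address in the logs below. In practice 
an existing process supervisor (systemd, supervisord, monit, and so on) is 
likely a better fit than a hand-rolled loop.

```python
import subprocess
import time


def supervise(cmd, max_runs=None, backoff_seconds=5.0, sleep=time.sleep):
    """Run cmd and start it again every time it exits.

    max_runs=None restarts forever; a finite value is handy for testing.
    Returns (number_of_runs, last_exit_code).
    """
    runs = 0
    last_exit_code = None
    while max_runs is None or runs < max_runs:
        if runs > 0:
            # Pause between restarts so a crash loop does not spin the CPU.
            sleep(backoff_seconds)
        # subprocess.call blocks until the child process terminates.
        last_exit_code = subprocess.call(cmd)
        runs += 1
    return runs, last_exit_code


# Hypothetical usage (assumes mesos-slave is on PATH):
# supervise(["mesos-slave", "--master=192.168.1.10:5050"])
```

The loop restarts the slave regardless of why it exited, including the 
master-initiated shutdown described in this issue.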

As for the health check failure: the current timeout is 75 seconds. Once the 
timeout elapsed and the master sent the shutdown message, it looks like it 
took approximately 12 seconds for the message to reach the slave, which 
suggests that an actual network issue led to the health check failure. The 
master shuts down slaves that it cannot communicate with, so this is 
expected behavior.
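
The 12 second figure can be checked against the two log timestamps quoted in 
the issue. The sketch below is only a sanity check: the helper name 
parse_glog_time is hypothetical, the year 2015 is inferred from the slave ID 
prefix 20150430, and the result is only meaningful if the master and slave 
clocks were reasonably well synchronized.

```python
from datetime import datetime


def parse_glog_time(prefix, year=2015):
    """Parse a glog prefix such as 'I0430 15:12:12.737057'.

    The leading letter is the severity; the rest is MMDD HH:MM:SS.micros.
    glog omits the year, so it is passed in (assumed to be 2015 here).
    """
    stamp = prefix[1:]  # drop the severity letter (I, W, E, F)
    return datetime.strptime(f"{year}{stamp}", "%Y%m%d %H:%M:%S.%f")


# Timestamps copied from the logs quoted in the issue.
master_sent = parse_glog_time("I0430 15:12:00.615777")
slave_received = parse_glog_time("I0430 15:12:12.737057")

delay_seconds = (slave_received - master_sent).total_seconds()
print(f"shutdown message delay: {delay_seconds:.2f} s")
```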

> Slave asked to shut down by master because 'health check timed out'
> -------------------------------------------------------------------
>
>                 Key: MESOS-2679
>                 URL: https://issues.apache.org/jira/browse/MESOS-2679
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>    Affects Versions: 0.22.1
>            Reporter: Littlestar
>
> I run Spark 1.3.1 on Mesos 0.22.1 rc6 (linux64), and some Mesos slave 
> nodes go offline.
> slave node logs:
> I0430 15:12:12.737057 32354 slave.cpp:571] Slave asked to shut down by 
> master@192.168.1.10:5050 because 'health check timed out'
> master node logs:
> I0430 15:12:00.615777 19759 master.cpp:237] Shutting down slave 
> 20150430-141442-1214949568-5050-19747-S2 due to health check timeout
> W0430 15:12:00.616083 19751 master.cpp:3417] Shutting down slave 
> 20150430-141442-1214949568-5050-19747-S2 at slave(1)@192.168.1.15:5051 
> (hpblade05) with message 'health check timed out'
> Why does the slave go offline and not restart itself? 
> Is there any configuration to increase this timeout interval?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
