IIRC, the standalone master detector (the one used when the scheduler is
given the master's IP address directly instead of going through ZooKeeper)
doesn't re-detect the master when the master process restarts. That's a
limitation of that detector: it's mainly meant for testing and isn't
recommended for production use. For production, please use the ZooKeeper
master detector (the one used when the master is specified via ZooKeeper).
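
To make the distinction concrete, here is a rough sketch of how the master
string maps to a detector. This is not Mesos code: DetectorKind, startsWith
and detectorKindFor are hypothetical names made up for illustration, and the
rule itself (a "zk://" URL selects the ZooKeeper detector, a plain ip:port
selects the standalone one) is my assumption about the behaviour, so please
check it against the Mesos sources before relying on it.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration only; none of these names exist in Mesos.
enum class DetectorKind { Standalone, ZooKeeper };

// Returns true if `s` begins with `prefix`.
inline bool startsWith(const std::string& s, const std::string& prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

// Assumption: a "zk://" master URL selects the ZooKeeper detector, and
// anything else (e.g. "127.0.0.1:5050") falls back to the standalone one.
inline DetectorKind detectorKindFor(const std::string& master) {
  return startsWith(master, "zk://") ? DetectorKind::ZooKeeper
                                     : DetectorKind::Standalone;
}

// Example usage with the address from the report and a made-up zk URL.
inline void example() {
  assert(detectorKindFor("127.0.0.1:5050") == DetectorKind::Standalone);
  assert(detectorKindFor("zk://zk1:2181,zk2:2181/mesos") ==
         DetectorKind::ZooKeeper);
}
```

So with "127.0.0.1:5050", as in your reproduction, you would end up on the
standalone detector and hit the limitation above.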

On Fri, Dec 20, 2019 at 5:11 AM Charles-François Natali <cf.nat...@gmail.com>
wrote:

> Hi,
>
> It seems that the C++ scheduler driver doesn't detect loss of the
> connection to the master when not using zookeeper.
>
> A simple way to reproduce this is to start a server passing it e.g.
> "--ip=127.0.0.1", start the scheduler driver passing it "127.0.0.1:5050",
> and then send a SIGKILL to the master. The scheduler logs the following:
>
>
> I1220 10:56:11.679347 10635 process.cpp:2928] Resuming
> __reaper__(1)@192.168.65.76:34345 at 2019-12-20
> 10:56:11.679366144+00:00
> I1220 10:56:11.679392 10635 clock.cpp:279] Created a timer for
> __reaper__(1)@192.168.65.76:34345 in 100ms in the future (2019-12-20
> 10:56:11.779389952+00:00)
> I1220 10:56:11.690646 10631 process.cpp:2928] Resuming
> scheduler-6a93a8e3-5a8f-4195-bde2-718b5832d317@192.168.65.76:34345 at
> 2019-12-20 10:56:11.690665984+00:00
> I1220 10:56:11.690775 10632 process.cpp:2928] Resuming
> __http__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.690784000+00:00
> I1220 10:56:11.690806 10632 process.cpp:3088] Cleaning up
> __http__(1)@192.168.65.76:34345
> I1220 10:56:11.690914 10632 process.cpp:2928] Resuming
> help@192.168.65.76:34345 at 2019-12-20 10:56:11.690921984+00:00
>
> An strace confirms that the process receives EOF when reading from the
> socket, but Scheduler::disconnected isn't called.
> Is that expected?
>
> Or is it assumed that the scheduler relies on zookeeper for detection?
>
> Cheers,
>
> Charles
>
