Thanks.

That's what I thought. The problem, though, is that the zookeeper detector
might not detect the failure when the connection to the master fails. One
way this could happen is, for example, a firewall causing the TCP
connection from the framework to the master to fail while the zookeeper
connections (from master to zk and from framework to zk) still work.
Unlikely but possible, I think. Having the driver detect and fail upon
EOF/socket error would guard against that.
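
To make the idea concrete, here is a minimal sketch of the kind of check I
have in mind. It uses plain POSIX sockets, not libprocess or the actual
driver code, and the names PeerState and checkPeer are hypothetical, made
up purely for illustration. It only shows how EOF and socket errors can be
told apart from an idle but healthy connection.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical names for illustration; not part of Mesos or libprocess.
enum class PeerState { Alive, Closed, Error };

// Peeks at the socket without consuming any data and without blocking.
PeerState checkPeer(int fd) {
  char byte;
  for (;;) {
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n > 0) {
      return PeerState::Alive;   // Data is pending; connection is usable.
    }
    if (n == 0) {
      return PeerState::Closed;  // EOF: the peer closed the connection.
    }
    if (errno == EINTR) {
      continue;                  // Interrupted by a signal; retry.
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      return PeerState::Alive;   // Nothing to read, but no EOF or error.
    }
    return PeerState::Error;     // Socket error, e.g. ECONNRESET.
  }
}
```

In the driver, a Closed or Error result on the master connection is the
point at which I would expect something like Scheduler::disconnected to be
triggered, independently of what the detector reports.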

On Thu, 26 Dec 2019, 18:07 Vinod Kone, <vinodk...@apache.org> wrote:

> IIRC, the standalone master detector (the detector that's used when using a
> local ip address of the master and not zk) doesn't re-detect when the master
> process restarts. It's a limitation of that detector since it's mainly used
> for testing purposes and not recommended for production use. For
> production, please use the zookeeper master detector (this detector is used
> when using zookeeper).
>
> On Fri, Dec 20, 2019 at 5:11 AM Charles-François Natali
> <cf.nat...@gmail.com> wrote:
>
> > Hi,
> >
> > It seems that the C++ scheduler driver doesn't detect loss of the
> > connection to the master when not using zookeeper.
> >
> > A simple way to reproduce this is to start a master passing it e.g.
> > "--ip=127.0.0.1", start the scheduler driver passing it
> > "127.0.0.1:5050", and then send a SIGKILL to the master. The scheduler
> > logs the following:
> >
> >
> > I1220 10:56:11.679347 10635 process.cpp:2928] Resuming
> > __reaper__(1)@192.168.65.76:34345 at 2019-12-20
> > 10:56:11.679366144+00:00
> > I1220 10:56:11.679392 10635 clock.cpp:279] Created a timer for
> > __reaper__(1)@192.168.65.76:34345 in 100ms in the future (2019-12-20
> > 10:56:11.779389952+00:00)
> > I1220 10:56:11.690646 10631 process.cpp:2928] Resuming
> > scheduler-6a93a8e3-5a8f-4195-bde2-718b5832d317@192.168.65.76:34345 at
> > 2019-12-20 10:56:11.690665984+00:00
> > I1220 10:56:11.690775 10632 process.cpp:2928] Resuming
> > __http__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.690784000+00:00
> > I1220 10:56:11.690806 10632 process.cpp:3088] Cleaning up
> > __http__(1)@192.168.65.76:34345
> > I1220 10:56:11.690914 10632 process.cpp:2928] Resuming
> > help@192.168.65.76:34345 at 2019-12-20 10:56:11.690921984+00:00
> >
> > An strace confirms that the process receives EOF when reading from the
> > socket, but Scheduler::disconnected isn't called.
> > Is that expected?
> >
> > Or is it assumed that the scheduler relies on zookeeper for detection?
> >
> > Cheers,
> >
> > Charles
> >
>
