Perfect, thanks!


On Mon, 30 Dec 2019, 13:42 Vinod Kone, <vinodk...@gmail.com> wrote:

> In the latest versions of Mesos, that is handled via heartbeats.
>
> Thanks,
> Vinod
>
> > On Dec 30, 2019, at 4:37 AM, Charles-François Natali <cf.nat...@gmail.com> wrote:
> >
> > Thanks.
> >
> > That's what I thought. The problem, though, is that the zookeeper
> > detector might not detect the failure when the connection to the master
> > fails. For example, a firewall could cause the TCP connection from the
> > framework to the master to fail while the zookeeper connections (from
> > master to zk and framework to zk) still work. Unlikely but possible, I
> > think. Having the driver detect and fail upon EOF/socket error would
> > guard against that.
> >
> >
> >
> >
> >
> >> On Thu, 26 Dec 2019, 18:07 Vinod Kone, <vinodk...@apache.org> wrote:
> >>
> >> IIRC, the standalone master detector (the detector that's used when using
> >> a local ip address of the master and not zk) doesn't re-detect when the
> >> master process restarts. It's a limitation of that detector since it's
> >> mainly used for testing purposes and not recommended for production use.
> >> For production, please use the zookeeper master detector (this detector is
> >> used when using zookeeper).
> >>
> >> On Fri, Dec 20, 2019 at 5:11 AM Charles-François Natali <cf.nat...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> It seems that the C++ scheduler driver doesn't detect loss of the
> >>> connection to the master when not using zookeeper.
> >>>
> >>> A simple way to reproduce this is to start a master passing it e.g.
> >>> "--ip=127.0.0.1", start the scheduler driver passing it
> >>> "127.0.0.1:5050", and then send a SIGKILL to the master. The scheduler
> >>> logs the following:
> >>>
> >>>
> >>> I1220 10:56:11.679347 10635 process.cpp:2928] Resuming
> >>> __reaper__(1)@192.168.65.76:34345 at 2019-12-20
> >>> 10:56:11.679366144+00:00
> >>> I1220 10:56:11.679392 10635 clock.cpp:279] Created a timer for
> >>> __reaper__(1)@192.168.65.76:34345 in 100ms in the future (2019-12-20
> >>> 10:56:11.779389952+00:00)
> >>> I1220 10:56:11.690646 10631 process.cpp:2928] Resuming
> >>> scheduler-6a93a8e3-5a8f-4195-bde2-718b5832d317@192.168.65.76:34345 at
> >>> 2019-12-20 10:56:11.690665984+00:00
> >>> I1220 10:56:11.690775 10632 process.cpp:2928] Resuming
> >>> __http__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.690784000+00:00
> >>> I1220 10:56:11.690806 10632 process.cpp:3088] Cleaning up
> >>> __http__(1)@192.168.65.76:34345
> >>> I1220 10:56:11.690914 10632 process.cpp:2928] Resuming
> >>> help@192.168.65.76:34345 at 2019-12-20 10:56:11.690921984+00:00
> >>>
> >>> An strace confirms that the process receives EOF when reading from the
> >>> socket, but Scheduler::disconnected isn't called.
> >>> Is that expected?
> >>>
> >>> Or is it assumed that the scheduler relies on zookeeper for detection?
> >>>
> >>> Cheers,
> >>>
> >>> Charles
> >>>
> >>
>
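
For readers following the thread, the EOF condition discussed above (a read on the socket returning zero bytes once the peer has gone away) can be illustrated with a small standalone sketch. This is not Mesos or libprocess code: the helper name `read_or_detect_eof` and the use of a local socket pair in place of a real framework-to-master TCP connection are assumptions made purely for illustration.

```python
import socket


def read_or_detect_eof(sock: socket.socket, bufsize: int = 4096):
    """Hypothetical helper: return (data, disconnected).

    On a stream socket, recv() returning b"" means the peer closed the
    connection (EOF). A socket error is treated as a disconnect as well.
    """
    try:
        data = sock.recv(bufsize)
    except OSError:
        return b"", True
    return data, data == b""


def demo():
    # A connected local socket pair stands in for the framework <-> master link.
    framework_side, master_side = socket.socketpair()

    # While the "master" is alive, reads return data and no disconnect.
    master_side.sendall(b"offer")
    first = read_or_detect_eof(framework_side)

    # Closing the peer end plays the role of the master being killed.
    master_side.close()
    second = read_or_detect_eof(framework_side)

    framework_side.close()
    return first, second


if __name__ == "__main__":
    print(demo())
```

A driver that reacted to the second result (for example by invoking a disconnect callback) would notice the lost connection without relying on a separate detector. Note that EOF only appears when the peer's close actually reaches the socket; a firewall that silently drops packets produces no EOF at all, which is the kind of case heartbeats are meant to cover.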
