Perfect, thanks!
On Mon, 30 Dec 2019, 13:42 Vinod Kone, <vinodk...@gmail.com> wrote:

> In the latest versions of Mesos that is handled via heartbeats.
>
> Thanks,
> Vinod
>
> On Dec 30, 2019, at 4:37 AM, Charles-François Natali <cf.nat...@gmail.com> wrote:
> >
> > Thanks.
> >
> > That's what I thought. The problem though is that it is probably possible
> > that the ZooKeeper detector doesn't detect the failure while the connection
> > to the master fails. One way this could happen would be, for example, because
> > of a firewall causing the TCP connection from the framework to the master
> > to fail, while the ZooKeeper connections (from master to zk and framework
> > to zk) still work. Unlikely but possible, I think. Having the driver detect
> > and fail upon EOF/socket error would guard against that.
> >
> >> On Thu, 26 Dec 2019, 18:07 Vinod Kone, <vinodk...@apache.org> wrote:
> >>
> >> IIRC, the standalone master detector (the detector that's used when using a
> >> local IP address of the master and not zk) doesn't re-detect when the master
> >> process restarts. It's a limitation of that detector since it's mainly used
> >> for testing purposes and not recommended for production use. For
> >> production, please use the ZooKeeper master detector (this detector is used
> >> when using ZooKeeper).
> >>
> >> On Fri, Dec 20, 2019 at 5:11 AM Charles-François Natali <cf.nat...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> It seems that the C++ scheduler driver doesn't detect loss of the
> >>> connection to the master when not using ZooKeeper.
> >>>
> >>> A simple way to reproduce this is to start a server passing it e.g.
> >>> "--ip=127.0.0.1", start the scheduler driver passing it "127.0.0.1:5050",
> >>> and then send a SIGKILL to the master. The scheduler logs the following:
> >>>
> >>> I1220 10:56:11.679347 10635 process.cpp:2928] Resuming __reaper__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.679366144+00:00
> >>> I1220 10:56:11.679392 10635 clock.cpp:279] Created a timer for __reaper__(1)@192.168.65.76:34345 in 100ms in the future (2019-12-20 10:56:11.779389952+00:00)
> >>> I1220 10:56:11.690646 10631 process.cpp:2928] Resuming scheduler-6a93a8e3-5a8f-4195-bde2-718b5832d317@192.168.65.76:34345 at 2019-12-20 10:56:11.690665984+00:00
> >>> I1220 10:56:11.690775 10632 process.cpp:2928] Resuming __http__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.690784000+00:00
> >>> I1220 10:56:11.690806 10632 process.cpp:3088] Cleaning up __http__(1)@192.168.65.76:34345
> >>> I1220 10:56:11.690914 10632 process.cpp:2928] Resuming help@192.168.65.76:34345 at 2019-12-20 10:56:11.690921984+00:00
> >>>
> >>> An strace confirms that the process receives EOF when reading from the
> >>> socket, but Scheduler::disconnected isn't called.
> >>> Is that expected?
> >>>
> >>> Or is it assumed that the scheduler relies on ZooKeeper for detection?
> >>>
> >>> Cheers,
> >>>
> >>> Charles
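
For readers of the archive: the heartbeat approach mentioned at the top of the thread can be pictured as a timeout on the framework side. The following is only a minimal sketch of that idea and is not Mesos code. The class name HeartbeatWatchdog, its methods, and the "missed heartbeats allowed" parameter are all hypothetical and chosen for illustration. The caller passes the current time in explicitly so the logic is easy to follow.

```cpp
#include <chrono>

// Hypothetical helper, NOT part of Mesos: remembers when the last heartbeat
// arrived and reports the connection as lost once too much time has passed.
class HeartbeatWatchdog {
public:
  using Clock = std::chrono::steady_clock;

  // interval: expected time between heartbeats (assumed to be known).
  // missedAllowed: how many intervals may elapse before giving up.
  HeartbeatWatchdog(std::chrono::milliseconds interval,
                    int missedAllowed,
                    Clock::time_point now)
    : timeout_(interval * missedAllowed), last_(now) {}

  // Call whenever a heartbeat (or any other event) is received.
  void heartbeat(Clock::time_point now) { last_ = now; }

  // True once no heartbeat has been seen for longer than the timeout.
  bool expired(Clock::time_point now) const {
    return now - last_ > timeout_;
  }

private:
  std::chrono::milliseconds timeout_;
  Clock::time_point last_;
};
```

A scheduler using something like this would call heartbeat() on every event received from the master and check expired() periodically, treating an expiry as a disconnection. That would cover the firewall case described above, where the TCP connection to the master silently stops working while the ZooKeeper sessions stay healthy.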