No problem. Glad you figured it out.

@vinodkone

> On Jan 23, 2017, at 8:38 AM, Vova Shelgunov <vvs...@gmail.com> wrote:
>
> Yes, it works. Sorry for the trouble; the first time I looked at the logs
> I did not notice that failover_timeout is zero.
>
> 2017-01-23 19:27 GMT+03:00 Vova Shelgunov <vvs...@gmail.com>:
>> Logs from mesos master:
>>
>> 0123 15:53:44.523613 7 http.cpp:391] HTTP POST for
>> /master/api/v1/scheduler from 172.18.0.1:58864 with User-Agent='AHC/2.0'
>> I0123 15:53:44.524159 7 master.cpp:4827] Processing ACKNOWLEDGE call
>> ac9a6e5e-67b3-490a-930f-0024eab734b4 for task 10336 of framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) on agent
>> 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0
>> I0123 15:53:44.524849 7 master.cpp:7744] Removing task 10336 with
>> resources cpus(*):0.1; mem(*):32 of framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 on agent
>> 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0 at slave(1)@172.18.0.3:5051
>> (mesos-slave)
>> I0123 15:53:44.529033 7 master.cpp:1297] Framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) disconnected
>> I0123 15:53:44.529636 7 master.cpp:2902] Disconnecting framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>> I0123 15:53:44.529974 7 master.cpp:2926] Deactivating framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>> I0123 15:53:44.530299 7 master.cpp:1310] Giving framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) 0ns to
>> failover
>> I0123 15:53:44.530594 7 hierarchical.cpp:386] Deactivated framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
>> I0123 15:53:44.531962 7 master.cpp:6369] Framework failover timeout,
>> removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP
>> Framework)
>> I0123 15:53:44.534992 7 master.cpp:7103] Removing framework
>> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>>
>> It seems the failover timeout is set to zero for the framework.
>>
>> It may be a coding error on my side that shows up when the framework loses
>> its connection to the master multiple times (I see that I do not pass the
>> failover_timeout value during reconnection).
>> I will check whether passing it solves my issue.
>>
>> Thanks
>>
>> 2017-01-23 19:05 GMT+03:00 Vova Shelgunov <vvs...@gmail.com>:
>>> Hi,
>>>
>>> I have run into a very strange situation with my framework, which talks
>>> to the mesos master via the Scheduler HTTP API:
>>>
>>> Sometimes my framework stops receiving heartbeats and task updates
>>> from the master.
>>> I read the Network partitions section of the mesos documentation
>>> (http://mesos.apache.org/documentation/latest/scheduler-http-api/) and I
>>> see that if a framework does not receive heartbeats within some time it
>>> should reconnect to the master.
>>>
>>> I have written a heartbeat monitor that reconnects if no heartbeats
>>> arrived in the last n seconds, but after the reconnection I always
>>> receive an ERROR from the mesos master saying that my framework has been
>>> removed.
>>>
>>> Why is this happening?
>>>
>>> Regards,
>>> Uladzimir
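
For anyone hitting the same symptom, here is a minimal Python sketch of the fix discussed in this thread: send a non-zero failover_timeout on every SUBSCRIBE, including re-subscriptions, together with the framework ID the master assigned earlier. The helper names (build_subscribe_call, heartbeat_is_stale), the user name, the 3600-second timeout, and the "5 missed heartbeats" threshold are assumptions chosen for illustration, not values from the thread. The JSON layout follows the v1 Scheduler HTTP API as I understand it, so please check it against the documentation for your Mesos version.

```python
import json
import time


def build_subscribe_call(user, name, failover_timeout_secs, framework_id=None):
    """Build the JSON body of a SUBSCRIBE call (hypothetical helper).

    failover_timeout is sent on every call, so a re-subscription does not
    silently fall back to the default of zero.
    """
    framework_info = {
        "user": user,
        "name": name,
        "failover_timeout": float(failover_timeout_secs),
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}
    if framework_id is not None:
        # When re-subscribing, the previously assigned ID goes both at the
        # top level and inside framework_info.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


def heartbeat_is_stale(last_heartbeat, interval_secs, now=None, missed_allowed=5):
    """Return True when no heartbeat arrived for several intervals.

    interval_secs would come from heartbeat_interval_seconds in the
    SUBSCRIBED event; missed_allowed is an arbitrary illustrative threshold.
    """
    now = time.monotonic() if now is None else now
    return (now - last_heartbeat) > interval_secs * missed_allowed


if __name__ == "__main__":
    # Re-subscription body for the framework ID seen in the logs above.
    body = build_subscribe_call(
        "root",
        "Test HTTP Framework",
        3600,
        framework_id="3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005",
    )
    print(json.dumps(body, indent=2))
```

The body would be POSTed to /api/v1/scheduler on the master whenever heartbeat_is_stale reports True. With a non-zero timeout the master should keep the framework registered for that long after a disconnect instead of giving it "0ns to failover" as in the logs above.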