We had this problem show up yesterday, just one time, that I don't understand. Would appreciate any help.
This is the sequence of events, as far as I can tell:

From framework's perspective:
F1: framework got an offer from a host that it decided it will not use, so it declines it
F2: got scheduler callback about offer being rescinded (I believe same host that I just declined; the host was terminated by a separate decom process)
F3: calling Mesos driver to kill a task shows driver status as DRIVER_ABORTED. However, there was no scheduler callback to reflect this. Wouldn't scheduler be told about driver being aborted via one of disconnected(), error(), other?

From Mesos Master perspective:
M1: failed to validate offer (must be in response to F1)
M2: deactivating framework

I am thinking that F1 was initiated by framework before that slave went down. But, the slave went down and offer rescinded in Mesos before F1 was received in Mesos master, which resulted in M1. Which should be OK, I'd imagine.

But, here are two things I can't understand:
1. Why was the framework deactivated? I looked in Mesos logs and only found the below lines of interest.
2. Why was the framework not notified about being deactivated, but using the driver shows status as DRIVER_ABORTED?
2.1 Are frameworks required to periodically check the status of driver via mechanisms other than the scheduler callback? If so, what are they?

As I said, this happened only once and likely is a race condition of sorts. I can't reproduce it. This sequence of events happens routinely but this error happened only once. It is nasty since then the framework just sits there with no offers and therefore no tasks get scheduled.

We're on Mesos 0.18.0 (if this is specifically addressed in 0.19 or 0.20, that'd be good to know).

I remember there was a reference to a problem caused when the created mesos driver gets GC'ed. However, our driver reference never goes out of scope.

I have the following relevant logs from framework and Mesos master. The timestamps in the logs are from the same clock (on the same machine).
From MantisMaster:
2014-09-08 20:08:46,263 WARN Thread-42 MesosSchedulerCallbackHandler - Declining offer from host 10.200.13.87 due to missing attribute value for EC2_AMI_ID - expecting [ami-5e6bc836] got [ami-28d47740]
2014-09-08 20:08:46,271 WARN Thread-58 MesosSchedulerCallbackHandler - Offer rescinded: offerID=20140908-195444-2298791946-7103-5698-5
.....
2014-09-08 20:11:31,322 INFO pool-27-thread-1 VirtualMachineMasterServiceMesosImpl - Calling mesos to kill outliers-5-worker-0-7
2014-09-08 20:11:31,322 INFO pool-27-thread-1 VirtualMachineMasterServiceMesosImpl - Kill status = DRIVER_ABORTED

From Mesos-Master:
W0908 20:08:46.277575 5791 master.cpp:1556] Failed to validate offer 20140908-195444-2298791946-7103-5698-5 : Offer 20140908-195444-2298791946-7103-5698-5 is no longer valid
I0908 20:08:46.277721 5791 master.cpp:1079] Deactivating framework MantisFramework
I0908 20:08:46.278017 5789 hierarchical_allocator_process.hpp:408] Deactivated framework MantisFramework
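
To make question 2.1 concrete, below is a rough sketch of the kind of framework-side check I have in mind if the callbacks can't be relied on: route every status value returned by a driver call (such as the kill in F3) through one guard, and raise an alert the first time it is anything other than DRIVER_RUNNING. The class name DriverStatusGuard and the alert handler are hypothetical, and the Status enum here is only a local stand-in that mirrors the value names of the Mesos driver status so the sketch compiles without the Mesos jars. This is an illustration of the idea, not a claim about how Mesos expects frameworks to behave.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical helper: not part of Mesos.
public class DriverStatusGuard {

    // Local stand-in mirroring the driver status names seen in the logs
    // (assumption: real code would use the Mesos-provided status type).
    public enum Status {
        DRIVER_NOT_STARTED, DRIVER_RUNNING, DRIVER_ABORTED, DRIVER_STOPPED
    }

    private final AtomicBoolean alerted = new AtomicBoolean(false);
    private final Consumer<Status> onDriverDown;

    // onDriverDown is invoked at most once, with the first non-running status seen.
    public DriverStatusGuard(Consumer<Status> onDriverDown) {
        this.onDriverDown = onDriverDown;
    }

    // Pass the status returned by any driver call through here.
    // Returns true only while the driver still reports DRIVER_RUNNING.
    public boolean check(Status status) {
        if (status == Status.DRIVER_RUNNING) {
            return true;
        }
        // compareAndSet ensures the handler fires once even with concurrent callers.
        if (alerted.compareAndSet(false, true)) {
            onDriverDown.accept(status);
        }
        return false;
    }
}
```

In our case the handler could page someone or restart the scheduler, so the framework does not just sit there with no offers. It would still only catch the problem on the next driver call, which is why I am asking whether there is a proper notification path.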