Mesos Driver aborted silently?

Sharma Podila Tue, 09 Sep 2014 15:15:13 -0700

We hit a problem yesterday, just once, that I don't understand. I would
appreciate any help.


This is the sequence of events, as far as I can tell:

From the framework's perspective:
F1: the framework got an offer from a host that it decided not to use, so
it declined the offer
F2: it got a scheduler callback about an offer being rescinded (I believe
for the same host whose offer I had just declined; the host was terminated
by a separate decom process)
F3: calling the Mesos driver to kill a task returned driver status
DRIVER_ABORTED. However, there was no scheduler callback to reflect this.
Wouldn't the scheduler be told about the driver being aborted via one of
disconnected(), error(), or some other callback?

From the Mesos master's perspective:
M1: failed to validate offer (must be in response to F1)
M2: deactivating framework

I am thinking that F1 was initiated by the framework before that slave went
down. But the slave went down and the offer was rescinded before the Mesos
master received F1, which resulted in M1.

Which should be OK, I'd imagine. But, here are two things I can't
understand:

1. Why was the framework deactivated? I looked in Mesos logs and only found
the below lines of interest.

2. Why was the framework not notified about being deactivated, but using
the driver shows status as DRIVER_ABORTED?

  2.1 Are frameworks required to periodically check the status of the driver
via mechanisms other than the scheduler callbacks? If so, what are they?
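
To make 2.1 concrete: the only workaround I can think of is to inspect the
Status that every driver call returns (which is how we noticed
DRIVER_ABORTED in F3) and react when it is anything other than
DRIVER_RUNNING. Below is a minimal sketch of that idea, not something I know
to be the recommended approach. The Status enum is a local stand-in that
mirrors the values of org.apache.mesos.Protos.Status so the sketch compiles
without Mesos on the classpath; DriverStatusCheck, isUsable and checked are
hypothetical names of mine.

```java
import java.util.function.Supplier;

// Local stand-in mirroring org.apache.mesos.Protos.Status (assumption: used
// here only so the sketch is self-contained).
enum Status { DRIVER_NOT_STARTED, DRIVER_RUNNING, DRIVER_ABORTED, DRIVER_STOPPED }

class DriverStatusCheck {

    // A driver is only usable for scheduling while it reports DRIVER_RUNNING.
    static boolean isUsable(Status status) {
        return status == Status.DRIVER_RUNNING;
    }

    // Runs one driver call, and invokes onDriverDead if the returned status
    // shows the driver is no longer running. Returns the status unchanged.
    static Status checked(Supplier<Status> driverCall, Runnable onDriverDead) {
        Status status = driverCall.get();
        if (!isUsable(status)) {
            onDriverDead.run();
        }
        return status;
    }

    public static void main(String[] args) {
        // In the real framework the supplier would wrap something like
        // driver.killTask(taskId); here it is simulated with a constant.
        Status status = checked(
                () -> Status.DRIVER_ABORTED,
                () -> System.err.println("Driver is not running; alert and re-register or exit"));
        System.out.println("Kill status = " + status);
    }
}
```

With something like this, the framework would at least raise an alert
instead of sitting idle, but it only helps when the framework happens to make
a driver call, so it does not answer why no callback was delivered.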

As I said, this happened only once and is likely a race condition of sorts.
I can't reproduce it. This sequence of events happens routinely, but this
error happened only once. It is nasty because the framework then just sits
there with no offers, and therefore no tasks get scheduled.

We're on Mesos 0.18.0 (if this is specifically addressed in 0.19 or 0.20,
that'd be good to know).
I remember there was a reference to a problem caused when the created mesos
driver gets GC'ed. However, our driver reference never goes out of scope.

I have the following relevant logs from framework and Mesos master. The
timestamps in the logs are from the same clock (on the same machine).

From MantisMaster:
2014-09-08 20:08:46,263 WARN Thread-42 MesosSchedulerCallbackHandler -
Declining offer from host 10.200.13.87 due to missing attribute value for
EC2_AMI_ID - expecting [ami-5e6bc836] got [ami-28d47740]
2014-09-08 20:08:46,271 WARN Thread-58 MesosSchedulerCallbackHandler -
Offer rescinded: offerID=20140908-195444-2298791946-7103-5698-5
.....
2014-09-08 20:11:31,322 INFO pool-27-thread-1
VirtualMachineMasterServiceMesosImpl - Calling mesos to kill
outliers-5-worker-0-7
2014-09-08 20:11:31,322 INFO pool-27-thread-1
VirtualMachineMasterServiceMesosImpl - Kill status = DRIVER_ABORTED


From Mesos-Master:
W0908 20:08:46.277575  5791 master.cpp:1556] Failed to validate offer
20140908-195444-2298791946-7103-5698-5 : Offer
20140908-195444-2298791946-7103-5698-5 is no longer valid
I0908 20:08:46.277721  5791 master.cpp:1079] Deactivating framework
MantisFramework
I0908 20:08:46.278017  5789 hierarchical_allocator_process.hpp:408]
Deactivated framework MantisFramework
