My guess is that your driver threw an exception while handling the
offerRescinded() callback which was detected by the JNI binding (IIRC
Mantis is a JVM framework?) causing it to abort the driver. Note that when
a driver aborts, it will send a DeactivateFrameworkMessage to the master
causing the master to deactivate the framework (but still keep its tasks
alive until the framework failover timeout).
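
If that guess is right, one defensive option is to keep exceptions from
escaping the scheduler callbacks at all. Below is a minimal sketch in Java,
assuming a hypothetical helper called CallbackGuard (it is not part of the
Mesos API) that each callback delegates its body to:

```java
import java.util.function.Consumer;

// Hypothetical helper, not part of the Mesos API. It runs a callback body and
// hands any RuntimeException to a handler instead of letting it propagate to
// the caller (in a real scheduler the caller would be the JNI binding).
final class CallbackGuard {
    private final Consumer<Throwable> onFailure;

    CallbackGuard(Consumer<Throwable> onFailure) {
        this.onFailure = onFailure;
    }

    // Returns true if the body completed normally, false if it threw.
    boolean run(Runnable body) {
        try {
            body.run();
            return true;
        } catch (RuntimeException e) {
            onFailure.accept(e);
            return false;
        }
    }
}
```

In a scheduler, the body of offerRescinded() would be passed to run(), with
the handler logging the failure or starting a controlled shutdown, so the
scheduler learns about the problem directly.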

Having said that, your point regarding the scheduler not being able to
detect that "the driver is aborted" until it makes *another* driver call is
true. The driver doesn't call the error() callback when aborted for a
couple of reasons: 1) abort() can be called by the scheduler itself, so it
doesn't make much sense to send an error() callback, and 2) if abort() is
caused by a JVM exception, the scheduler probably already knows about it (I'm
guessing this wasn't the case for Mantis?). Perhaps these semantics are
worth reconsidering.
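
Until then, a framework can at least react as soon as a driver call reports
a non-running status, rather than only logging it. The sketch below is one
way to do that, under stated assumptions: DriverCallMonitor is a
hypothetical wrapper, and its Status enum is a local stand-in that mirrors
the driver status names (such as DRIVER_ABORTED) so the example compiles
without the Mesos jar. It does not remove the limitation above, since the
abort is still only noticed on the next driver call.

```java
import java.util.function.Supplier;

// Hypothetical wrapper, not part of the Mesos API.
final class DriverCallMonitor {
    // Local stand-in for the driver status values returned by driver calls.
    enum Status { DRIVER_NOT_STARTED, DRIVER_RUNNING, DRIVER_ABORTED, DRIVER_STOPPED }

    private final Runnable onNotRunning;

    DriverCallMonitor(Runnable onNotRunning) {
        this.onNotRunning = onNotRunning;
    }

    // Executes a driver call and fires the handler when the returned status
    // shows that the driver is no longer running.
    Status call(Supplier<Status> driverCall) {
        Status status = driverCall.get();
        if (status != Status.DRIVER_RUNNING) {
            onNotRunning.run();
        }
        return status;
    }
}
```

Here the supplier would wrap a real call such as the kill request from F3,
and the handler might raise an alert or restart the driver.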

On Tue, Sep 9, 2014 at 3:14 PM, Sharma Podila <spod...@netflix.com> wrote:

> We had this problem show up yesterday, just one time, that I don't
> understand. Would appreciate any help.
>
> This is the sequence of events, as far as I can tell:
>
> From framework's perspective:
> F1: framework got an offer from a host that it decided it will not use, so
> it declines it
> F2: got scheduler call back about offer being rescinded (I believe same
> host that I just declined; the host was terminated by a separate decom
> process)
> F3: calling Mesos driver to kill a task shows driver status as
> DRIVER_ABORTED. However, there was no scheduler callback to reflect this.
> Wouldn't scheduler be told about driver being aborted via one of
> disconnected(), error(), other?
>
> From Mesos Master perspective:
> M1: failed to validate offer (must be in response to F1)
> M2: deactivating framework
>
> I am thinking that F1 was initiated by framework before that slave went
> down. But, the slave went down and offer rescinded in Mesos before F1 was
> received in Mesos master, which resulted in M1.
>
> Which should be OK, I'd imagine. But, here are two things I can't
> understand:
>
> 1. Why was the framework deactivated? I looked in Mesos logs and only
> found the below lines of interest.
>
> 2. Why was the framework not notified about being deactivated, but using
> the driver shows status as DRIVER_ABORTED?
>
>   2.1 Are frameworks required to periodically check the status of driver
> via mechanisms other than the scheduler callback? If so, what are they?
>
> As I said, this happened only once and likely is a race condition of
> sorts. I can't reproduce it. This sequence of events happens routinely but
> this error happened only once. It is nasty since then the framework just
> sits there with no offers and therefore no tasks get scheduled.
>
> We're on Mesos 0.18.0 (if this is specifically addressed in 0.19 or 0.20,
> that'd be good to know).
> I remember there was a reference to a problem caused when the created
> mesos driver gets GC'ed. However, our driver reference never goes out of
> scope.
>
> I have the following relevant logs from framework and Mesos master. The
> timestamps in the logs are from the same clock (on the same machine).
>
> From MantisMaster:
> 2014-09-08 20:08:46,263 WARN Thread-42 MesosSchedulerCallbackHandler -
> Declining offer from host 10.200.13.87 due to missing attribute value for
> EC2_AMI_ID - expecting [ami-5e6bc836] got [ami-28d47740]
> 2014-09-08 20:08:46,271 WARN Thread-58 MesosSchedulerCallbackHandler -
> Offer rescinded: offerID=20140908-195444-2298791946-7103-5698-5
> .....
> 2014-09-08 20:11:31,322 INFO pool-27-thread-1
> VirtualMachineMasterServiceMesosImpl - Calling mesos to kill
> outliers-5-worker-0-7
> 2014-09-08 20:11:31,322 INFO pool-27-thread-1
> VirtualMachineMasterServiceMesosImpl - Kill status = DRIVER_ABORTED
>
>
> From Mesos-Master:
> W0908 20:08:46.277575  5791 master.cpp:1556] Failed to validate offer
> 20140908-195444-2298791946-7103-5698-5 : Offer
> 20140908-195444-2298791946-7103-5698-5 is no longer valid
> I0908 20:08:46.277721  5791 master.cpp:1079] Deactivating framework
> MantisFramework
> I0908 20:08:46.278017  5789 hierarchical_allocator_process.hpp:408]
> Deactivated framework MantisFramework
>
>
