[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495858#comment-16495858 ]

Benno Evers commented on MESOS-7966:
------------------------------------

Ok, the trick to reproduce the issue was to add a framework that issues a 
dynamic reservation on the host scheduled for maintenance, which causes the 
master to send out inverse offers to that framework (in the original report, 
that framework was Marathon). With that, I've been able to reproduce the issue 
on the latest master.
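
For reference, here is a rough sketch (v0 scheduler API; the role and principal 
names are hypothetical) of the kind of framework action involved: dynamically 
reserving resources from an offer of the agent that has the maintenance window, 
which is what makes the master start sending that framework inverse offers for 
the agent.

{code}
// Sketch only: a v0-style framework dynamically reserving resources from an
// offer sent by the agent that is scheduled for maintenance. Role/principal
// names are made up for illustration.
#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

using namespace mesos;

// Intended to be called from a Scheduler's resourceOffers() callback with an
// offer coming from the agent in question.
void reserveFromOffer(SchedulerDriver* driver, const Offer& offer)
{
  Resource cpus;
  cpus.set_name("cpus");
  cpus.set_type(Value::SCALAR);
  cpus.mutable_scalar()->set_value(1.0);
  cpus.set_role("test-role");                                  // hypothetical
  cpus.mutable_reservation()->set_principal("test-principal"); // hypothetical

  Offer::Operation reserve;
  reserve.set_type(Offer::Operation::RESERVE);
  reserve.mutable_reserve()->add_resources()->CopyFrom(cpus);

  // Accept the offer with a single RESERVE operation; after this the master
  // has a reason to send this framework inverse offers for the agent.
  driver->acceptOffers({offer.id()}, {reserve});
}
{code}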

I'm pretty sure that what's going on is the following:

1) Request A adds a new maintenance window for agent S. Framework F has a 
reservation on S, so the allocator dispatches a call to the 
`Master::inverseOffer()` callback instructing the master to notify F about the 
pending maintenance.
2) Request B comes in, removing the maintenance window for S. This dispatches a 
call to `Master::updateUnavailability()` on the master actor. In this function, 
all outstanding inverse offers for S are removed from `slave->inverseOffers` 
and a call to the allocator is dispatched to remove the maintenance field for S.
3) The `Master::inverseOffer()` call is executed, sending a new inverse offer 
for S to framework F and saving this offer in the `slave->inverseOffers` structure.
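
The interleaving can be illustrated with a self-contained toy model (illustrative 
names only, not actual Mesos code): in-flight actor dispatches are modelled as 
queued std::functions, and draining the queue after request B mimics the stale 
`Master::inverseOffer()` call arriving late.

{code}
#include <deque>
#include <functional>
#include <iostream>
#include <optional>
#include <set>
#include <string>

int main() {
  std::set<std::string> masterInverseOffers;        // master: slave->inverseOffers for S
  std::optional<std::string> allocatorMaintenance;  // allocator: maintenance field for S
  std::deque<std::function<void()>> inFlight;       // dispatches not yet executed

  // 1) Request A: the allocator records the maintenance window for S and
  //    dispatches Master::inverseOffer() to notify framework F.
  allocatorMaintenance = "window-for-S";
  inFlight.push_back([&]() {
    masterInverseOffers.insert("inverse-offer-to-F");  // runs later, in step 3
  });

  // 2) Request B: Master::updateUnavailability() removes S's outstanding
  //    inverse offers and has the allocator clear S's maintenance field.
  masterInverseOffers.clear();
  allocatorMaintenance.reset();

  // 3) The stale dispatch from step 1 finally runs, re-adding an inverse
  //    offer that the allocator no longer knows anything about.
  while (!inFlight.empty()) {
    inFlight.front()();
    inFlight.pop_front();
  }

  std::cout << std::boolalpha
            << "outstanding inverse offers for S: " << masterInverseOffers.size() << '\n'
            << "allocator maintenance set for S:  " << allocatorMaintenance.has_value() << '\n';
  // Prints 1 and false: the inconsistent state described below, which later
  // trips CHECK(slaves[slaveId].maintenance.isSome()) in the allocator.
  return 0;
}
{code}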

And now we have sort of a "time bomb" in Mesos: S has an outstanding inverse 
offer, but in the allocator the maintenance structure for S is set to None. The 
master will crash the next time someone tries to update the maintenance 
schedule for S. This actually makes the logs a bit useless, because depending 
on the update frequency, that could be days or weeks after the inconsistency 
was introduced.
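
To make the failure mode concrete: the check in question is a glog-style 
`CHECK()`, so the whole master process aborts, and the only evidence in the log 
is the failed check itself, as in this standalone illustration (assumes glog, 
which Mesos uses; not the actual allocator code):

{code}
#include <glog/logging.h>

#include <optional>

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);

  // Stand-in for the allocator state left behind by the interleaving above:
  // an inverse offer is still outstanding, but the maintenance field is None.
  std::optional<int> maintenance;

  // Mirrors the failing check at hierarchical.cpp:872 -- this aborts the
  // process, and the fatal log line shows only the failed expression, not
  // the sequence of schedule updates that produced the state.
  CHECK(maintenance.has_value());

  return 0;
}
{code}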

Interestingly, Marathon could in principle trigger the crash immediately by 
either accepting or declining the stale inverse offer, but it is using the v0 
API, and the scheduler driver code currently does neither and silently ignores 
inverse offers.

As a fix, the most straightforward way would be to just remove the `CHECK()` in 
question, but it's probably best to take a day or two to think about whether 
this could have any unintended side effects.
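
For illustration, a rough standalone sketch (simplified state and made-up 
names, not the actual allocator code) of what guarding instead of CHECKing 
could look like: a stale inverse-offer update for an agent whose maintenance 
has already been cleared is simply ignored.

{code}
#include <iostream>
#include <map>
#include <optional>
#include <string>

struct SlaveState {
  std::optional<std::string> maintenance;  // unavailability window, if any
};

std::map<std::string, SlaveState> slaves;  // keyed by agent id (simplified)

void updateInverseOffer(const std::string& slaveId) {
  // Previously: CHECK(slaves[slaveId].maintenance.isSome());
  if (!slaves[slaveId].maintenance.has_value()) {
    // The maintenance window was removed while this inverse offer was still
    // in flight; drop the update instead of crashing the master.
    std::cerr << "Ignoring inverse offer update for agent " << slaveId
              << " with no scheduled maintenance" << std::endl;
    return;
  }

  // ... normal handling of the inverse offer status would go here ...
}

int main() {
  slaves["S"];              // agent S is known, but has no maintenance scheduled
  updateInverseOffer("S");  // no longer fatal
  return 0;
}
{code}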

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Assignee: Benno Evers
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



