[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493528#comment-16493528 ]
Matthew Mead-Briggs commented on MESOS-7966:
--------------------------------------------

Thanks for taking a look at this [~bennoe], I'll have a read of the code and see if I can follow what you describe. I think the logs I shared already contain those log lines, unless I've missed something? I've also dumped the unfiltered logs in a private channel on the Mesosphere Slack if you prefer to filter them yourself.

Also, we are running 1.4.1, although I don't expect that makes a lot of difference.

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with
> the API. This happens relatively frequently, and impacts us when downstream
> frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed:
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're
> happy to provide any other logs you need - please let me know what would be
> useful for debugging.
> Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)