[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470200#comment-16470200 ]
Matthew Mead-Briggs commented on MESOS-7966:
--------------------------------------------

I've recently started looking at this again and managed to gather some logs this time. I'll post a filtered version here that might be helpful, and then I'll share the full master logs privately (just in case they contain something sensitive).

Filtered master logs: https://gist.github.com/mattmb/d2bb103b162da75c4e25c2dc0eadad4e

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with
> the API. This happens relatively frequently, and impacts us when downstream
> frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're
> happy to provide any other logs you need - please let me know what would be
> useful for debugging.
> Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)