[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error

Matthew Mead-Briggs (JIRA) Tue, 29 May 2018 06:31:14 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493528#comment-16493528
 ]


Matthew Mead-Briggs commented on MESOS-7966:
--------------------------------------------

Thanks for taking a look at this, [~bennoe]. I'll have a read of the code and 
see if I can follow what you describe.

I think the logs I shared already contain those log lines, unless I've missed 
something. I've also dumped the unfiltered logs in a private channel on the 
Mesosphere Slack, if you'd prefer to filter them yourself.

Also, we are running 1.4.1, although I don't expect that makes much 
difference.

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to gracefully drain agents of 
> tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and it impacts us when downstream 
> frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
