[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762934#comment-16762934 ]
Benjamin Mahler commented on MESOS-9555:
----------------------------------------

For posterity, the check that's failing here in 1.5.0 is:
https://github.com/apache/mesos/blob/1.5.0/src/master/allocator/mesos/hierarchical.cpp#L2630

It seems the reservation tracking in the allocator is missing an entry for a particular role. After looking over newer versions of the code, I doubt this issue will be fixed by an upgrade.

I didn't find anything interesting in the logs for the two agents that were removed. [~fluxx], we'll likely need to add some additional logging to debug this further; would you be able to deploy a patched version of the masters?

> Check failed: reservationScalarQuantities.contains(role)
> --------------------------------------------------------
>
>                 Key: MESOS-9555
>                 URL: https://issues.apache.org/jira/browse/MESOS-9555
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>        Environment: * Mesos 1.5
> * {{DISTRIB_ID=Ubuntu}}
> * {{DISTRIB_RELEASE=16.04}}
> * {{DISTRIB_CODENAME=xenial}}
> * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>           Reporter: Jeff Pollard
>           Priority: Critical
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since
> then we have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role){code}
> The full stack trace is at the end of this issue description. When the master
> fails, we automatically restart it and it rejoins the cluster just fine. I
> did some initial searching and was unable to find any existing bug reports or
> other people experiencing this issue. We run a cluster of 3 masters and see
> crashes on all 3 instances.
> Right before the crash, we saw a {{Removed agent:...}} log line noting that
> agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 was the one removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 15:53:57.384759 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so I'm happy to provide more info from
> it, or anything else about our current environment.
> The full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown){code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)