[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762934#comment-16762934 ]
Benjamin Mahler commented on MESOS-9555:
----------------------------------------

For posterity, the check that's failing here in 1.5.0 is:
https://github.com/apache/mesos/blob/1.5.0/src/master/allocator/mesos/hierarchical.cpp#L2630

It seems the reservation tracking in the allocator is missing an entry for a particular role. After looking over newer versions of the code, I doubt this issue will be fixed by an upgrade.

I didn't find anything interesting in the logs for the two agents that were removed. [~fluxx], we'll likely need to add some additional logging to debug this further; would you be able to deploy a patched version of the masters?

> Check failed: reservationScalarQuantities.contains(role)
> --------------------------------------------------------
>
>                 Key: MESOS-9555
>                 URL: https://issues.apache.org/jira/browse/MESOS-9555
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>        Environment: * Mesos 1.5
> * {{DISTRIB_ID=Ubuntu}}
> * {{DISTRIB_RELEASE=16.04}}
> * {{DISTRIB_CODENAME=xenial}}
> * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>           Reporter: Jeff Pollard
>           Priority: Critical
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since
> then we have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role){code}
> The full stack trace is at the end of this issue description. When the master
> fails, we automatically restart it and it rejoins the cluster just fine. I
> did some initial searching and was unable to find any existing bug reports or
> other people experiencing this issue. We run a cluster of 3 masters and see
> crashes on all 3 instances.
> Right before the crash, we saw a {{Removed agent:...}} log line noting that
> agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 was the one removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 15:53:57.384759 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so I'm happy to provide more info from
> it, or anything else about our current environment.
> The full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown){code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)