[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763082#comment-16763082 ]

Jeff Pollard commented on MESOS-9555:
-------------------------------------

[~bmahler] thanks for the patch. A different engineer here is going to do the 
1.5.2 upgrade today and then work on applying the patch. I'll report back here 
if the 1.5.2 upgrade alone fixes it (which sounds unlikely), or once we have 
some logging from a crash with the patched version.

> Check failed: reservationScalarQuantities.contains(role)
> --------------------------------------------------------
>
>                 Key: MESOS-9555
>                 URL: https://issues.apache.org/jira/browse/MESOS-9555
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>         Environment: * Mesos 1.5
>  * {{DISTRIB_ID=Ubuntu}}
>  * {{DISTRIB_RELEASE=16.04}}
>  * {{DISTRIB_CODENAME=xenial}}
>  * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>            Reporter: Jeff Pollard
>            Priority: Critical
>         Attachments: 
> 0001-Added-additional-logging-to-1.5.2-to-investigate-MES.patch
>
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since 
> then have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 
> hierarchical.cpp:2630] Check failed: 
> reservationScalarQuantities.contains(role){code}
> The full stack trace is at the end of this issue description. When the master 
> fails, we automatically restart it and it rejoins the cluster just fine. I 
> did some initial searching and was unable to find any existing bug reports or 
> other people experiencing this issue. We run a cluster of 3 masters and see 
> crashes on all 3 instances.
>
> Right before the crash, we saw a {{Removed agent:...}} log line noting that 
> it was agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 that was removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 
> 15:53:57.384759 8432 master.cpp:9893] Removed agent 
> 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 
> (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so I'm happy to provide more info from 
> it, or anything else about our current environment.
>
> The full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d 
> google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 
> google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 
> google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 
> google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 
> process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a 
> process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba 
> start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d 
> (unknown){code}
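
For context on what the failing check enforces, here is a minimal, self-contained 
sketch. This is not the actual Mesos source: the class name, method names, and 
the use of a plain double for CPU quantities are illustrative; only the 
{{reservationScalarQuantities}} name and the removeSlave -> untrackReservations 
path come from the trace above. It shows the kind of per-role bookkeeping the 
stack trace points at, where untracking a role assumes an entry still exists, 
which is the invariant the fatal CHECK asserts.
{code:cpp}
// Simplified illustration (hypothetical names): a per-role tally of reserved
// scalar quantities, and why untracking a role with no entry trips a fatal
// check like the one reported from hierarchical.cpp.
#include <cassert>
#include <map>
#include <string>

class ReservationTracker {
public:
  // Called when an agent with reservations for `role` is tracked.
  void track(const std::string& role, double cpus) {
    reservationScalarQuantities[role] += cpus;
  }

  // Called when an agent is removed (cf. removeSlave -> untrackReservations).
  void untrack(const std::string& role, double cpus) {
    // The invariant: a role being untracked must still have a tracked entry.
    // In the master this is CHECK(reservationScalarQuantities.contains(role)),
    // which aborts the process when violated.
    assert(reservationScalarQuantities.count(role) > 0);

    double& total = reservationScalarQuantities[role];
    total -= cpus;
    if (total <= 0) {
      reservationScalarQuantities.erase(role);  // drop empty entries
    }
  }

private:
  std::map<std::string, double> reservationScalarQuantities;
};

int main() {
  ReservationTracker tracker;
  tracker.track("role1", 2.0);
  tracker.untrack("role1", 2.0);  // fine: entry exists, then is erased

  // If the same reservations were untracked again (or never tracked at all),
  // the assertion would fire -- the analogue of the crash in this report.
  // tracker.untrack("role1", 2.0);  // would abort
  return 0;
}
{code}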


