[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762048#comment-16762048 ]

Jeff Pollard commented on MESOS-9555:
-------------------------------------

[~bmahler], that is correct: we are running 1.5.0, specifically the 
{{mesos_1.5.0-2.0.1.ubuntu1604_amd64.deb}} package.

We do plan to upgrade our cluster in the near term, so we can give 1.5.x a 
shot. Could you link to the previous issue that was fixed? I'd be curious to 
read it and see whether anything in it seems related to our environment.

I'll hold off on attaching the log until then.

> Check failed: reservationScalarQuantities.contains(role)
> --------------------------------------------------------
>
>                 Key: MESOS-9555
>                 URL: https://issues.apache.org/jira/browse/MESOS-9555
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>         Environment: * Mesos 1.5
>  * {{DISTRIB_ID=Ubuntu}}
>  * {{DISTRIB_RELEASE=16.04}}
>  * {{DISTRIB_CODENAME=xenial}}
>  * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>            Reporter: Jeff Pollard
>            Priority: Critical
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since 
> then have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 
> hierarchical.cpp:2630] Check failed: 
> reservationScalarQuantities.contains(role){code}
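> For context on why this takes the master down: glog's {{CHECK}} is fatal
> when its condition is false. Below is a minimal sketch of the pattern,
> assuming everything beyond {{reservationScalarQuantities}} and
> {{untrackReservations()}} (only those two names appear in the trace); it
> is not the actual Mesos source.
> {code:java}
> #include <glog/logging.h>
>
> #include <string>
> #include <unordered_map>
>
> // Hypothetical stand-in for the allocator's per-role bookkeeping of
> // reserved scalar resource quantities (Mesos uses its own hashmap type).
> std::unordered_map<std::string, double> reservationScalarQuantities;
>
> void untrackReservations(const std::string& role)
> {
>   // Fatal if `role` was never tracked, or was already untracked: glog
>   // logs "Check failed: reservationScalarQuantities.contains(role)" and
>   // aborts the process. (std::unordered_map::contains() needs C++20;
>   // Mesos' own hashmap provides contains() directly.)
>   CHECK(reservationScalarQuantities.contains(role));
>   reservationScalarQuantities.erase(role);
> }
> {code}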
> The full stack trace is at the end of this issue description. When the master 
> fails, we automatically restart it and it rejoins the cluster just fine. I 
> did some initial searching and was unable to find any existing bug reports or 
> other people experiencing this issue. We run a cluster of 3 masters, and see 
> crashes on all 3 instances.
>
> Right before the crash, we saw a {{Removed agent:...}} log line noting that 
> agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 was removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 
> 15:53:57.384759 8432 master.cpp:9893] Removed agent 
> 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 
> (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so I'm happy to provide more info from 
> it, or anything else about our current environment.
>
> The full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d 
> google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 
> google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 
> google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 
> google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 
> process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a 
> process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba 
> start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d 
> (unknown){code}
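> To make the failure mode concrete, here is a hypothetical, self-contained
> reproduction of the shape of that trace ({{removeSlave()}} untracking a
> role the allocator is no longer tracking). It is purely illustrative of
> how the check can fire, not a claim about the actual root cause; compile
> with {{-std=c++20}} and link against glog:
> {code:java}
> #include <glog/logging.h>
>
> #include <string>
> #include <unordered_map>
> #include <vector>
>
> std::unordered_map<std::string, double> reservationScalarQuantities;
>
> void untrackReservations(const std::string& role)
> {
>   CHECK(reservationScalarQuantities.contains(role));  // fatal if absent
>   reservationScalarQuantities.erase(role);
> }
>
> // Assumed shape of the removeSlave() frame in the trace: untrack every
> // role for which the departing agent held reservations.
> void removeSlave(const std::vector<std::string>& agentReservedRoles)
> {
>   for (const std::string& role : agentReservedRoles) {
>     untrackReservations(role);
>   }
> }
>
> int main()
> {
>   reservationScalarQuantities["role-a"] = 2.0;  // tracked when agent added
>
>   removeSlave({"role-a"});  // first removal succeeds
>   removeSlave({"role-a"});  // a second removal (or a skipped re-track on
>                             // re-registration) aborts with the same
>                             // "Check failed" line as in the log above
> }
> {code}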


