[ https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949139#comment-16949139 ]
Yan Xu commented on MESOS-10011:
--------------------------------

[~greggomann] any thoughts on how this should be addressed?

> Operation feedback with stale agent ID crashes the master
> ---------------------------------------------------------
>
>                 Key: MESOS-10011
>                 URL: https://issues.apache.org/jira/browse/MESOS-10011
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.9.0
>            Reporter: Yan Xu
>            Priority: Critical
>
> We have observed the following in our environment.
> {noformat}
> F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
> *** Check failure stack trace: ***
> @ 0x7fd36ca9cf4d google::LogMessage::Fail()
> @ 0x7fd36ca9f13d google::LogMessage::SendToLog()
> @ 0x7fd36ca9ca87 google::LogMessage::Flush()
> @ 0x7fd36ca9fbc9 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fd36b5ae3bc mesos::internal::master::Master::removeOperation()
> @ 0x7fd36b5b3446 mesos::internal::master::Master::updateOperationStatus()
> {noformat}
> This follows registration of an agent that has changed its agent ID due to
> losing its local state.
> The check failure code is in
> [Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].
> The masters would enter a crash loop unless the operation checkpoint state
> (i.e., {{resources_and_operations.state}}) on the offending agent is deleted.
> Even though we try to minimize the cases where an agent would lose its
> state, it can still happen when the {{latest}} symlink is removed, either by
> an operator or automatically
> [in certain cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)