Andrei Sekretenko created MESOS-10109:
-----------------------------------------

             Summary: After failover, master crashes on re-adding an agent with 
maintenance schedule set.
                 Key: MESOS-10109
                 URL: https://issues.apache.org/jira/browse/MESOS-10109
             Project: Mesos
          Issue Type: Bug
          Components: allocation
    Affects Versions: 1.10.0
            Reporter: Andrei Sekretenko
            Assignee: Andrei Sekretenko


Stacktrace:
{noformat}
2020-04-03 08:34:58.007285 +0000 UTC F0403 08:34:58.003100  2717 
hierarchical.cpp:2461] Check failed: 'getFramework(frameworkId)' Must be SOME
2020-04-03 08:34:58.007563 +0000 UTC *** Check failure stack trace: ***
2020-04-03 08:34:58.007827 +0000 UTC I0403 08:34:58.003136  2713 
master.cpp:1721] Sending register ACK to: [email protected]:5051
2020-04-03 08:34:58.008064 +0000 UTC I0403 08:34:58.003142  2715 
master.cpp:9963] Adding framework b4fd9630-674e-4dea-b072-c3c48ccfdd42-0000 
(marathon) with roles {  } suppressed
2020-04-03 08:34:58.008305 +0000 UTC I0403 08:34:58.004185  2714 
master.cpp:7635] Ignoring update on agent 
b4fd9630-674e-4dea-b072-c3c48ccfdd42-S38 at slave(1)@172.16.6.89:5051 
(172.16.6.89) as it reports no changes
2020-04-03 08:34:58.008568 +0000 UTC @     0x7fb70eda72ad  
google::LogMessage::Fail()
2020-04-03 08:34:58.010292 +0000 UTC @     0x7fb70eda9508  
google::LogMessage::SendToLog()
2020-04-03 08:34:58.010583 +0000 UTC @     0x7fb70eda6e43  
google::LogMessage::Flush()
2020-04-03 08:34:58.012035 +0000 UTC @     0x7fb70eda9e49  
google::LogMessageFatal::~LogMessageFatal()
2020-04-03 08:34:58.013252 +0000 UTC @     0x7fb70d94748d  _check_not_none<>()
2020-04-03 08:34:58.014963 +0000 UTC @     0x7fb70d940f84  
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::generateInverseOffers()
2020-04-03 08:34:58.016681 +0000 UTC @     0x7fb70d9414a1  
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::_generateOffers()
2020-04-03 08:34:58.017498 +0000 UTC @     0x7fb70d94ee32  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingN5mesos8internal6master9allocator8internal28HierarchicalAllocatorProcessEEENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_vEEUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteIST_EES3_E_ISW_St12_PlaceholderILi1EEEEEEclEOS3_
2020-04-03 08:34:58.020673 +0000 UTC @     0x7fb70ecf34b1  
process::ProcessBase::consume()
2020-04-03 08:34:58.022404 +0000 UTC @     0x7fb70ed0812b  
process::ProcessManager::resume()
2020-04-03 08:34:58.023133 +0000 UTC @     0x7fb70ed0eb36  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
2020-04-03 08:34:58.023782 +0000 UTC @     0x7fb70a9772b0  (unknown)
2020-04-03 08:34:58.024105 +0000 UTC @     0x7fb70a195e65  start_thread
2020-04-03 08:34:58.024669 +0000 UTC @     0x7fb709ebe88d  __clone
{noformat}

The crash occurs immediately after an agent is re-added following a master
failover.

The issue was introduced by this patch:
https://reviews.apache.org/r/71428

which didn't account for the fact that `addSlave()` takes per-framework used 
resources as an argument, and that these can refer to frameworks that have not 
been added to the allocator yet: when the master re-registers an agent, it first 
calls `addSlave()` and only then `addFramework()` for the recovered frameworks.
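
To illustrate the ordering problem, here is a minimal toy model. It is not
Mesos code: `ToyAllocator`, `inverseOfferTargets()` and the CPU-only resource
map are hypothetical names invented for this sketch. It shows one way an
allocator could tolerate used resources that belong to a framework it has not
seen yet, by skipping that framework instead of check-failing. Whether the
actual fix takes this shape is not stated in this ticket.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: names echo the ticket but this is NOT the Mesos allocator.
struct ToyAllocator {
  // Frameworks registered so far via addFramework().
  std::set<std::string> frameworks;

  // agentId -> (frameworkId -> used CPUs), as handed to addSlave().
  std::map<std::string, std::map<std::string, double>> used;

  // May be called before addFramework() for the frameworks in usedByFramework,
  // which is the ordering described above for agent re-registration.
  void addSlave(const std::string& agentId,
                const std::map<std::string, double>& usedByFramework) {
    used[agentId] = usedByFramework;
  }

  void addFramework(const std::string& frameworkId) {
    frameworks.insert(frameworkId);
  }

  // Frameworks that would be sent an inverse offer for an agent under
  // maintenance. Frameworks that are not known yet are skipped rather than
  // treated as a fatal error.
  std::vector<std::string> inverseOfferTargets(
      const std::string& agentId) const {
    std::vector<std::string> targets;
    auto agent = used.find(agentId);
    if (agent == used.end()) {
      return targets;
    }
    for (const auto& entry : agent->second) {
      if (frameworks.count(entry.first) == 0) {
        continue;  // Recovered framework not added yet: skip for now.
      }
      targets.push_back(entry.first);
    }
    return targets;
  }
};
```

In this model, calling `addSlave("S38", {{"fw-0000", 2.0}})` and then
`inverseOfferTargets("S38")` before `addFramework("fw-0000")` yields an empty
list; after `addFramework()` the framework shows up as a target.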



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
