[ 
https://issues.apache.org/jira/browse/MESOS-10209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263011#comment-17263011
 ] 

Ilya commented on MESOS-10209:
------------------------------

Review request: https://reviews.apache.org/r/73131/

> Agent reregistration and marking race
> -------------------------------------
>
>                 Key: MESOS-10209
>                 URL: https://issues.apache.org/jira/browse/MESOS-10209
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.11.0
>            Reporter: Ilya
>            Assignee: Ilya
>            Priority: Major
>
> After a master failover, if an agent attempts to reregister while it is being 
> marked as unreachable, and the reregistration finishes before the 
> {{MarkUnreachable}} operation is complete, the assertion in 
> {{Master::\_markUnreachable()}} [1] that the agent is in the {{recovered}} 
> set fails. When readmitting the agent, the master removes it from the 
> {{recovered}} set in {{Master::\_\_reregisterSlave()}} [2]. If 
> {{\_\_reregisterSlave()}} is executed before {{\_markUnreachable()}}, the 
> agent is no longer in the set and the assertion fails.
> Example:
> {noformat}
> I1215 02:10:02.657672 498611 master.cpp:2170] Elected as the leading master!
> I1215 02:10:08.415233 498563 master.cpp:1819] Recovered ??? agents from the 
> registry (???B); allowing 10mins for agents to reregister
> I1215 02:20:08.128789 498569 master.cpp:2037] Scheduling removal of agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50); did 
> not reregister within 10mins after master failover
> I1215 02:20:16.480931 498596 master.cpp:9469] Marking agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) 
> unreachable: did not reregister within 10mins after master failover
> I1215 02:20:16.864944 498560 master.cpp:7439] Received reregister agent 
> message from agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at 
> slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50)
> I1215 02:20:16.865509 498560 master.cpp:7980] Re-registered agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 
> (meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000; 
> ports:[31000-32000]
> I1215 02:20:16.869235 498553 master.cpp:8370] Received update of agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 
> (meta-slave-test-3-82-50) with total oversubscribed resources {}
> I1215 02:20:16.869263 498553 master.cpp:8487] Ignoring update on agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 
> (meta-slave-test-3-82-50) as it reports no changes
> I1215 02:20:16.869755 498605 hierarchical.cpp:854] Added agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) with 
> cpus:64; mem:32000; disk:320000; ports:[31000-32000] (allocated: {})
> I1215 02:20:22.541494 498591 master.cpp:9512] Marked agent 
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) 
> unreachable: did not reregister within 10mins after master failover
> F1215 02:20:22.541508 498591 master.cpp:9523] Check failed: 
> slaves.recovered.contains(slave.id())
> *** Check failure stack trace: ***
>     @     0x7fcda8a90fdd  google::LogMessage::Fail()
>     @     0x7fcda8a93263  google::LogMessage::SendToLog()
>     @     0x7fcda8a90b59  google::LogMessage::Flush()
>     @     0x7fcda8a93c69  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fcda75d05d8  mesos::internal::master::Master::_markUnreachable()
>     @     0x7fcda75d083d  (unknown)
>     @     0x7fcda72b0f93  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
>     @     0x7fcda89f68f1  process::ProcessBase::consume()
>     @     0x7fcda8a0f09b  process::ProcessManager::resume()
>     @     0x7fcda8a15986  (unknown)
>     @     0x7fcda45ce070  (unknown)
>     @     0x7fcda4c33ea5  start_thread
>     @     0x7fcda3d318dd  __clone
> {noformat}
> [1] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L8698
> [2] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L7110
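
The interleaving described in the issue can be modelled in a few lines of C++. This is only a minimal sketch under stated assumptions, not the Mesos code and not the change proposed in the review request: `MiniMaster`, `reregister()`, `onMarkUnreachableDone()` and the two helper functions are hypothetical names, the agent ID "S1" is made up, and the real `CHECK` is replaced by a boolean return value so that the failing ordering can be observed without aborting the process.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical, heavily simplified stand-in for the master's agent bookkeeping.
struct MiniMaster {
  std::unordered_set<std::string> recovered;    // known from the registry, not yet reregistered
  std::unordered_set<std::string> registered;   // readmitted agents
  std::unordered_set<std::string> unreachable;  // agents marked unreachable

  // Models the step in __reregisterSlave() that readmits the agent
  // and removes it from the recovered set.
  void reregister(const std::string& id) {
    recovered.erase(id);
    registered.insert(id);
  }

  // Models the continuation that runs once the mark-unreachable operation
  // has completed. Returns false at the point where the real code asserts
  // that the agent is still in the recovered set.
  bool onMarkUnreachableDone(const std::string& id) {
    if (recovered.count(id) == 0) {
      return false;  // the agent reregistered in the meantime
    }
    recovered.erase(id);  // assumption of this sketch, not taken from the issue
    unreachable.insert(id);
    return true;
  }
};

// No reregistration happens while the operation is in flight.
bool checkHoldsWithoutReregistration() {
  MiniMaster master;
  master.recovered.insert("S1");
  return master.onMarkUnreachableDone("S1");
}

// Reregistration completes first, then the mark-unreachable continuation runs.
bool checkHoldsWhenReregistrationWinsRace() {
  MiniMaster master;
  master.recovered.insert("S1");
  master.reregister("S1");
  return master.onMarkUnreachableDone("S1");
}
```

In this model `checkHoldsWithoutReregistration()` returns true and `checkHoldsWhenReregistrationWinsRace()` returns false, which corresponds to the ordering seen in the log above: "Re-registered agent" at 02:20:16, followed by "Marked agent ... unreachable" and the failed check at 02:20:22.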



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
