[
https://issues.apache.org/jira/browse/MESOS-10209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263011#comment-17263011
]
Ilya commented on MESOS-10209:
------------------------------
Review request: https://reviews.apache.org/r/73131/
> Agent reregistration and marking race
> -------------------------------------
>
> Key: MESOS-10209
> URL: https://issues.apache.org/jira/browse/MESOS-10209
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.11.0
> Reporter: Ilya
> Assignee: Ilya
> Priority: Major
>
> After a master failover, if an agent attempts to reregister while it is being
> marked unreachable, and reregistration finishes before the
> {{MarkUnreachable}} operation completes, the assertion in
> {{Master::_markUnreachable()}} [1] that the agent is in the {{recovered}} set
> fails. When readmitting the agent, the master removes it from the
> {{recovered}} set in {{Master::__reregisterSlave()}} [2]. If
> {{__reregisterSlave()}} runs before {{_markUnreachable()}}, the agent is no
> longer in {{recovered}} when the assertion is checked, and the master aborts.
> Example:
> {noformat}
> I1215 02:10:02.657672 498611 master.cpp:2170] Elected as the leading master!
> I1215 02:10:08.415233 498563 master.cpp:1819] Recovered ??? agents from the
> registry (???B); allowing 10mins for agents to reregister
> I1215 02:20:08.128789 498569 master.cpp:2037] Scheduling removal of agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50); did
> not reregister within 10mins after master failover
> I1215 02:20:16.480931 498596 master.cpp:9469] Marking agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50)
> unreachable: did not reregister within 10mins after master failover
> I1215 02:20:16.864944 498560 master.cpp:7439] Received reregister agent
> message from agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at
> slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50)
> I1215 02:20:16.865509 498560 master.cpp:7980] Re-registered agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478
> (meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000;
> ports:[31000-32000]
> I1215 02:20:16.869235 498553 master.cpp:8370] Received update of agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478
> (meta-slave-test-3-82-50) with total oversubscribed resources {}
> I1215 02:20:16.869263 498553 master.cpp:8487] Ignoring update on agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478
> (meta-slave-test-3-82-50) as it reports no changes
> I1215 02:20:16.869755 498605 hierarchical.cpp:854] Added agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) with
> cpus:64; mem:32000; disk:320000; ports:[31000-32000] (allocated: {})
> I1215 02:20:22.541494 498591 master.cpp:9512] Marked agent
> 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50)
> unreachable: did not reregister within 10mins after master failover
> F1215 02:20:22.541508 498591 master.cpp:9523] Check failed:
> slaves.recovered.contains(slave.id())
> *** Check failure stack trace: ***
> @ 0x7fcda8a90fdd google::LogMessage::Fail()
> @ 0x7fcda8a93263 google::LogMessage::SendToLog()
> @ 0x7fcda8a90b59 google::LogMessage::Flush()
> @ 0x7fcda8a93c69 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fcda75d05d8 mesos::internal::master::Master::_markUnreachable()
> @ 0x7fcda75d083d (unknown)
> @ 0x7fcda72b0f93
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7fcda89f68f1 process::ProcessBase::consume()
> @ 0x7fcda8a0f09b process::ProcessManager::resume()
> @ 0x7fcda8a15986 (unknown)
> @ 0x7fcda45ce070 (unknown)
> @ 0x7fcda4c33ea5 start_thread
> @ 0x7fcda3d318dd __clone
> {noformat}
> [1] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L8698
> [2] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L7110
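
The ordering problem described above can be modelled with a small,
self-contained sketch. This is not the Mesos code and not necessarily the
approach taken in the review request: MasterModel, reregister(),
markUnreachableContinuation() and simulateRace() are hypothetical names
invented for illustration, and replacing the fatal check with an early return
is only one possible way to tolerate the race.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, heavily simplified stand-in for the master's agent
// bookkeeping. Only the sets relevant to the race are modelled.
struct MasterModel {
  std::set<std::string> recovered;    // agents awaiting reregistration
  std::set<std::string> registered;   // agents that have been readmitted
  std::set<std::string> unreachable;  // agents marked unreachable

  // Models the readmission step: the agent leaves `recovered`.
  void reregister(const std::string& id) {
    recovered.erase(id);
    registered.insert(id);
  }

  // Models the continuation that runs once the mark-unreachable operation
  // has completed. The original code asserts membership in `recovered`;
  // this sketch instead returns false when the agent has already
  // reregistered (an assumption, not the confirmed fix).
  bool markUnreachableContinuation(const std::string& id) {
    if (recovered.count(id) == 0) {
      return false;  // Agent was readmitted in the meantime; nothing to do.
    }
    recovered.erase(id);
    unreachable.insert(id);
    return true;
  }
};

// Replays the interleaving from the log: marking starts, the agent
// reregisters, and only then does the marking continuation run.
// Returns true if the continuation marked the agent unreachable.
bool simulateRace(MasterModel& master, const std::string& id) {
  master.recovered.insert(id);
  assert(master.recovered.count(id) == 1);

  master.reregister(id);                          // runs first
  return master.markUnreachableContinuation(id);  // runs second
}
```

With this guard, simulateRace() returns false and the agent stays in
`registered`, whereas an unconditional membership check at the same point
would abort the process, as in the stack trace above.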
--
This message was sent by Atlassian Jira
(v8.3.4#803005)