[
https://issues.apache.org/jira/browse/MESOS-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957497#comment-16957497
]
Meng Zhu commented on MESOS-10014:
----------------------------------
Hmm, the following log message looks problematic:
{noformat}
I1018 09:05:14.228754 21394 hierarchical.cpp:955] Added agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) with
cpus:2; mem:1024; disk:1024; ports:[31000-32000] (offered or allocated: {})
I1018 09:05:14.229159 21394 hierarchical.cpp:1100] Grew agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), {
} (used)
I1018 09:05:14.229632 21394 hierarchical.cpp:1057] Agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 (ip-172-16-10-17.ec2.internal) updated
with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
I1018 09:05:14.230063 21394 hierarchical.cpp:1843] Performed allocation for 1
agents in 128843ns
I1018 09:05:14.230569 21391 master.cpp:10926] Recovered orphan operation
71647a26-b5fe-4b97-9162-0abb2785b909 (ID: operation) on agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 belonging to framework
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 in state OPERATION_PENDING
I1018 09:05:14.230813 21391 master.cpp:10824] Adding framework
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 (default) with roles { } suppressed
I1018 09:05:14.230991 21391 master.cpp:8295] Updating framework
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 (default) with roles { } suppressed
I1018 09:05:14.231298 21390 hierarchical.cpp:1100] Grew agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), {
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000: disk(allocated:
default-role)[RAW(,,profile)]:200 } (used)
{noformat}
This happens after the master failover. In particular, there are two `Grew
agent ...` messages, indicating that two resource providers (each with 200
disk) are added. The latter one reports the 200 disk as *used*. This is
probably the same 200 disk resource printed out above by [~bmahler].
I suspect this relates to orphan operations. cc [~greggomann]
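To make this pattern easier to spot in the attached log, here is a minimal
sketch that pulls out the `Grew agent` messages and shows which ones report a
non-empty *used* set. The helper name `grew_agent_events` and the regular
expression are my own assumptions, inferred only from the excerpt above; they
are not part of Mesos and may not match other allocator log formats.

```python
import re

# Log excerpt copied from the comment above, with the mail line wrapping kept.
LOG = """\
I1018 09:05:14.229159 21394 hierarchical.cpp:1100] Grew agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), {
} (used)
I1018 09:05:14.231298 21390 hierarchical.cpp:1100] Grew agent
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-S0 by disk[RAW(,,profile)]:200 (total), {
e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000: disk(allocated:
default-role)[RAW(,,profile)]:200 } (used)
"""

# Assumed message shape: "Grew agent <id> by <total> (total), { <used> } (used)".
GREW_AGENT = re.compile(
    r"Grew agent (?P<agent>\S+) by (?P<total>.+?) \(total\), "
    r"\{(?P<used>.*?)\} \(used\)"
)


def grew_agent_events(log_text):
    """Return one dict per 'Grew agent' message found in log_text."""
    # Collapse all whitespace so messages wrapped across lines match again.
    flat = " ".join(log_text.split())
    return [
        {"agent": m["agent"], "total": m["total"], "used": m["used"].strip()}
        for m in GREW_AGENT.finditer(flat)
    ]


events = grew_agent_events(LOG)
for event in events:
    status = "used: " + event["used"] if event["used"] else "nothing used"
    print(event["agent"], "|", event["total"], "|", status)
```

Run against the excerpt, this yields two events for the same agent with the
same total; only the second has a non-empty used set, allocated to
`default-role`.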
> `tryUntrackFrameworkUnderRole` check failed in
> `HierarchicalAllocatorProcess::removeFramework`.
> -----------------------------------------------------------------------------------------------
>
> Key: MESOS-10014
> URL: https://issues.apache.org/jira/browse/MESOS-10014
> Project: Mesos
> Issue Type: Bug
> Components: master, test
> Affects Versions: 1.10
> Reporter: Andrei Budnik
> Priority: Major
> Labels: flaky-test, resource-management
> Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt
>
>
> `ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0`
> test failed:
> {code:java}
> F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed:
> tryUntrackFrameworkUnderRole(framework, role) Framework:
> e6284079-cb6a-4a47-8f9a-ea9b84ff622a-0000 role: default-role
> *** Check failure stack trace: ***
> @ 0x7f40fff0a1f6 google::LogMessage::Fail()
> @ 0x7f40fff0a14f google::LogMessage::SendToLog()
> @ 0x7f40fff09a91 google::LogMessage::Flush()
> @ 0x7f40fff0d12f google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f410fd828ac
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
> @ 0x186b29f
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
> @ 0x189c273
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
> @ 0x18990b7
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1EEEE13invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSH_N5cpp1416integer_sequenceImJXspT2_EEEESL_
> @ 0x1896100
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1EEEEclIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOSX_
> @ 0x1895174
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_ISB_St12_PlaceholderILi1EEEEEISQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_
> @ 0x1894b2b
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EEEEEJSR_EEEvOSG_DpOT0_
> @ 0x18943bc
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_ISF_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7f41016deb22
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7f410169620c process::ProcessBase::consume()
> @ 0x7f41016c0696
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x1822baa process::ProcessBase::serve()
> @ 0x7f4101692af1 process::ProcessManager::resume()
> @ 0x7f410168ed68
> _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> @ 0x7f41016b81e2
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f41016b7244
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
> @ 0x7f41016b6088
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f40fca44590 execute_native_thread_routine
> @ 0x7f40ffa77e25 start_thread
> @ 0x7f40fa396bad __clone
> @ (nil) (unknown)
> {code}