Naveen created MESOS-10146:
------------------------------
Summary: Removing task from slave when framework is disconnected
causes master to crash
Key: MESOS-10146
URL: https://issues.apache.org/jira/browse/MESOS-10146
Project: Mesos
Issue Type: Bug
Components: c++ api, framework
Affects Versions: 1.9.0
Environment: Mesos master with three master nodes
Reporter: Naveen
Hello,
We want to report a master crash we observed when tasks are removed from a
slave. Before the tasks are removed, the master checks that the framework is
valid (framework != nullptr). A framework can be disconnected for several
reasons; when that happens, the check fails and crashes the Mesos master node.
[https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]
There is also an unguarded access to the internal framework state on line 11853.
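To illustrate the kind of guard we have in mind, here is a minimal, self-contained sketch. It does not use the actual Mesos types: `Framework`, `Registry`, `getFramework` and `removeAgentTasks` are hypothetical stand-ins invented for this example. It only shows the general pattern of logging and skipping a missing framework instead of aborting the process. Whether skipping is the right behavior for the master is an open question, since the existing check may be protecting an invariant whose violation points to a deeper bug.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's per-framework state.
// This is NOT the Mesos Framework class.
struct Framework {
  std::vector<std::string> removedTasks;
};

// Hypothetical stand-in for the master's framework bookkeeping.
struct Registry {
  std::map<std::string, Framework> frameworks;

  // Returns nullptr when the framework is unknown (for example,
  // because it has already been removed).
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }
};

// Removes an agent's tasks, which are keyed by framework ID.
// Returns the number of framework entries that were skipped because
// the framework could not be found.
int removeAgentTasks(
    Registry& registry,
    const std::map<std::string, std::vector<std::string>>& agentTasks) {
  int skipped = 0;
  for (const auto& entry : agentTasks) {
    Framework* framework = registry.getFramework(entry.first);
    if (framework == nullptr) {
      // A fatal check here would abort the whole process.
      // Log the problem and continue with the remaining frameworks.
      std::cerr << "Framework " << entry.first
                << " not found while removing agent tasks; skipping"
                << std::endl;
      ++skipped;
      continue;
    }
    // The framework pointer is only dereferenced after the null guard.
    for (const auto& taskId : entry.second) {
      framework->removedTasks.push_back(taskId);
    }
  }
  return skipped;
}
```

With this pattern, an agent whose task map references an unknown framework produces a warning and a skipped entry, while tasks of known frameworks are still processed.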
Error logs:
{noformat}
mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent
3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health
check timed out
mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check failed:
framework != nullptr Framework 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not
found while removing agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at
slave(1)@10.160.73.79:5051 (10.160.73.79); agent tasks: {
3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } }
mesos-master[5483]: *** Check failure stack trace: ***
mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed
all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed
agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica
received learned notice for position 42070 from
log-network(1)@10.160.73.212:5050
mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
mesos-master[5483]: @ 0x7f2fdf6a8859 google::LogMessageFatal::~LogMessageFatal()
mesos-master[5483]: @ 0x7f2fde2677f2
mesos::internal::master::Master::__removeSlave()
mesos-master[5483]: @ 0x7f2fde267ebe
mesos::internal::master::Master::_markUnreachable()
mesos-master[5483]: @ 0x7f2fde268215
_ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbEEEEclEv
mesos-master[5483]: @ 0x7f2fddf30688
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
mesos-master[5483]: @ 0x7f2fdf60cb36
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
systemd[1]: mesos-master.service: main process exited, code=killed,
status=6/ABRT
systemd[1]: Unit mesos-master.service entered failed state.
systemd[1]: mesos-master.service failed.
systemd[1]: mesos-master.service holdoff time over, scheduling restart.
systemd[1]: Stopped Mesos Master.
systemd[1]: Started Mesos Master.
mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level
logging started!
mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build:
2020-05-09 10:42:00 by centos
mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA:
5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat}