[ https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699724#comment-16699724 ]
Benjamin Mahler commented on MESOS-8623:
----------------------------------------

Looks like we really dropped the ball on this one; linking in MESOS-9419 and upgrading to Blocker.

> Crashed framework brings down the whole Mesos cluster
> -----------------------------------------------------
>
>                 Key: MESOS-8623
>                 URL: https://issues.apache.org/jira/browse/MESOS-8623
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.1
>         Environment: Debian 8
>                      Mesos 1.4.1
>            Reporter: Tomas Barton
>            Priority: Critical
>
> It might be hard to reproduce, but once you do, your Mesos cluster is gone.
> The issue was caused by an unresponsive Docker engine on a single agent node.
> Unfortunately, even after fixing the Docker issues, all Mesos masters
> repeatedly failed to start. In desperation I deleted all {{replicated_log}}
> data from the masters and from ZooKeeper. Even after that, messages from the
> agents' {{replicated_log}} got replayed and the master crashed again. The
> average lifetime of a Mesos master was less than one minute.
> {code}
> mesos-master[3814]: I0228 00:25:55.269835  3828 network.hpp:436] ZooKeeper group memberships changed
> mesos-master[3814]: I0228 00:25:55.269979  3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002519' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.271117  3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002520' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.277971  3832 group.cpp:700] Trying to get '/mesos/log_replicas/0000002521' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.279296  3827 network.hpp:484] ZooKeeper group PIDs: { log-replica(1)
> mesos-master[3814]: W0228 00:26:15.261255  3831 master.hpp:2372] Master attempted to send message to disconnected framework 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka)
> mesos-master[3814]: F0228 00:26:15.261318  3831 master.hpp:2382] CHECK_SOME(pid): is NONE
> mesos-master[3814]: *** Check failure stack trace: ***
> mesos-master[3814]:     @     0x7f7187ca073d  google::LogMessage::Fail()
> mesos-master[3814]:     @     0x7f7187ca23bd  google::LogMessage::SendToLog()
> mesos-master[3814]:     @     0x7f7187ca0302  google::LogMessage::Flush()
> mesos-master[3814]:     @     0x7f7187ca2da9  google::LogMessageFatal::~LogMessageFatal()
> mesos-master[3814]:     @     0x7f7186d6d769  _CheckFatal::~_CheckFatal()
> mesos-master[3814]:     @     0x7f71870465d5  mesos::internal::master::Framework::send<>()
> mesos-master[3814]:     @     0x7f7186fcfe8a  mesos::internal::master::Master::executorMessage()
> mesos-master[3814]:     @     0x7f718706b1a1  ProtobufProcess<>::handler4<>()
> mesos-master[3814]:     @     0x7f7187008e36  std::_Function_handler<>::_M_invoke()
> mesos-master[3814]:     @     0x7f71870293d1  ProtobufProcess<>::visit()
> mesos-master[3814]:     @     0x7f7186fb7ee4  mesos::internal::master::Master::_visit()
> mesos-master[3814]:     @     0x7f7186fd0d5d  mesos::internal::master::Master::visit()
> mesos-master[3814]:     @     0x7f7187c02e22  process::ProcessManager::resume()
> mesos-master[3814]:     @     0x7f7187c08d46  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vE
> mesos-master[3814]:     @     0x7f7185babca0  (unknown)
> mesos-master[3814]:     @     0x7f71853c6064  start_thread
> mesos-master[3814]:     @     0x7f71850fb62d  (unknown)
> systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopping Mesos Master...
> systemd[1]: Starting Mesos Master...
> systemd[1]: Started Mesos Master.
> {code}
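>
> For context on the crash itself: {{CHECK_SOME}} is stout's fatal assertion on an {{Option}}, so a NONE pid takes the entire master process down rather than just dropping one message. Below is a minimal sketch of the pattern, NOT the actual Mesos source; {{std::optional}} stands in for stout's {{Option<T>}}, and the class and method names only mirror the stack trace above:
> {code}
> // Minimal sketch of the pattern that aborts the master. CHECK_SOME behaves
> // like glog's CHECK: if the Option is NONE it logs a FATAL message and
> // calls abort(), which is the SIGABRT (status=6/ABRT) systemd reports above.
> #include <cstdlib>
> #include <iostream>
> #include <optional>
> #include <string>
>
> // Hypothetical stand-in for stout's CHECK_SOME: fatal if the option is empty.
> #define CHECK_SOME(opt)                                           \
>   do {                                                            \
>     if (!(opt).has_value()) {                                     \
>       std::cerr << "CHECK_SOME(" #opt "): is NONE" << std::endl;  \
>       std::abort();  /* takes the whole master process down */    \
>     }                                                             \
>   } while (false)
>
> struct Framework {
>   std::optional<std::string> pid;  // becomes NONE once the framework disconnects
>
>   // The crashing pattern: assumes the framework is still connected.
>   void send(const std::string& message) {
>     CHECK_SOME(pid);
>     std::cout << "sending to " << *pid << ": " << message << std::endl;
>   }
>
>   // A defensive variant: drop the message instead of aborting.
>   void sendGuarded(const std::string& message) {
>     if (!pid.has_value()) {
>       std::cerr << "dropping message to disconnected framework" << std::endl;
>       return;
>     }
>     std::cout << "sending to " << *pid << ": " << message << std::endl;
>   }
> };
>
> int main() {
>   Framework kafka;  // disconnected: pid is NONE, as in the log above
>   kafka.sendGuarded("ExecutorToFrameworkMessage");  // warns and drops
>   kafka.send("ExecutorToFrameworkMessage");         // aborts, as in the log
>   return 0;
> }
> {code}
> A guard like {{sendGuarded()}} would degrade the fatal check into the warning the master already logs at master.hpp:2372, at the cost of silently dropping the executor message.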
> After systemd restarts it, the master comes back up and begins replica recovery:
> {code}
> mesos-master[27840]: WARNING: Logging before InitGoogleLogging() is written to STDERR
> mesos-master[27840]: I0228 01:32:38.294122 27829 main.cpp:232] Build: 2017-11-18 02:15:41 by admin
> mesos-master[27840]: I0228 01:32:38.294168 27829 main.cpp:233] Version: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294178 27829 main.cpp:236] Git tag: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294186 27829 main.cpp:240] Git SHA: c844db9ac7c0cef59be87438c6781bfb71adcc42
> mesos-master[27840]: I0228 01:32:38.296067 27829 main.cpp:340] Using 'HierarchicalDRF' allocator
> mesos-master[27840]: I0228 01:32:38.411576 27829 replica.cpp:779] Replica recovered with log positions 13 -> 14 with 0 holes and 0 unlearned
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> mesos-master[27840]: I0228 01:32:38.412711 27841 log.cpp:107] Attempting to join replica to ZooKeeper group
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@730: Client environment:host.name=svc01
> mesos-master[27840]: I0228 01:32:38.412932 27841 recover.cpp:451] Starting replica recovery
> mesos-master[27840]: I0228 01:32:38.413024 27846 recover.cpp:477] Replica is in VOTING status
> mesos-master[27840]: 2018-02-28 01:32:38,412:27829(0x7fab4775b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab4775b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.16.0-5-amd64
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab4775b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08)
> mesos-master[27840]: 2018-02-28 01:32:38,413:27829(0x7fab47f5c700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> {code}
> On the agent, Mesos repeatedly tries (and fails) to kill all tasks of the Kafka framework:
> {code}
> I0227 23:14:57.993875 26049 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:14:57.993914 26049 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:02.993985 26048 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:15:02.994027 26048 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:07.992681 26041 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> W0227 23:15:07.992720 26041 slave.cpp:3099] Ignoring kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f because the executor 'ser
> I0227 23:15:12.992703 26039 slave.cpp:2931] Asked to kill task service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f of framework ecd3a4be-d34
> {code}
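>
> Note the kill request is re-sent on a fixed five-second cadence and the agent ignores every attempt (the reason string is truncated above). A hypothetical sketch of what a scheduler-side retry with a deadline could look like, so a wedged executor cannot keep the loop alive forever; {{tryKillTask}}, the attempt limit, and the escalation step are all assumptions, not Mesos API:
> {code}
> // Hypothetical illustration of the retry pattern visible in the agent log:
> // the kill is re-sent every five seconds and the agent keeps answering
> // "Ignoring kill task ...", so without a deadline the loop runs forever.
> #include <chrono>
> #include <iostream>
> #include <string>
> #include <thread>
>
> // Hypothetical hook: returns true once the kill was actually delivered.
> bool tryKillTask(const std::string& taskId) {
>   std::cout << "Asked to kill task " << taskId << std::endl;
>   return false;  // in the incident above, the agent ignores every attempt
> }
>
> int main() {
>   const std::string taskId =
>       "service_mesos-kafka_kafka.3bb2ccf6-1c13-11e8-a38c-fef9d3423c7f";
>   const int kMaxAttempts = 12;  // give up after ~1 minute instead of never
>
>   for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
>     if (tryKillTask(taskId)) {
>       return 0;  // kill acknowledged
>     }
>     std::this_thread::sleep_for(std::chrono::seconds(5));
>   }
>
>   std::cerr << "kill never acknowledged; escalate (e.g. destroy the container)"
>             << std::endl;
>   return 1;
> }
> {code}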
> The Docker daemon stopped responding, so tasks were failing:
> {code}
> E0227 23:06:20.440865 26044 slave.cpp:5292] Container '10763986-e133-4483-b34e-cfe10903a46f' for executor 'api_rapi-a.d25fc1e0-1c12-11e8-a38c-fef9d3423
> *** Aborted at 1519772780 (unix time) try "date -d @1519772780" if you are using GNU date ***
> PC: @     0x7f1326961067  (unknown)
> PC: @     0x7f1326961067  (unknown)
> *** SIGABRT (@0x47c0) received by PID 18368 (TID 0x7f131c27b700) from PID 18368; stack trace: ***
>     @     0x7f1326ce6890  (unknown)
>     @     0x7f1326961067  (unknown)
>     @     0x7f1326962448  (unknown)
>     @     0x560218e2f740  (unknown)
>     @     0x560218e2f77c  (unknown)
>     @     0x7f1329261ff9  (unknown)
>     @     0x7f1329261c81  (unknown)
>     @     0x7f1329261c41  (unknown)
>     @     0x7f1329263ecf  (unknown)
>     @     0x7f1329261561  (unknown)
>     @     0x7f1328422509  (unknown)
>     @     0x7f13284253ff  (unknown)
>     @     0x7f1328433134  (unknown)
>     @     0x7f132844ab26  (unknown)
>     @     0x7f1328398a47  (unknown)
>     @     0x7f132839a9ed  (unknown)
>     @     0x7f1329260bcc  (unknown)
>     @     0x7f1328398a47  (unknown)
>     @     0x7f132839a9ed  (unknown)
>     @     0x7f13289ec817  (unknown)
>     @     0x7f13289ec91f  (unknown)
>     @     0x7f1329259818  (unknown)
>     @     0x7f1329259ac3  (unknown)
>     @     0x7f1329218e29  (unknown)
>     @     0x7f132921ed16  (unknown)
>     @     0x7f13271c3ca0  (unknown)
>     @     0x7f1326cdf064  start_thread
>     @     0x7f1326a1462d  (unknown)
> '
> I0227 23:06:20.441429 26043 slave.cpp:5405] Executor 'api_rapi-a.d25fc1e0-1c12-11e8-a38c-fef9d3423c7f' of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d13
> I0227 23:06:20.441520 26043 slave.cpp:4399] Handling status update TASK_FAILED (UUID: 003e072e-341b-48ce-abb5-79849128ba6f) for task api_rapi-a.d25fc1e
> {code}
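>
> Since the root cause was an unresponsive Docker engine, it can help to probe the daemon independently of Mesos. A minimal standalone sketch that checks whether the local daemon answers {{GET /_ping}} within a timeout; it assumes the default {{/var/run/docker.sock}} socket path, and relies on the Docker Engine API's {{/_ping}} endpoint returning HTTP 200 with body "OK" when healthy:
> {code}
> // Hypothetical standalone probe (Linux): exits 0 if the local Docker daemon
> // answers GET /_ping within kTimeoutSec, non-zero otherwise.
> #include <cstdio>
> #include <cstring>
> #include <iostream>
> #include <string>
>
> #include <sys/socket.h>
> #include <sys/time.h>
> #include <sys/un.h>
> #include <unistd.h>
>
> int main() {
>   const char* kSocketPath = "/var/run/docker.sock";  // default socket path
>   const int kTimeoutSec = 5;
>
>   int fd = socket(AF_UNIX, SOCK_STREAM, 0);
>   if (fd < 0) { perror("socket"); return 1; }
>
>   // Bound both read and write so a wedged daemon cannot hang the probe.
>   struct timeval tv{kTimeoutSec, 0};
>   setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
>   setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
>
>   struct sockaddr_un addr{};
>   addr.sun_family = AF_UNIX;
>   strncpy(addr.sun_path, kSocketPath, sizeof(addr.sun_path) - 1);
>
>   if (connect(fd, (struct sockaddr*)&addr, sizeof(addr)) != 0) {
>     perror("connect"); close(fd); return 1;
>   }
>
>   const std::string request = "GET /_ping HTTP/1.0\r\nHost: docker\r\n\r\n";
>   if (write(fd, request.data(), request.size()) != (ssize_t)request.size()) {
>     perror("write"); close(fd); return 1;
>   }
>
>   char buf[512];
>   ssize_t n = read(fd, buf, sizeof(buf) - 1);  // blocks at most kTimeoutSec
>   close(fd);
>
>   if (n <= 0) { std::cerr << "docker daemon unresponsive" << std::endl; return 1; }
>   buf[n] = '\0';
>
>   // A healthy daemon returns an HTTP 200 status line with body "OK".
>   bool ok = strstr(buf, " 200 ") != nullptr;
>   std::cout << (ok ? "docker responsive" : "docker returned an error") << std::endl;
>   return ok ? 0 : 1;
> }
> {code}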