[ https://issues.apache.org/jira/browse/MESOS-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil Conway reassigned MESOS-4842:
----------------------------------

    Assignee: Neil Conway

> Sudden framework crash may bring down mesos master
> --------------------------------------------------
>
>                 Key: MESOS-4842
>                 URL: https://issues.apache.org/jira/browse/MESOS-4842
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework, master
>    Affects Versions: 0.27.1
>            Reporter: Guillermo Rodriguez
>            Assignee: Neil Conway
>
> Using:
> swarm 1.1.3-rc1
> CoreOS 899.9
> Mesos 0.27.1
> Marathon 0.15.3
> When swarm is stopped or restarted, it may crash the mesos-master. It doesn't happen every time, but frequently enough to be a pain.
> If the swarm service fails for some reason, marathon will try to restart it. When this happens, mesos may crash — I would say around 50% of the time.
> This looks like a swarm/mesos problem, so I will report the same error to both lists.
> These are the final lines of the mesos logs:
> {code}
> I0303 04:32:45.327628 8 master.cpp:5202] Framework failover timeout, removing framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (swarm) at scheduler(1)@172.31.39.68:3375
> I0303 04:32:45.327651 8 master.cpp:5933] Removing framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (swarm) at scheduler(1)@172.31.39.68:3375
> I0303 04:32:45.327847 8 master.cpp:6445] Updating the state of task trth_download-SES.1fba45934ce4 of framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (latest state: TASK_FAILED, status update state: TASK_KILLED)
> I0303 04:32:45.327879 8 master.cpp:6511] Removing task trth_download-SES.1fba45934ce4 with resources cpus(*):0.3; mem(*):450 of framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 on slave 2ce3e7f3-1712-4a4b-8338-04077c371a67-S7 at slave(1)@172.31.33.80:5051 (172.31.33.80)
> F0303 04:32:45.328032 8 sorter.cpp:251] Check failed: total_.resources.contains(slaveId)
> *** Check failure stack trace: ***
> E0303 04:32:45.328198 13 process.cpp:1966] Failed to shutdown socket with fd 53: Transport endpoint is not connected
>     @     0x7fc22173893d  google::LogMessage::Fail()
>     @     0x7fc22173a76d  google::LogMessage::SendToLog()
>     @     0x7fc22173852c  google::LogMessage::Flush()
>     @     0x7fc22173b069  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fc22105773b  mesos::internal::master::allocator::DRFSorter::remove()
>     @     0x7fc22104623e  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
>     @     0x7fc2216e5681  process::ProcessManager::resume()
>     @     0x7fc2216e5987  _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
>     @     0x7fc22022ea60  (unknown)
>     @     0x7fc21fa4b182  start_thread
>     @     0x7fc21f77847d  (unknown)
> *** Aborted at 1456979565 (unix time) try "date -d @1456979565" if you are using GNU date ***
> PC: @ 0x7fc21f6b8227 (unknown)
> *** SIGSEGV (@0x0) received by PID 1 (TID 0x7fc2181a2700) from PID 0; stack trace: ***
>     @     0x7fc21fa53340  (unknown)
>     @     0x7fc21f6b8227  (unknown)
>     @     0x7fc221740be9  google::DumpStackTraceAndExit()
>     @     0x7fc22173893d  google::LogMessage::Fail()
>     @     0x7fc22173a76d  google::LogMessage::SendToLog()
>     @     0x7fc22173852c  google::LogMessage::Flush()
>     @     0x7fc22173b069  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fc22105773b  mesos::internal::master::allocator::DRFSorter::remove()
>     @     0x7fc22104623e  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
>     @     0x7fc2216e5681  process::ProcessManager::resume()
>     @     0x7fc2216e5987  _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
>     @     0x7fc22022ea60  (unknown)
>     @     0x7fc21fa4b182  start_thread
>     @     0x7fc21f77847d  (unknown)
> {code}
> If you ask me, it looks like swarm is terminated and its connections are lost, but just at that moment a task that was running on swarm was finishing.
> Then mesos tries to inform the already deceased framework that the task is finished and that its resources need to be recovered, but the framework is no longer there... so it crashes.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)