[ https://issues.apache.org/jira/browse/MESOS-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway reassigned MESOS-4842:
----------------------------------

    Assignee: Neil Conway

> Sudden framework crash may bring down mesos master
> --------------------------------------------------
>
>                 Key: MESOS-4842
>                 URL: https://issues.apache.org/jira/browse/MESOS-4842
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework, master
>    Affects Versions: 0.27.1
>            Reporter: Guillermo Rodriguez
>            Assignee: Neil Conway
>
> Using:
> swarm 1.1.3-rc1
> CoreOS 899.9
> Mesos 0.27.1
> Marathon 0.15.3
> When swarm is stopped/restarted it may crash the mesos-master. It doesn't 
> happen every time, but it happens frequently enough to be a pain.
> If for some reason the swarm service fails, marathon will try to restart the 
> service. When this happens mesos may crash, roughly 50% of the time.
> This looks like a swarm/mesos problem so I will report the same error to both 
> lists.
> This is the final lines of mesos logs:
> {code}
> I0303 04:32:45.327628 8 master.cpp:5202] Framework failover timeout, removing 
> framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (swarm) at 
> scheduler(1)@172.31.39.68:3375
> I0303 04:32:45.327651 8 master.cpp:5933] Removing framework 
> b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (swarm) at 
> scheduler(1)@172.31.39.68:3375
> I0303 04:32:45.327847 8 master.cpp:6445] Updating the state of task 
> trth_download-SES.1fba45934ce4 of framework 
> b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (latest state: TASK_FAILED, status 
> update state: TASK_KILLED)
> I0303 04:32:45.327879 8 master.cpp:6511] Removing task 
> trth_download-SES.1fba45934ce4 with resources cpus(*):0.3; mem(*):450 of 
> framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 on slave 
> 2ce3e7f3-1712-4a4b-8338-04077c371a67-S7 at slave(1)@172.31.33.80:5051 
> (172.31.33.80)
> F0303 04:32:45.328032 8 sorter.cpp:251] Check failed: 
> total_.resources.contains(slaveId)
> *** Check failure stack trace: ***
> E0303 04:32:45.328198 13 process.cpp:1966] Failed to shutdown socket with fd 
> 53: Transport endpoint is not connected
>  @ 0x7fc22173893d google::LogMessage::Fail()
>  @ 0x7fc22173a76d google::LogMessage::SendToLog()
>  @ 0x7fc22173852c google::LogMessage::Flush()
>  @ 0x7fc22173b069 google::LogMessageFatal::~LogMessageFatal()
>  @ 0x7fc22105773b mesos::internal::master::allocator::DRFSorter::remove()
>  @ 0x7fc22104623e 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
>  @ 0x7fc2216e5681 process::ProcessManager::resume()
>  @ 0x7fc2216e5987 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
>  @ 0x7fc22022ea60 (unknown)
>  @ 0x7fc21fa4b182 start_thread
>  @ 0x7fc21f77847d (unknown)
> *** Aborted at 1456979565 (unix time) try "date -d @1456979565" if you are 
> using GNU date ***
> PC: @ 0x7fc21f6b8227 (unknown)
> *** SIGSEGV (@0x0) received by PID 1 (TID 0x7fc2181a2700) from PID 0; stack 
> trace: ***
>  @ 0x7fc21fa53340 (unknown)
>  @ 0x7fc21f6b8227 (unknown)
>  @ 0x7fc221740be9 google::DumpStackTraceAndExit()
>  @ 0x7fc22173893d google::LogMessage::Fail()
>  @ 0x7fc22173a76d google::LogMessage::SendToLog()
>  @ 0x7fc22173852c google::LogMessage::Flush()
>  @ 0x7fc22173b069 google::LogMessageFatal::~LogMessageFatal()
>  @ 0x7fc22105773b mesos::internal::master::allocator::DRFSorter::remove()
>  @ 0x7fc22104623e 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
>  @ 0x7fc2216e5681 process::ProcessManager::resume()
>  @ 0x7fc2216e5987 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
>  @ 0x7fc22022ea60 (unknown)
>  @ 0x7fc21fa4b182 start_thread
>  @ 0x7fc21f77847d (unknown)
> {code}
> If you ask me, it looks like swarm is terminated and its connections are lost, 
> but just at that moment a task that was running on swarm is finishing. Mesos 
> then tries to inform the already deceased framework that the task is finished 
> and to recover its resources, but the framework is no longer there... so it 
> crashes.
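> A minimal sketch of the suspected double-recovery, assuming the sorter keeps 
> per-slave resource totals and asserts the slave is still known when removing 
> an allocation. This is a simplified, hypothetical model (the Sorter struct and 
> its add/remove methods below are made up for illustration), not the actual 
> Mesos sorter code; only the slave id S7 and the 0.3 cpus come from the log:
> {code}
> // Hypothetical model of the crash: if the same allocation is subtracted
> // twice -- once when the finishing task is reaped and once more while the
> // framework itself is being removed -- the second call finds the slave
> // already gone and the assertion fires, analogous to
> // "Check failed: total_.resources.contains(slaveId)".
> #include <cassert>
> #include <map>
> #include <string>
> 
> struct Sorter {
>   // slaveId -> cpus currently accounted to this sorter.
>   std::map<std::string, double> total;
> 
>   void add(const std::string& slaveId, double cpus) {
>     total[slaveId] += cpus;
>   }
> 
>   void remove(const std::string& slaveId, double cpus) {
>     // Mirrors the CHECK that fired in sorter.cpp.
>     assert(total.count(slaveId) > 0);
> 
>     total[slaveId] -= cpus;
>     if (total[slaveId] <= 0.0) {
>       total.erase(slaveId);  // slave no longer tracked once empty
>     }
>   }
> };
> 
> int main() {
>   Sorter sorter;
>   sorter.add("S7", 0.3);     // task allocated cpus(*):0.3 on slave S7
> 
>   sorter.remove("S7", 0.3);  // resources recovered when the task finishes
>   sorter.remove("S7", 0.3);  // recovered again while removing the framework
>                              // -> assertion failure, as in the stack trace
>   return 0;
> }
> {code}
> (Compile without -DNDEBUG so the assert is active.)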



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
