[
https://issues.apache.org/jira/browse/MESOS-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822053#comment-13822053
]
Chi Zhang commented on MESOS-473:
---------------------------------
The destroy code path does try to write "FROZEN" on every attempt, so it could
be that we are simply not giving the processes enough time to finish freezing.
We could potentially set up a listener mechanism to get notified when FREEZING
transitions to FROZEN; the question is whether that transition ever happens
definitively. The fact that the tasks go into 'D' state when there is no real
physical I/O involved smells like a kernel issue to me.
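For illustration, here is a minimal standalone sketch of that retry-until-FROZEN
loop (hypothetical paths, constants, and helper names; this is not the actual
Mesos destroy path):

    // Sketch only: repeatedly write FROZEN to freezer.state and poll it until
    // the kernel reports FROZEN, or give up after a deadline. Assumes a
    // cgroup v1 freezer hierarchy mounted under /sys/fs/cgroup/freezer.
    #include <chrono>
    #include <fstream>
    #include <string>
    #include <thread>

    // Returns true once freezer.state reads FROZEN; false if 'timeout'
    // elapses first (e.g. because tasks are stuck in 'D' state).
    bool freeze(const std::string& cgroup, std::chrono::seconds timeout)
    {
      const auto deadline = std::chrono::steady_clock::now() + timeout;

      while (std::chrono::steady_clock::now() < deadline) {
        // Re-issue the write on every attempt, as the destroy code path
        // does; the write can fail with EBUSY while tasks are freezing.
        std::ofstream out(cgroup + "/freezer.state");
        out << "FROZEN";
        out.close();

        std::ifstream in(cgroup + "/freezer.state");
        std::string state;
        in >> state;

        if (state == "FROZEN") {
          return true;
        }

        // Still FREEZING (or THAWED): back off briefly and retry.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }

      return false;
    }

    int main()
    {
      // Hypothetical cgroup; substitute a real freezer cgroup path.
      return freeze("/sys/fs/cgroup/freezer/mesos_test",
                    std::chrono::seconds(10)) ? 0 : 1;
    }

If the stuck tasks never leave 'D', the read never returns FROZEN and a loop
like this can only time out, which matches what we are seeing here.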
> Freezer fails fatally when it is unable to write 'FROZEN' to freezer.state
> --------------------------------------------------------------------------
>
> Key: MESOS-473
> URL: https://issues.apache.org/jira/browse/MESOS-473
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.10.0, 0.11.0, 0.12.0, 0.13.0
> Reporter: Vinod Kone
> Assignee: Ian Downes
> Labels: twitter
> Fix For: 0.16.0
>
>
> Observed this when running tests in a loop. This was
> SlaveRecoveryTest.RecoverTerminatedExecutor.
> F0517 22:40:00.163806 9004 cgroups_isolator.cpp:1165] Failed to destroy
> cgroup
> mesos_test/framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8:
> Failed to kill tasks in nested cgroups: Collect failed: Failed to write
> control 'freezer.state': Device or resource busy
> *** Check failure stack trace: ***
> @ 0x7facb0d080ed google::LogMessage::Fail()
> @ 0x7facb0d0dd57 google::LogMessage::SendToLog()
> @ 0x7facb0d0999c google::LogMessage::Flush()
> @ 0x7facb0d09c06 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7facb0a96837 mesos::internal::slave::CgroupsIsolator::_killExecutor()
> @ 0x7facb0aaa6b0 std::tr1::_Mem_fn<>::operator()()
> @ 0x7facb0aabdce std::tr1::_Bind<>::operator()<>()
> @ 0x7facb0aabdfd std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7facb0ab1043 std::tr1::function<>::operator()()
> @ 0x7facb0ab875e process::internal::vdispatcher<>()
> @ 0x7facb0ab9b98 std::tr1::_Bind<>::operator()<>()
> @ 0x7facb0ab9bed std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7facb0c09059 std::tr1::function<>::operator()()
> @ 0x7facb0bcf54d process::ProcessBase::visit()
> @ 0x7facb0be43ca process::DispatchEvent::visit()
> @ 0x5fcd90 process::ProcessBase::serve()
> @ 0x7facb0bd8e3d process::ProcessManager::resume()
> @ 0x7facb0bd9688 process::schedule()
> @ 0x7facafcb473d start_thread
> @ 0x7facae698f6d clone
> The process states of the tasks in the cgroup are either uninterruptible
> sleep ('D') or traced ('T'):
> [vinod@smfd-bkq-03-sr4
> framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8]$
> cat tasks | xargs ps -F -p
> UID PID PPID C SZ RSS PSR STIME TTY STAT TIME CMD
> root 25761 1 0 91854 15648 4 22:39 ? Dl 0:00
> /home/vinod/mesos/build/src/.libs/lt-mesos-executor
> root 25802 25761 0 14734 544 13 22:39 ? Ts 0:00 sleep 1000
> root 25804 25761 0 15961 1296 7 22:39 ? D 0:00 /bin/bash
> /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> root 25814 25804 0 15961 224 14 22:39 ? D 0:00 /bin/bash
> /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> gdb hangs when trying to attach to the mesos executor, likely because it is
> in the 'D' state.