[
https://issues.apache.org/jira/browse/MESOS-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822053#comment-13822053
]
Chi Zhang commented on MESOS-473:
---------------------------------
The destroy code path does try to write "FROZEN" on every attempt, so it could
be that we are simply not giving the processes enough time to finish freezing.
We could potentially set up a listener mechanism to get notified when FREEZING
transitions to FROZEN; the question is whether that transition ever happens
definitively. The fact that the tasks go into 'D' state when there is no real
physical I/O involved smells like a kernel issue to me.
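For illustration, here is a minimal standalone sketch of that retry-until-FROZEN
loop (hypothetical paths, constants, and helper names; this is not the actual
Mesos destroy path):

    // Sketch only: repeatedly write FROZEN to freezer.state and poll it until
    // the kernel reports FROZEN, or give up after a deadline. Assumes a
    // cgroup v1 freezer hierarchy mounted under /sys/fs/cgroup/freezer.
    #include <chrono>
    #include <fstream>
    #include <string>
    #include <thread>

    // Returns true once freezer.state reads FROZEN; false if 'timeout'
    // elapses first (e.g. because tasks are stuck in 'D' state).
    bool freeze(const std::string& cgroup, std::chrono::seconds timeout)
    {
      const auto deadline = std::chrono::steady_clock::now() + timeout;

      while (std::chrono::steady_clock::now() < deadline) {
        // Re-issue the write on every attempt, as the destroy code path
        // does; the write can fail with EBUSY while tasks are freezing.
        std::ofstream out(cgroup + "/freezer.state");
        out << "FROZEN";
        out.close();

        std::ifstream in(cgroup + "/freezer.state");
        std::string state;
        in >> state;

        if (state == "FROZEN") {
          return true;
        }

        // Still FREEZING (or THAWED): back off briefly and retry.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }

      return false;
    }

    int main()
    {
      // Hypothetical cgroup; substitute a real freezer cgroup path.
      return freeze("/sys/fs/cgroup/freezer/mesos_test",
                    std::chrono::seconds(10)) ? 0 : 1;
    }

If the stuck tasks never leave 'D', the read never returns FROZEN and a loop
like this can only time out, which matches what we are seeing here.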
> Freezer fails fatally when it is unable to write 'FROZEN' to freezer.state
> --------------------------------------------------------------------------
>
> Key: MESOS-473
> URL: https://issues.apache.org/jira/browse/MESOS-473
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.10.0, 0.11.0, 0.12.0, 0.13.0
> Reporter: Vinod Kone
> Assignee: Ian Downes
> Labels: twitter
> Fix For: 0.16.0
>
>
> Observed this when running tests in a loop. This was
> SlaveRecoveryTest.RecoverTerminatedExecutor.
> F0517 22:40:00.163806 9004 cgroups_isolator.cpp:1165] Failed to destroy
> cgroup
> mesos_test/framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8:
> Failed to kill tasks in nested cgroups: Collect failed: Failed to write
> control 'freezer.state': Device or resource busy
> *** Check failure stack trace: ***
> @ 0x7facb0d080ed google::LogMessage::Fail()
> @ 0x7facb0d0dd57 google::LogMessage::SendToLog()
> @ 0x7facb0d0999c google::LogMessage::Flush()
> @ 0x7facb0d09c06 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7facb0a96837 mesos::internal::slave::CgroupsIsolator::_killExecutor()
> @ 0x7facb0aaa6b0 std::tr1::_Mem_fn<>::operator()()
> @ 0x7facb0aabdce std::tr1::_Bind<>::operator()<>()
> @ 0x7facb0aabdfd std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7facb0ab1043 std::tr1::function<>::operator()()
> @ 0x7facb0ab875e process::internal::vdispatcher<>()
> @ 0x7facb0ab9b98 std::tr1::_Bind<>::operator()<>()
> @ 0x7facb0ab9bed std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7facb0c09059 std::tr1::function<>::operator()()
> @ 0x7facb0bcf54d process::ProcessBase::visit()
> @ 0x7facb0be43ca process::DispatchEvent::visit()
> @ 0x5fcd90 process::ProcessBase::serve()
> @ 0x7facb0bd8e3d process::ProcessManager::resume()
> @ 0x7facb0bd9688 process::schedule()
> @ 0x7facafcb473d start_thread
> @ 0x7facae698f6d clone
> The process states of the tasks in the cgroup are either uninterruptible
> sleep ('D') or traced ('T'):
> [vinod@smfd-bkq-03-sr4
> framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8]$
> cat tasks | xargs ps -F -p
> UID PID PPID C SZ RSS PSR STIME TTY STAT TIME CMD
> root 25761 1 0 91854 15648 4 22:39 ? Dl 0:00
> /home/vinod/mesos/build/src/.libs/lt-mesos-executor
> root 25802 25761 0 14734 544 13 22:39 ? Ts 0:00 sleep 1000
> root 25804 25761 0 15961 1296 7 22:39 ? D 0:00 /bin/bash
> /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> root 25814 25804 0 15961 224 14 22:39 ? D 0:00 /bin/bash
> /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> gdb hangs when trying to attach to the mesos executor, likely because it is
> in the 'D' state.