[ 
https://issues.apache.org/jira/browse/MESOS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872132#comment-15872132
 ] 

Benjamin Mahler commented on MESOS-7122:
----------------------------------------

{quote}
Future::await is still called in many places in the code base and if all active 
actors block like this
{quote}

Have you done a survey? Where are they?

Using sleep would also cost us our Clock-based time control, which seems 
undesirable. Another thought is that we don't need an explicit thread: we 
could use an actorless approach, but then the timer thread would be executing 
the non-blocking waitpid calls. I'm not sure how long these can take in the 
worst case, and we generally don't want the timer thread making system calls, 
so this may not be an option. What worries me about this approach is that 
we're having to be careful about using actors in libprocess for deadlock 
prevention reasons, which seems to be a symptom of a bigger issue.
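
For reference, a non-blocking waitpid call of the kind discussed above might 
look roughly like the sketch below. This is plain POSIX, not the libprocess 
reaper: {{tryReap}}, {{spawnAndReap}}, {{ReapState}} and {{ReapResult}} are 
hypothetical names made up for illustration, and the short sleep between polls 
is only there to keep the demo small (a real reaper would presumably drive the 
polling from a Clock-based timer).

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <chrono>
#include <thread>

// Hypothetical result type for this sketch (not a libprocess type).
enum class ReapState { RUNNING, TERMINATED, FAILED };

struct ReapResult
{
  ReapState state;
  int status; // Raw wait status; only meaningful when state == TERMINATED.
};

// Polls one child without blocking. WNOHANG makes waitpid return 0
// immediately if the child has not changed state yet.
ReapResult tryReap(pid_t pid)
{
  int status = 0;
  pid_t result;
  do {
    result = ::waitpid(pid, &status, WNOHANG);
  } while (result == -1 && errno == EINTR);

  if (result == 0) return {ReapState::RUNNING, 0};
  if (result == pid) return {ReapState::TERMINATED, status};
  return {ReapState::FAILED, 0}; // E.g. ECHILD: not our child.
}

// Demo driver: forks a child that exits with `code`, then polls until
// it has been reaped. Returns the exit code, or -1 on any failure.
int spawnAndReap(int code)
{
  pid_t pid = ::fork();
  if (pid == -1) return -1;
  if (pid == 0) ::_exit(code); // Child: exit immediately.

  for (;;) {
    ReapResult r = tryReap(pid);
    if (r.state == ReapState::TERMINATED) {
      return WIFEXITED(r.status) ? WEXITSTATUS(r.status) : -1;
    }
    if (r.state == ReapState::FAILED) return -1;

    // Placeholder for a timer tick; see the note above.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```

Each {{tryReap}} call is a system call on whichever thread runs it, which is 
the cost raised above for running it on the timer thread.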

More generally, if Processes are blocking, then we may deadlock regardless of 
the reaper's involvement; the rate limiter is one example. An actorless 
implementation seems more feasible for the rate limiter, but again this seems 
like an optimization rather than something that has anything to do with 
deadlock prevention.
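
To make the general problem concrete, here is a small sketch using only the 
C++ standard library. It is not libprocess: {{FixedPool}} and 
{{countStarved}} are hypothetical names, the pool is a stand-in for the fixed 
set of worker threads, and the queued "reaper" task is a stand-in for whatever 
Process has to run to complete the Future. A timed wait replaces the unbounded 
await so that the sketch terminates instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fixed-size worker pool with a FIFO task queue.
class FixedPool
{
public:
  explicit FixedPool(std::size_t n)
  {
    for (std::size_t i = 0; i < n; ++i) {
      workers.emplace_back([this] { run(); });
    }
  }

  // Drains the queue, then joins all workers.
  ~FixedPool()
  {
    { std::lock_guard<std::mutex> lock(mutex); done = true; }
    cond.notify_all();
    for (std::thread& worker : workers) worker.join();
  }

  void submit(std::function<void()> task)
  {
    { std::lock_guard<std::mutex> lock(mutex); tasks.push_back(std::move(task)); }
    cond.notify_one();
  }

private:
  void run()
  {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex);
        cond.wait(lock, [this] { return done || !tasks.empty(); });
        if (tasks.empty()) return; // Shutting down and nothing left to run.
        task = std::move(tasks.front());
        tasks.pop_front();
      }
      task();
    }
  }

  std::mutex mutex;
  std::condition_variable cond;
  std::deque<std::function<void()>> tasks;
  std::vector<std::thread> workers;
  bool done = false;
};

// Queues `blockers` tasks that block on a result, followed by the one task
// that produces it. Returns how many blockers gave up after `timeout`.
int countStarved(
    std::size_t workers,
    std::size_t blockers,
    std::chrono::milliseconds timeout)
{
  auto reaped = std::make_shared<std::promise<void>>();
  std::shared_future<void> status = reaped->get_future().share();
  auto starved = std::make_shared<std::atomic<int>>(0);

  {
    FixedPool pool(workers);
    for (std::size_t i = 0; i < blockers; ++i) {
      pool.submit([status, starved, timeout] {
        // Stand-in for a blocking await on a worker thread.
        if (status.wait_for(timeout) == std::future_status::timeout) {
          ++*starved;
        }
      });
    }
    // Stand-in for the reaper: it can only run on a free worker.
    pool.submit([reaped] { reaped->set_value(); });
  } // Pool destructor waits for all queued tasks to finish.

  return starved->load();
}
```

With as many blockers as workers, the producing task cannot run until some 
blocker times out, so {{countStarved(2, 2, ...)}} reports at least one starved 
task; with one spare worker, {{countStarved(3, 2, ...)}} is expected to report 
none. Nothing in the sketch depends on the producing task being a reaper, which 
is the point made above.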

There is also no general guarantee that events coming from io::poll bypass 
Processes: they may themselves be scheduled through a Process in order to 
unblock the blocked Process.

To me it seems we should focus our attention on (1) not writing any blocking 
code in Mesos, which means updating the HDFS client and whichever other 
components are blocking (this would also allow us to reduce the number of 
libprocess threads needed by default), and (2) exploring general solutions to 
deadlocking (e.g. adding worker threads dynamically as needed, better blocking 
prevention enforcement, making blocking safe, etc.).

> Process reaper should have a dedicated thread to avoid deadlock.
> ----------------------------------------------------------------
>
>                 Key: MESOS-7122
>                 URL: https://issues.apache.org/jira/browse/MESOS-7122
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: James Peach
>
> In a test environment, we saw that libprocess can deadlock when the process 
> reaper is unable to run. 
> This happens in the Mesos HDFS client, which synchronously runs a {{hadoop}} 
> subprocess. If this happens too many times, the {{ReaperProcess}} is never 
> scheduled to reap the subprocess statuses. Since the HDFS {{Future}} never 
> completes, we deadlock with all the threads in the call stack below. If there 
> was a dedicated thread for the {{ReaperProcess}} to run on, or some other way 
> to ensure that it is scheduled, we could avoid the deadlock.
> {noformat}
> #0  0x00007f67b6ffc68c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x00007f67b6da12fc in 
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from 
> /usr/lib64/libstdc++.so.6
> #2  0x00007f67b8b864f6 in process::ProcessManager::wait(process::UPID const&) 
> () from /usr/lib64/libmesos-1.2.0.so
> #3  0x00007f67b8b8d347 in process::wait(process::UPID const&, Duration 
> const&) () from /usr/lib64/libmesos-1.2.0.so
> #4  0x00007f67b8b51a85 in process::Latch::await(Duration const&) () from 
> /usr/lib64/libmesos-1.2.0.so
> #5  0x00007f67b834fc9f in process::Future<Bytes>::await(Duration const&) 
> const () from /usr/lib64/libmesos-1.2.0.so
> #6  0x00007f67b833d700 in 
> mesos::internal::slave::fetchSize(std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> > const&) () from /usr/lib64/libmesos-1.2.0.so
> #7  0x00007f67b833df5e in 
> std::result_of<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
>  const&, mesos::CommandInfo const&, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags 
> const&)::{lambda()#2} ()()>::type 
> process::AsyncExecutorProcess::execute<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
>  const&, mesos::CommandInfo const&, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags 
> const&)::{lambda()#2}>(std::result_of const&, 
> boost::disable_if<std::result_of 
> const&::is_void<std::result_of<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
>  const&, mesos::CommandInfo const&, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags 
> const&)::{lambda()#2} ()()> >, void>::type*) () from 
> /usr/lib64/libmesos-1.2.0.so
> #8  0x00007f67b833a3d5 in std::_Function_handler<void 
> ()(process::ProcessBase*), process::Future<Try<Bytes, Error> > 
> process::dispatch<Try<Bytes, Error>, process::AsyncExecutorProcess, 
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, 
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, Option<std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID 
> const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*, 
> {lambda()#2}, 
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, 
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, Option<std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID 
> const&, mesos::internal::slave::Flags const&)::{lambda()#2} 
> const&>(process::PID<process::AsyncExecutorProcess> const&, process::Future 
> (process::PID::*)(mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
>  const&, mesos::CommandInfo const&, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags 
> const&)::{lambda()#2} const&, void*), {lambda()#2}, 
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, 
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, Option<std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID 
> const&, mesos::internal::slave::Flags const&)::{lambda()#2} 
> const&)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data const&, 
> process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #9  0x00007f67b8b85ede in 
> process::ProcessManager::resume(process::ProcessBase*) () from 
> /usr/lib64/libmesos-1.2.0.so
> #10 0x00007f67b8b8fc8f in 
> std::thread::_Impl<std::_Bind_simple<process::ProcessManager::init_threads()::{unnamed
>  type#1} ()()> >::_M_run() () from /usr/lib64/libmesos-1.2.0.so
> #11 0x00007f67b6da1470 in ?? () from /usr/lib64/libstdc++.so.6
> #12 0x00007f67b6ff8aa1 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f67b6a3faad in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
