[ https://issues.apache.org/jira/browse/MESOS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872132#comment-15872132 ]
Benjamin Mahler commented on MESOS-7122:
----------------------------------------

{quote}
Future::await is still called in many places in the code base and if all active actors block like this
{quote}

Have you done a survey? Where are they?

Using sleep would also cost us our Clock-based time control, which seems undesirable.

Another thought is that we don't need an explicit thread: we could use an actorless approach. That means, however, that the timer thread would execute the non-blocking waitpid calls. I'm not sure how long these can take in the worst case, and we generally don't want the timer thread making system calls, so this may not be an option.

What worries me about this approach is that we're being careful about using actors in libprocess for deadlock prevention reasons, which seems to be a symptom of a bigger issue. More generally, if Processes are blocking, then we may deadlock regardless of the reaper's involvement. Take the rate limiter, for example: an actorless implementation seems more feasible there, but again this looks like an optimization rather than something that has anything to do with deadlock prevention. Nor is it guaranteed in general that events coming from io::poll are not scheduled through a Process in order to unblock the blocked Process.

To me it seems we should focus our attention on:

(1) Not writing any blocking code in mesos, which means updating the hdfs client and whichever other components are blocking. This would allow us to reduce the number of libprocess threads needed by default.

(2) Exploring general solutions to deadlocking (e.g. adding worker threads dynamically as needed, better blocking prevention enforcement, making blocking safe, etc.).

> Process reaper should have a dedicated thread to avoid deadlock.
> ----------------------------------------------------------------
>
> Key: MESOS-7122
> URL: https://issues.apache.org/jira/browse/MESOS-7122
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: James Peach
>
> In a test environment, we saw that libprocess can deadlock when the process
> reaper is unable to run.
> This happens in the Mesos HDFS client, which synchronously runs a {{hadoop}}
> subprocess. If this happens too many times, the {{ReaperProcess}} is never
> scheduled to reap the subprocess statuses. Since the HDFS {{Future}} never
> completes, we deadlock with all the threads in the call stack below. If there
> was a dedicated thread for the {{ReaperProcess}} to run on, or some other way
> to ensure that it is scheduled, we could avoid the deadlock.
> {noformat}
> #0 0x00007f67b6ffc68c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f67b6da12fc in
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from
> /usr/lib64/libstdc++.so.6
> #2 0x00007f67b8b864f6 in process::ProcessManager::wait(process::UPID const&)
> () from /usr/lib64/libmesos-1.2.0.so
> #3 0x00007f67b8b8d347 in process::wait(process::UPID const&, Duration
> const&) () from /usr/lib64/libmesos-1.2.0.so
> #4 0x00007f67b8b51a85 in process::Latch::await(Duration const&) () from
> /usr/lib64/libmesos-1.2.0.so
> #5 0x00007f67b834fc9f in process::Future<Bytes>::await(Duration const&)
> const () from /usr/lib64/libmesos-1.2.0.so
> #6 0x00007f67b833d700 in
> mesos::internal::slave::fetchSize(std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&) () from /usr/lib64/libmesos-1.2.0.so
> #7 0x00007f67b833df5e in
> std::result_of<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2} ()()>::type
> process::AsyncExecutorProcess::execute<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2}>(std::result_of const&,
> boost::disable_if<std::result_of
> const&::is_void<std::result_of<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2} ()()> >, void>::type*) () from
> /usr/lib64/libmesos-1.2.0.so
> #8 0x00007f67b833a3d5 in std::_Function_handler<void
> ()(process::ProcessBase*), process::Future<Try<Bytes, Error> >
> process::dispatch<Try<Bytes, Error>, process::AsyncExecutorProcess,
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&,
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, Option<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID
> const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*,
> {lambda()#2},
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&,
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, Option<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID
> const&, mesos::internal::slave::Flags const&)::{lambda()#2}
> const&>(process::PID<process::AsyncExecutorProcess> const&, process::Future
> (process::PID::*)(mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2} const&, void*), {lambda()#2},
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&,
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, Option<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID
> const&, mesos::internal::slave::Flags const&)::{lambda()#2}
> const&)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data const&,
> process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #9 0x00007f67b8b85ede in
> process::ProcessManager::resume(process::ProcessBase*) () from
> /usr/lib64/libmesos-1.2.0.so
> #10 0x00007f67b8b8fc8f in
> std::thread::_Impl<std::_Bind_simple<process::ProcessManager::init_threads()::{unnamed
> type#1} ()()> >::_M_run() () from /usr/lib64/libmesos-1.2.0.so
> #11 0x00007f67b6da1470 in ?? () from /usr/lib64/libstdc++.so.6
> #12 0x00007f67b6ff8aa1 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f67b6a3faad in clone () from /lib64/libc.so.6
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)