[
https://issues.apache.org/jira/browse/MESOS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872235#comment-15872235
]
Benjamin Mahler commented on MESOS-7122:
----------------------------------------
It seems to me the statement applies generally:
{quote}
While I agree that blocking should be avoided, the point of this bug is that it
is possible for <some process not to make progress when the worker threads are
blocked>. The <process> has to be able to <make progress> in the unfortunate
event of code blocking on <a result from this process>.
{quote}
I'm just trying to understand what makes the reaper special here. As far as I
can tell we can apply this template to nearly every Process that exposes
futures to callers and justify using threads instead of Processes?
{quote}
Running a separate thread for each waitpid seems expensive but would work. You
could probably also implement this by having an event loop in kevent to monitor
the PIDs directly, or by using signalfd on Linux to intercept SIGCHLD and reap
any registered PIDs.
{quote}
Yeah, we should split out the discussion of non-delay-based child reaping into
a separate improvement, as it seems orthogonal to the issue of deadlock, since
AFAICT none of these solutions work for non-child processes, correct? Would it
be enough for you guys if we solve the child case but leave the non-child case
as a delay loop?
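For reference, the "separate thread for each waitpid" option quoted above could look roughly like the sketch below. It is not libprocess code: it uses {{std::future}} rather than {{process::Future}}, and {{reapInThread}} is a hypothetical name chosen for illustration. Since {{waitpid}} only works on children of the calling process, this covers the child case only.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <future>

// Hypothetical helper (not a libprocess API): blocks one dedicated thread in
// waitpid() until the given child exits, then publishes the raw wait status.
// Yields -1 if waitpid() fails, e.g. because `pid` is not our child.
std::future<int> reapInThread(pid_t pid)
{
  assert(pid > 0);

  return std::async(std::launch::async, [pid]() {
    int status = 0;
    pid_t result;

    // Retry if a signal interrupts the wait.
    do {
      result = ::waitpid(pid, &status, 0);
    } while (result == -1 && errno == EINTR);

    return result == pid ? status : -1;
  });
}
```

Because the wait runs on its own thread, it makes progress even when every worker thread is blocked. The cost is one thread per outstanding child, which is the expense mentioned in the quote.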
> Process reaper should have a dedicated thread to avoid deadlock.
> ----------------------------------------------------------------
>
> Key: MESOS-7122
> URL: https://issues.apache.org/jira/browse/MESOS-7122
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: James Peach
>
> In a test environment, we saw that libprocess can deadlock when the process
> reaper is unable to run.
> This happens in the Mesos HDFS client, which synchronously runs a {{hadoop}}
> subprocess. If this happens too many times, the {{ReaperProcess}} is never
> scheduled to reap the subprocess statuses. Since the HDFS {{Future}} never
> completes, we deadlock with all the threads in the call stack below. If there
> were a dedicated thread for the {{ReaperProcess}} to run on, or some other way
> to ensure that it is scheduled, we could avoid the deadlock.
> {noformat}
> #0 0x00007f67b6ffc68c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f67b6da12fc in
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from
> /usr/lib64/libstdc++.so.6
> #2 0x00007f67b8b864f6 in process::ProcessManager::wait(process::UPID const&)
> () from /usr/lib64/libmesos-1.2.0.so
> #3 0x00007f67b8b8d347 in process::wait(process::UPID const&, Duration
> const&) () from /usr/lib64/libmesos-1.2.0.so
> #4 0x00007f67b8b51a85 in process::Latch::await(Duration const&) () from
> /usr/lib64/libmesos-1.2.0.so
> #5 0x00007f67b834fc9f in process::Future<Bytes>::await(Duration const&)
> const () from /usr/lib64/libmesos-1.2.0.so
> #6 0x00007f67b833d700 in
> mesos::internal::slave::fetchSize(std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&) () from /usr/lib64/libmesos-1.2.0.so
> #7 0x00007f67b833df5e in
> std::result_of<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2} ()()>::type
> process::AsyncExecutorProcess::execute<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2}>(std::result_of const&,
> boost::disable_if<std::result_of
> const&::is_void<std::result_of<mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2} ()()> >, void>::type*) () from
> /usr/lib64/libmesos-1.2.0.so
> #8 0x00007f67b833a3d5 in std::_Function_handler<void
> ()(process::ProcessBase*), process::Future<Try<Bytes, Error> >
> process::dispatch<Try<Bytes, Error>, process::AsyncExecutorProcess,
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&,
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, Option<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID
> const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*,
> {lambda()#2},
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&,
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, Option<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID
> const&, mesos::internal::slave::Flags const&)::{lambda()#2}
> const&>(process::PID<process::AsyncExecutorProcess> const&, process::Future
> (process::PID::*)(mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID
> const&, mesos::CommandInfo const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> Option<std::basic_string<char, std::char_traits<char>, std::allocator<char> >
> > const&, mesos::SlaveID const&, mesos::internal::slave::Flags
> const&)::{lambda()#2} const&, void*), {lambda()#2},
> mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&,
> mesos::CommandInfo const&, std::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, Option<std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > const&, mesos::SlaveID
> const&, mesos::internal::slave::Flags const&)::{lambda()#2}
> const&)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data const&,
> process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #9 0x00007f67b8b85ede in
> process::ProcessManager::resume(process::ProcessBase*) () from
> /usr/lib64/libmesos-1.2.0.so
> #10 0x00007f67b8b8fc8f in
> std::thread::_Impl<std::_Bind_simple<process::ProcessManager::init_threads()::{unnamed
> type#1} ()()> >::_M_run() () from /usr/lib64/libmesos-1.2.0.so
> #11 0x00007f67b6da1470 in ?? () from /usr/lib64/libstdc++.so.6
> #12 0x00007f67b6ff8aa1 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f67b6a3faad in clone () from /lib64/libc.so.6
> {noformat}
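The starvation described in this issue can be shown in miniature with a fixed-size worker pool. The sketch below is not libprocess: {{Pool}} and {{starvedWaiters}} are hypothetical names, the pool stands in for the libprocess worker threads, and a queued task that fulfils a promise stands in for the {{ReaperProcess}}. The waiters use a timeout so that the example terminates; with an unbounded wait, as in the stack trace above, they would block forever.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical fixed-size pool standing in for the libprocess worker threads.
class Pool
{
public:
  explicit Pool(std::size_t n)
  {
    for (std::size_t i = 0; i < n; i++) {
      workers.emplace_back([this] { run(); });
    }
  }

  // Drains the queue, then joins all workers.
  ~Pool()
  {
    {
      std::lock_guard<std::mutex> lock(mutex);
      done = true;
    }
    cv.notify_all();
    for (std::thread& t : workers) {
      t.join();
    }
  }

  void submit(std::function<void()> f)
  {
    {
      std::lock_guard<std::mutex> lock(mutex);
      tasks.push(std::move(f));
    }
    cv.notify_one();
  }

private:
  void run()
  {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [this] { return done || !tasks.empty(); });
        if (tasks.empty()) {
          return; // Shutting down and nothing left to run.
        }
        f = std::move(tasks.front());
        tasks.pop();
      }
      f();
    }
  }

  std::mutex mutex;
  std::condition_variable cv;
  std::queue<std::function<void()>> tasks;
  std::vector<std::thread> workers;
  bool done = false;
};

// Queues `waiters` tasks that block on a result, then queues the "reaper"
// task that produces the result. Returns how many waiters timed out.
int starvedWaiters(
    std::size_t workers, std::size_t waiters, std::chrono::milliseconds timeout)
{
  assert(workers > 0);

  auto reaped = std::make_shared<std::promise<void>>();
  std::shared_future<void> result = reaped->get_future().share();
  std::atomic<int> timedOut{0};

  {
    Pool pool(workers);
    for (std::size_t i = 0; i < waiters; i++) {
      pool.submit([result, timeout, &timedOut] {
        if (result.wait_for(timeout) == std::future_status::timeout) {
          timedOut++;
        }
      });
    }
    // The "reaper" sits behind the waiters and needs a free worker to run.
    pool.submit([reaped] { reaped->set_value(); });
  } // Pool destructor waits for all queued tasks to finish.

  return timedOut.load();
}
```

When there are as many waiters as workers, the "reaper" task cannot be dequeued until at least one waiter gives up, so {{starvedWaiters(2, 2, ...)}} reports at least one timeout. With one spare worker, the "reaper" is expected to run promptly and no waiter should time out given a generous timeout.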