Andrei Sekretenko created MESOS-9808:
----------------------------------------
Summary: libprocess can deadlock on termination (cleanup() vs
use() + terminate())
Key: MESOS-9808
URL: https://issues.apache.org/jira/browse/MESOS-9808
Project: Mesos
Issue Type: Bug
Reporter: Andrei Sekretenko
Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt
Using process::loop() together with the common libprocess usage pattern
(a Process wrapper plus dispatching) is prone to deadlock on libprocess
termination if the code does not wait for the loop to exit before terminating.
*The deadlock itself is not directly caused by the process::loop(), though.*
It occurs in the following setup with two processes (let's name them A and B).
Thread 1 tries to clean up process A. It locks processes_mutex and hangs here:
[https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079]
waiting for the process A to have no strong references.
Thread 2 begins by creating a ProcessReference in
ProcessManager::deliver(UPID&) called for process A:
[https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799]
and ends up waiting for processes_mutex in ProcessManager::terminate() for
process B:
[https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155]
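The cycle described above can be reproduced in isolation. The following is a minimal standalone sketch, not libprocess code: run_deadlock_model(), refs_to_A, cleanup_has_mutex and terminate_blocked are hypothetical names, and only the name processes_mutex is borrowed from process.cpp. It assumes a plain counter is a fair stand-in for strong references to A, and it replaces the indefinite wait in terminate() with a timed lock attempt so that the model terminates instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical model of the two-thread cycle. Returns true if the
// "terminate(B)" step could not take the mutex, i.e. the real code would hang.
bool run_deadlock_model() {
  std::timed_mutex processes_mutex;        // stands in for the ProcessManager mutex
  std::atomic<int> refs_to_A{0};           // stands in for strong references to A
  std::atomic<bool> cleanup_has_mutex{false};
  std::atomic<bool> terminate_blocked{false};

  // Thread 1: cleanup(A) holds the mutex while waiting for all references
  // to A to expire.
  std::thread cleanup([&] {
    // Wait for the reference to exist, to force the problematic interleaving.
    while (refs_to_A.load() == 0) std::this_thread::yield();
    std::lock_guard<std::timed_mutex> lock(processes_mutex);
    cleanup_has_mutex.store(true);
    while (refs_to_A.load() > 0) std::this_thread::yield();
  });

  // Thread 2: deliver() to A holds a reference to A, and while holding it
  // ends up in terminate(B), which needs the same mutex.
  std::thread deliver([&] {
    refs_to_A.fetch_add(1);  // models creating a ProcessReference to A
    while (!cleanup_has_mutex.load()) std::this_thread::yield();
    // The real code would block here forever; the model gives up instead.
    if (processes_mutex.try_lock_for(std::chrono::milliseconds(100))) {
      processes_mutex.unlock();
    } else {
      terminate_blocked.store(true);
    }
    refs_to_A.fetch_sub(1);  // dropping the reference lets cleanup finish
  });

  cleanup.join();
  deliver.join();
  return terminate_blocked.load();
}
```

In this model the timed lock attempt can never succeed, because the mutex is released only after the reference is dropped, and the reference is dropped only after the lock attempt returns. That is the same circular wait as between Thread 1 and Thread 2.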
-----------------
In the observed case, terminate() for process B was triggered by a destructor
of a process-wrapping object owned by a libprocess loop executing on A.
I'm attaching the stacks captured at the deadlock. Stacks of the threads that
block one another are in deadlock_stacks_filtered.txt. Note frame #1 in Thread
5 (waiting for all references to expire) and frames #48 and #8 in Thread 19
(creating a reference and waiting for processes_mutex).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)