[
https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855677#comment-16855677
]
Andrei Sekretenko commented on MESOS-9808:
------------------------------------------
Great!
Unfortunately, my issue failed to show up in a reasonable number of runs.
So I gave up on reproducing it as is and instead managed to get into a
deadlock WITH a race more or less reliably: [https://reviews.apache.org/r/70782/]
The fix almost helps - with the fix, this setup hangs only sometimes, with a
slightly different stack: [^deadlock_stacks_with_fix.txt]
The event is being deleted by EventQueue::enqueue() this time...
> libprocess can deadlock on termination (cleanup() vs use() + terminate())
> -------------------------------------------------------------------------
>
> Key: MESOS-9808
> URL: https://issues.apache.org/jira/browse/MESOS-9808
> Project: Mesos
> Issue Type: Bug
> Reporter: Andrei Sekretenko
> Priority: Major
> Labels: foundations
> Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt,
> deadlock_stacks_with_fix.txt
>
>
> Using the process::loop() together with the common pattern of using
> libprocess (Process wrapper + dispatching) is prone to causing a deadlock on
> libprocess termination if the code does not wait for the loop exit before
> termination.
> *The deadlock itself is not directly caused by the process::loop(), though.*
> It occurs in the following setup with two processes (let's name them A and B).
> Thread 1 tries to clean up process A. It locks processes_mutex and hangs here:
>
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079]
> waiting for process A to have no strong references.
> Thread 2 begins by creating a ProcessReference in
> ProcessManager::deliver(UPID&) called for process A:
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799]
> and ends up waiting for processes_mutex in ProcessManager::terminate() for
> process B:
>
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155]
> -----------------
> In the observed case, terminate() for process B was triggered by a
> destructor of a process-wrapping object owned by a libprocess loop executing
> on A.
> I'm attaching the stacks captured at the deadlock. Stacks of the threads
> which lock one another are in [^deadlock_stacks_filtered.txt]. Note frame #1
> in Thread 5 (waiting for all references to expire) and frames #48 and #8 in
> Thread 19 (creating a reference and waiting for processes_mutex).
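
The cycle described in the quoted issue can be illustrated outside libprocess with a minimal sketch. Everything below is a hypothetical stand-in rather than libprocess code: a std::timed_mutex plays the role of processes_mutex, an atomic counter plays the role of the strong references to process A, and the function and struct names are invented for illustration. Unlike the real code, the "terminate" side gives up after a timeout so that the demo finishes instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins; none of these names are libprocess types.
struct DeadlockReport {
  bool cleanup_waited_on_reference = false;  // symptom seen by "Thread 1"
  bool terminate_blocked_on_mutex = false;   // symptom seen by "Thread 2"
};

DeadlockReport simulate_cleanup_vs_terminate() {
  using namespace std::chrono_literals;

  std::timed_mutex processes_mutex;     // models processes_mutex
  std::atomic<int> references_to_a{0};  // models strong references to A
  std::atomic<bool> mutex_held{false};
  DeadlockReport report;

  // "Thread 2": holds a reference to A, then needs the mutex to terminate B.
  std::thread user([&] {
    references_to_a.fetch_add(1);  // like creating a ProcessReference
    while (!mutex_held.load()) std::this_thread::yield();
    // The real code would block here forever; the sketch times out instead.
    if (processes_mutex.try_lock_for(100ms)) {
      processes_mutex.unlock();
    } else {
      report.terminate_blocked_on_mutex = true;
    }
    references_to_a.fetch_sub(1);  // break the cycle so the demo ends
  });

  // "Thread 1": locks the mutex, then waits for references to A to expire.
  std::thread cleaner([&] {
    while (references_to_a.load() == 0) std::this_thread::yield();
    std::lock_guard<std::timed_mutex> lock(processes_mutex);
    mutex_held.store(true);
    while (references_to_a.load() != 0) {
      report.cleanup_waited_on_reference = true;
      std::this_thread::yield();
    }
  });

  user.join();
  cleaner.join();
  return report;
}
```

Calling simulate_cleanup_vs_terminate() should report both symptoms at once: the cleaner waits for the reference while holding the mutex, and the reference holder cannot obtain the mutex. In the sketch the cycle is broken only because the reference is dropped after the timeout; the scenario in the issue has no such escape.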