[ 
https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855677#comment-16855677
 ] 

Andrei Sekretenko commented on MESOS-9808:
------------------------------------------

Great!
Unfortunately, my issue failed to show up in a reasonable number of runs.
So I gave up on that and instead managed to more or less reliably get into a
deadlock WITH a race: [https://reviews.apache.org/r/70782/]

The fix almost helps: with the fix, this setup hangs only sometimes, with a 
slightly different stack: [^deadlock_stacks_with_fix.txt]
The event is being deleted by EventQueue::enqueue() this time.

> libprocess can deadlock on termination (cleanup() vs use() + terminate())
> -------------------------------------------------------------------------
>
>                 Key: MESOS-9808
>                 URL: https://issues.apache.org/jira/browse/MESOS-9808
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Andrei Sekretenko
>            Priority: Major
>              Labels: foundations
>         Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt, 
> deadlock_stacks_with_fix.txt
>
>
> Using the process::loop() together with the common pattern of using 
> libprocess (Process wrapper + dispatching) is prone to causing a deadlock on 
> libprocess termination if the code does not wait for the loop exit before 
> termination.
> *The deadlock itself is not directly caused by the process::loop(), though.*
>  It occurs in the following setup with two processes (let's name them A and B).
> Thread 1 tries to clean up process A. It locks processes_mutex and hangs here:
>  
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079]
>  waiting for the process A to have no strong references.
> Thread 2 begins with creating a ProcessReference in 
> ProcessManager::deliver(UPID&) called for process A: 
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799]
> and ends up waiting for processes_mutex in ProcessManager::terminate() for 
> process B:
>  
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155]
> -----------------
>  In the observed case, terminate() for process B was triggered by a 
> destructor of a process-wrapping object owned by a libprocess loop executing 
> on A.
> I'm attaching the stacks captured at the deadlock. Stacks of the threads 
> which lock one another are in [^deadlock_stacks_filtered.txt]. Note frame #1 
> in Thread 5 (waiting for all references to expire) and frames #48 and #8 in 
> Thread 19 (creating a reference and waiting for processes_mutex).
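
For readers who want to see the cycle in isolation, here is a minimal sketch
of the interleaving described above. It does not use libprocess at all: the
Model struct, its fields and all three functions are hypothetical stand-ins,
and the timeout on the second thread is an assumption added only so that the
model reports the cycle instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for libprocess internals; none of these names are
// libprocess API.
struct Model {
  std::timed_mutex processes_mutex;  // plays the role of processes_mutex
  std::atomic<int> refs_to_A{0};     // strong references to process A
  std::atomic<bool> cleanup_has_lock{false};
};

// Thread 1: cleanup of A. Holds the mutex while waiting for all references
// to A to expire.
void cleanup_A(Model& m) {
  std::lock_guard<std::timed_mutex> lock(m.processes_mutex);
  m.cleanup_has_lock = true;
  while (m.refs_to_A.load() > 0) {
    std::this_thread::yield();
  }
}

// Thread 2: takes a reference to A (as deliver() does), then needs the mutex
// to terminate B. Returns true if the mutex was acquired within `budget`.
// In this model the real code would block here indefinitely; the timeout
// lets the model give up and report the failure.
bool terminate_B_while_referencing_A(Model& m,
                                     std::chrono::milliseconds budget) {
  m.refs_to_A.fetch_add(1);
  // Force the problematic interleaving: wait until cleanup holds the mutex.
  while (!m.cleanup_has_lock.load()) {
    std::this_thread::yield();
  }
  bool acquired = m.processes_mutex.try_lock_for(budget);
  if (acquired) {
    m.processes_mutex.unlock();
  }
  m.refs_to_A.fetch_sub(1);  // released only after the terminate attempt
  return acquired;
}

// Runs the interleaving once; returns true if the lock cycle was observed.
bool deadlock_observed(std::chrono::milliseconds budget) {
  Model m;
  bool acquired = true;
  std::thread t2([&] {
    acquired = terminate_B_while_referencing_A(m, budget);
  });
  // Make sure the reference exists before cleanup starts.
  while (m.refs_to_A.load() == 0) {
    std::this_thread::yield();
  }
  std::thread t1([&] { cleanup_A(m); });
  t1.join();
  t2.join();
  return !acquired;
}
```

Calling deadlock_observed(std::chrono::milliseconds(100)) returns true in this
model: the second thread can never take the mutex while it still holds its
reference to A, and cleanup only finishes once that thread gives up and drops
the reference.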



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
