[ https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855677#comment-16855677 ]
Andrei Sekretenko commented on MESOS-9808:
------------------------------------------

Great!

Unfortunately, my issue failed to show up in a reasonable number of runs. So I gave up and managed to more or less reliably get into a deadlock WITH a race:
[https://reviews.apache.org/r/70782/]

The fix almost helps - with the fix, this setup hangs only sometimes, with a slightly different stack: [^deadlock_stacks_with_fix.txt]
The event is being deleted by EventQueue::enqueue() this time....

> libprocess can deadlock on termination (cleanup() vs use() + terminate())
> -------------------------------------------------------------------------
>
> Key: MESOS-9808
> URL: https://issues.apache.org/jira/browse/MESOS-9808
> Project: Mesos
> Issue Type: Bug
> Reporter: Andrei Sekretenko
> Priority: Major
> Labels: foundations
> Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt,
> deadlock_stacks_with_fix.txt
>
>
> Using process::loop() together with the common pattern of using
> libprocess (Process wrapper + dispatching) is prone to causing a deadlock on
> libprocess termination if the code does not wait for the loop exit before
> termination.
> *The deadlock itself is not directly caused by process::loop(), though.*
> It occurs in the following setup with two processes (let's name them A and B).
> Thread 1 tries to clean up process A. It locks processes_mutex and hangs here:
>
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079]
> waiting for process A to have no strong references.
> Thread 2 begins with creating a ProcessReference in
> ProcessManager::deliver(UPID&) called for process:
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799]
> and ends up waiting for processes_mutex in ProcessManager::terminate() for
> process B:
>
> [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155]
> -----------------
> In the observed case, terminate() for process B was triggered by a
> destructor of a process-wrapping object owned by a libprocess loop executing
> on A.
> I'm attaching the stacks captured at the deadlock. Stacks of the threads
> which lock one another are in [^deadlock_stacks_filtered.txt] Note frame #1
> in Thread 5 (waiting for all references to expire) and frames #48 and #8 in
> Thread 19 (creating a reference and waiting for a processes_mutex).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
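
The cycle described in the issue (thread 1 holds processes_mutex and waits for references to A to expire; thread 2 holds a reference to A and waits for processes_mutex) can be modeled as a wait-for graph. The sketch below is only an illustration under that assumption and is not libprocess code: WaitForGraph, hasDeadlock, reportedScenario and loopExitedFirst are hypothetical names, and the second scenario simply assumes that no reference to A is held while cleanup runs.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model, not libprocess code: an edge "X -> Y" means
// "X cannot proceed until Y releases something it holds".
using WaitForGraph = std::map<std::string, std::set<std::string>>;

// Depth-first search: can `target` be reached starting from `from`?
static bool reaches(const WaitForGraph& graph,
                    const std::string& from,
                    const std::string& target,
                    std::set<std::string>& visited)
{
  auto it = graph.find(from);
  if (it == graph.end()) {
    return false;
  }
  for (const std::string& next : it->second) {
    if (next == target) {
      return true;
    }
    if (visited.insert(next).second &&
        reaches(graph, next, target, visited)) {
      return true;
    }
  }
  return false;
}

// A deadlock exists if any node can reach itself (a cycle).
bool hasDeadlock(const WaitForGraph& graph)
{
  for (const auto& entry : graph) {
    std::set<std::string> visited;
    if (reaches(graph, entry.first, entry.first, visited)) {
      return true;
    }
  }
  return false;
}

// Scenario as described in the issue.
WaitForGraph reportedScenario()
{
  WaitForGraph g;
  // Thread 1: holds processes_mutex in cleanup of A, waits for all
  // strong references to A to expire; thread 2 holds one.
  g["thread1"].insert("thread2");
  // Thread 2: holds a ProcessReference to A (created in deliver()),
  // waits for processes_mutex in terminate() of B; thread 1 holds it.
  g["thread2"].insert("thread1");
  return g;
}

// Assumed alternative: thread 2 holds no reference to A during cleanup,
// so only thread 2 waits (for the mutex) and the cycle disappears.
WaitForGraph loopExitedFirst()
{
  WaitForGraph g;
  g["thread2"].insert("thread1");
  return g;
}

// Small self-check of both scenarios.
void checkScenarios()
{
  assert(hasDeadlock(reportedScenario()));
  assert(!hasDeadlock(loopExitedFirst()));
}
```

Calling checkScenarios() from a main() exercises both cases. The model says nothing about how the actual fix works; it only shows why the two waits in the report form a cycle.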