[
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517624#comment-15517624
]
ASF subversion and git services commented on QPID-7317:
-------------------------------------------------------
Commit 037c5738734d8fecb7b7f7e7af4e4f14f9cd3a64 in qpid-python's branch
refs/heads/master from [~aconway]
[ https://git-wip-us.apache.org/repos/asf?p=qpid-python.git;h=037c573 ]
QPID-7317: Fix hangs in qpid.messaging.
The hang is observed in processes using qpid.messaging: a thread is blocked waiting
for the Selector to wake it, but there is no Selector.run thread left to do so.
This patch removes all the known ways this hang can occur. We either function
normally or immediately raise an exception and log a message starting with
"qpid.messaging:" to the "qpid.messaging" logger.
The following issues are fixed:
1. The Selector.run() thread raises a fatal exception.
Any subsequent use of qpid.messaging re-raises that exception immediately instead
of hanging.
2. The process forks, so the child has no Selector thread.
https://issues.apache.org/jira/browse/QPID-5637 resets the Selector after a
fork. In addition we now (see the first sketch after this list):
- Close Selector.waiter: its file descriptors are shared with the parent, and the
two processes can cause havoc by "stealing" each other's wakeups.
- Replace Endpoint._lock in related endpoints with a BrokenLock. If the parent
is holding locks when it forks, they remain locked forever in the child;
BrokenLock.acquire() raises instead of hanging.
3. Selector.stop() called via atexit.
Selector.stop was registered via atexit, which could cause a hang if
qpid.messaging was used in a later-executing atexit function. That registration
has been removed; Selector.run() runs in a daemon thread, so stop() is not
needed at exit.
4. User calls Selector.stop() directly.
There is no reason to do this for the default Selector used by qpid.messaging,
so for that case stop() is now ignored. It works as before for code that creates
its own qpid.Selector instances (see the stop() sketch after this list).
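The fork handling in issue 2 can be sketched roughly as follows. BrokenLock
matches the name used above, but default_selector, close_waiter and
_ForkAwareSelector are illustrative names for this sketch, not the actual
qpid.messaging code:

    import os
    import threading

    class BrokenLock(object):
        """Substituted for an endpoint's lock in a forked child: the parent may
        have been holding the real lock at fork() time, so it would never be
        released in the child. acquire() raises instead of hanging."""

        def acquire(self, blocking=True):
            raise RuntimeError(
                "qpid.messaging: object is unusable after fork(); "
                "create a new Connection in the child process")

        def release(self):
            pass

        def __enter__(self):
            return self.acquire()

        def __exit__(self, *exc_info):
            self.release()

    class _ForkAwareSelector(object):
        """Minimal placeholder for the real selector (see the sketch above)."""

        def __init__(self):
            self.waiter = os.pipe()          # wakeup pipe, like Selector.waiter

        def close_waiter(self):
            # Drop the descriptors inherited from the parent so the two
            # processes cannot steal each other's wakeups.
            for fd in self.waiter:
                os.close(fd)

    _default = None                          # process-wide selector instance
    _default_pid = None                      # pid of the process that created it
    _guard = threading.Lock()

    def default_selector():
        """Return the shared selector, replacing it when the pid has changed,
        i.e. when we are running in a forked child."""
        global _default, _default_pid
        with _guard:
            if _default is None or _default_pid != os.getpid():
                if _default is not None:
                    _default.close_waiter()  # close fds shared with the parent
                _default = _ForkAwareSelector()
                _default_pid = os.getpid()
            return _default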
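Issues 3 and 4 amount to the stop() policy sketched below; the is_default flag
and the class name are assumptions made for illustration, not the actual API:

    import threading

    class StoppableSelector(object):
        """run() lives in a daemon thread, so no atexit hook is needed (issue 3),
        and stop() is ignored for the shared default instance (issue 4)."""

        def __init__(self, is_default=False):
            self.is_default = is_default
            self.stopped = threading.Event()
            self.thread = threading.Thread(target=self.run, name="qpid-selector")
            self.thread.daemon = True    # interpreter exit never waits on this thread
            self.thread.start()

        def run(self):
            while not self.stopped.is_set():
                self.stopped.wait(0.1)   # placeholder for the real select/poll loop

        def stop(self):
            if self.is_default:
                # Stopping the shared selector would strand every thread still
                # waiting on it, so the request is ignored for that instance.
                return
            self.stopped.set()
            self.thread.join()

User-created instances behave as before: StoppableSelector().stop() sets the flag
and joins the thread, while an instance created with is_default=True simply
ignores the call.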
> Deadlock on publish
> -------------------
>
> Key: QPID-7317
> URL: https://issues.apache.org/jira/browse/QPID-7317
> Project: Qpid
> Issue Type: Bug
> Components: Python Client
> Affects Versions: 0.32
> Environment: python-qpid-0.32-13.fc23.noarch
> Reporter: Brian Bouterse
> Assignee: Alan Conway
> Attachments: bad_child.py, bad_child.py, bt.txt, lsof.txt,
> spout-hang-trace.txt, spout-hang.py, taabt.txt
>
>
> When publishing a task with qpid.messaging it deadlocks and our application
> cannot continue. This has not been a problem for several releases, but within
> a few days recently, another Satellite developer and I both experienced the
> issue on separate machines running different distros. He is using an MRG-built
> package (not sure of the version). I am using python-qpid-0.32-13.fc23.
> Core dumps were taken of both deadlocked processes, and they show only 1 Qpid
> thread where I expect there to be 2. There are other mongo threads, but those
> are idle as expected and not related. The traces show our application calling
> into qpid.messaging to publish a message to the message bus.
> This problem happens intermittently, and in cases where the message publish is
> successful I've verified by core dump that there are the expected 2 threads
> for Qpid.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]