Ravi Nirmal created QPIDJMS-534:
-----------------------------------
Summary: BalancedProviderFuture.sync stuck forever during
connection recovery
Key: QPIDJMS-534
URL: https://issues.apache.org/jira/browse/QPIDJMS-534
Project: Qpid JMS
Issue Type: Bug
Components: qpid-jms-client
Affects Versions: 0.42.0
Reporter: Ravi Nirmal
Attachments: full-thread-dump.txt, logs.txt, partial-thread-dump.txt
We recently observed an issue in our production environment where
BalancedProviderFuture.sync blocks forever during connection recovery and
never returns. We have seen this on 2 hosts in the last week, and the only
remedy is to restart the server.
I am attaching thread dumps that show the issue and how it blocks other
threads: [^partial-thread-dump.txt] contains only the stuck threads and
[^full-thread-dump.txt] contains all threads.
h3. Details of Investigation
* This issue is happening on connection recovery during failover from one
server to another.
* From debugging, I can see that BalancedProviderFuture.sync is waiting for
its state to be updated, and that state is updated by the AmqpProvider thread.
The thread dump shows no AmqpProvider thread in a stuck state, which suggests
that AmqpProvider has finished its work but the state of the given
BalancedProviderFuture object was still never updated.
* In the successful case, I can see that the state of the
BalancedProviderFuture object is updated in the following sequence:
** JmsSession.onConnectionRecovery creates a BalancedProviderFuture object and
then calls provider.create.
** provider.create (aka AmqpProvider.create) starts a thread using the
serializer. The create method has proper handling: it either calls
pumpToProtonTransport or calls request.onFailure (which updates the state of
the BalancedProviderFuture in case of an exception).
** Once that thread finishes (basically after pumpToProtonTransport), the
serializer calls AmqpProvider.onData, which updates the state of the
BalancedProviderFuture object.
* I have observed that if an exception is thrown in AmqpProvider.onData, the
state of the BalancedProviderFuture is not updated and
BalancedProviderFuture.sync blocks forever. The exception can occur when the
protonTransport tail is already closed (probably because of an idle timeout
or some other transport-related issue).
* I have also observed that in some cases (idle timeout or transport errors),
after the thread started by provider.create (aka AmqpProvider.create)
completes, the serializer does not call AmqpProvider.onData but instead calls
AmqpProvider.onTransportError or AmqpProvider.onTransportClosed. I cannot see
any handling in onTransportError or onTransportClosed that updates the state
of the BalancedProviderFuture object.
* I am attaching [^logs.txt], which shows some errors. These errors occurred
when the state of the BalancedProviderFuture was not updated and the sync
method blocked forever.
* Please note that we are using the URL
failover:(amqp://localhost:5672\\,amqp://localhost:5682)?jms.sendTimeout=5000
with Qpid JMS version 0.42.0.
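
The failure mode described in the investigation above can be modeled with a small self-contained sketch. It is not the Qpid JMS code: DemoFuture, simulate, and the single-thread executor standing in for the serializer are all hypothetical names used for illustration only. It assumes the waiter is released only when some callback completes the future, and it uses a timed wait so the demo terminates, whereas the reported sync call never returns. The failPendingOnError flag shows one possible shape of the missing handling (failing the pending request when the transport errors out), not a confirmed fix.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StuckSyncDemo {

    /** Hypothetical stand-in for a provider future; NOT the real BalancedProviderFuture. */
    static final class DemoFuture {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile Throwable failure;

        void onSuccess() {
            done.countDown();
        }

        void onFailure(Throwable cause) {
            failure = cause;
            done.countDown();
        }

        /** Timed wait so the demo ends; the sync() in the report waits without a limit. */
        boolean sync(long timeout, TimeUnit unit) throws InterruptedException {
            return done.await(timeout, unit);
        }

        Throwable failure() {
            return failure;
        }
    }

    /**
     * Simulates one request handled on a single "serializer" thread.
     * Returns true if the future was completed (success or failure) within the
     * wait limit, false if the waiting thread would still be blocked.
     */
    public static boolean simulate(boolean transportFails, boolean failPendingOnError) {
        ExecutorService serializer = Executors.newSingleThreadExecutor();
        DemoFuture request = new DemoFuture();
        try {
            serializer.execute(() -> {
                try {
                    if (transportFails) {
                        // Assumed error: models "transport tail already closed".
                        throw new IllegalStateException("transport tail already closed");
                    }
                    request.onSuccess();
                } catch (RuntimeException e) {
                    if (failPendingOnError) {
                        request.onFailure(e); // waiter is released with an error
                    }
                    // Otherwise the error path never touches the future.
                }
            });
            return request.sync(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            serializer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("normal path completed:          " + simulate(false, false));
        System.out.println("error, no handling, completed:  " + simulate(true, false));
        System.out.println("error, fail pending, completed: " + simulate(true, true));
    }
}
```

In this model the second case is the one that matches the thread dumps: the serializer thread has finished and is idle, yet the caller is still waiting because nothing on the error path completed the future.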
I found two older tickets, QPIDJMS-458 and QPIDJMS-464, which describe a
similar issue, but I believe this issue is different and might need to be fixed
separately.
Can someone please take a look at this? It has become a critical issue in our
production environment, and we have no option other than restarting our
services.