Ravi Nirmal created QPIDJMS-534:
-----------------------------------

             Summary: BalancedProviderFuture.sync stuck forever during 
connection recovery
                 Key: QPIDJMS-534
                 URL: https://issues.apache.org/jira/browse/QPIDJMS-534
             Project: Qpid JMS
          Issue Type: Bug
          Components: qpid-jms-client
    Affects Versions: 0.42.0
            Reporter: Ravi Nirmal
         Attachments: full-thread-dump.txt, logs.txt, partial-thread-dump.txt

Recently, we observed an issue in our production environment where the 
BalancedProviderFuture.sync method gets stuck forever during connection 
recovery and never returns. We have observed this on 2 hosts in the last week; 
the only workaround is to restart the server.

I am attaching thread dumps that show the issue and how it blocks other 
threads: [^partial-thread-dump.txt] contains the stuck threads and 
[^full-thread-dump.txt] contains all the threads.
h3. Details of Investigation
 * This issue happens on connection recovery during failover from one 
server to another.
 * By debugging, I can see that the BalancedProviderFuture.sync method is 
waiting for its state to be updated, and that state is updated by the 
AmqpProvider thread. In the thread dump I don't see any AmqpProvider thread in 
a stuck state, which indicates that AmqpProvider has done its job, yet the 
state of the given BalancedProviderFuture object is still not updated.
 * In the successful case, I can see that the state of the 
BalancedProviderFuture object is updated in the following sequence:
 ** The JmsSession.onConnectionRecovery method calls provider.create after 
creating a BalancedProviderFuture object.
 ** provider.create (aka AmqpProvider.create) starts a thread using the 
serializer. This create method has proper handling: it either calls 
pumpToProtonTransport or request.onFailure (which updates the state of 
BalancedProviderFuture in case of an exception).
 ** Once that thread finishes (basically after pumpToProtonTransport), the 
serializer calls the AmqpProvider.onData method, which updates the state of 
the BalancedProviderFuture object.
 * I have observed that if an exception is thrown in the AmqpProvider.onData 
method, the state of BalancedProviderFuture is not updated and the 
BalancedProviderFuture.sync method gets stuck forever. The exception can occur 
when the protonTransport tail is already closed (probably because of an idle 
timeout or some other transport-related issue).
 * I have also observed that in some cases (idle timeout or transport errors), 
after the thread started by provider.create (aka AmqpProvider.create) 
completes, the serializer does not call AmqpProvider.onData but instead calls 
AmqpProvider.onTransportError or AmqpProvider.onTransportClosed. I cannot see 
any handling in the onTransportError or onTransportClosed methods that updates 
the state of the BalancedProviderFuture object.
 * I am attaching [^logs.txt], which shows some errors. These errors occurred 
when the state of BalancedProviderFuture was not updated and the sync method 
got stuck forever.
 * Please note that we are using the URL 
failover:(amqp://localhost:5672\\,amqp://localhost:5682)?jms.sendTimeout=5000 
and Qpid JMS version 0.42.0.
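
To make the suspected sequence easier to follow, here is a minimal standalone sketch of it. This is not Qpid JMS code: SimpleFuture, SketchProvider, and the failPendingOnError flag are hypothetical names invented for illustration, and the wait is bounded at 500 ms only so that the demo terminates. It assumes a single serializer thread and shows that a request stays incomplete when a transport error handler does not fail the pending futures, and that it completes (with a failure) when the handler does.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StuckFutureSketch {

    /** Hypothetical stand-in for a provider future: sync() returns true only once completed. */
    static final class SimpleFuture {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile Throwable error;

        void onSuccess() { done.countDown(); }

        void onFailure(Throwable cause) { error = cause; done.countDown(); }

        /** Bounded wait so the demo terminates; false means the future was never completed. */
        boolean sync(long timeoutMillis) {
            try {
                return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        Throwable error() { return error; }
    }

    /** Hypothetical provider: all work runs on one serializer thread. */
    static final class SketchProvider {
        private final ExecutorService serializer = Executors.newSingleThreadExecutor();
        // Accessed only from the serializer thread.
        private final List<SimpleFuture> pending = new ArrayList<>();
        private final boolean failPendingOnError;

        SketchProvider(boolean failPendingOnError) {
            this.failPendingOnError = failPendingOnError;
        }

        /** Registers the request; completion is expected from a later callback. */
        void create(SimpleFuture request) {
            serializer.execute(() -> pending.add(request));
        }

        /** Transport error arrives instead of the data callback that would complete the request. */
        void onTransportError(Throwable cause) {
            serializer.execute(() -> {
                if (failPendingOnError) {
                    for (SimpleFuture f : pending) {
                        f.onFailure(cause);
                    }
                    pending.clear();
                }
                // Otherwise nothing completes the pending futures, so sync() keeps waiting.
            });
        }

        void shutdown() { serializer.shutdown(); }
    }

    /** Runs one create-then-transport-error cycle and reports whether the request completed. */
    static boolean recoverCompletes(boolean failPendingOnError) {
        SketchProvider provider = new SketchProvider(failPendingOnError);
        try {
            SimpleFuture request = new SimpleFuture();
            provider.create(request);
            provider.onTransportError(new IOException("transport closed"));
            return request.sync(500);
        } finally {
            provider.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("pending requests not failed: completed=" + recoverCompletes(false));
        System.out.println("pending requests failed:     completed=" + recoverCompletes(true));
    }
}
```

In the first case the caller would block indefinitely if the wait had no timeout, which matches what we see in the thread dump. Whether failing pending requests from onTransportError or onTransportClosed is the right fix for the real client is only my guess.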

I have found two old tickets, QPIDJMS-458 and QPIDJMS-464, which show a similar 
issue, but I believe this issue is different and might need to be fixed 
separately.

Can someone please take a look at this? It has become a critical issue in our 
production environment, and we don't have any option other than restarting our 
services.



