[jira] [Commented] (ARTEMIS-2870) CORE connection failure sometimes doesn't cleanup sessions

Markus Meierhofer (Jira) Tue, 16 Mar 2021 02:30:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302361#comment-17302361
 ]


Markus Meierhofer commented on ARTEMIS-2870:
--------------------------------------------

Hello,

I reopened the issue because I can now reproduce it consistently. I found the 
reproduction while testing a fix for 
https://issues.apache.org/jira/browse/ARTEMIS-3174.

The issue only happens when the client connection's confirmationWindowSize is 
not -1 and session reattachment actually takes place (the client reconnects in 
less than connectionTTL, so the sessions of the old connection still exist on 
the server). The log lines showing that only the connection, and not its 
sessions, is cleaned up seem to be desirable in most cases (after session 
reattachment has happened). But when the connection linked to the reattached 
sessions later fails (and the client does not reattach again), those sessions 
are not cleaned up either, leading to "dead sessions".

Without the fix I provided for ARTEMIS-3174, the issue can be reproduced as 
follows:
 * Have a client with a CORE connection, confirmationWindowSize != -1, 
reconnectAttempts=-1 and connectionTTL=60 seconds (default) that is actively 
sending and receiving messages
 * Block client connection to broker (e.g. iptables drop communication port) 
for LESS than 60 seconds (=connection TTL)
 * Ensure that the Artemis client has detected the connection loss ("Connection 
failure to... has been detected"); this doesn't happen every time (for whatever 
reason). In our case, we have a "callTimeout" of 4 seconds, after which the 
client in most cases detects the connection failure, e.g.
{code:java}
[WARN 2021-03-16 08:01:29,243 l-threads)  
org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to 
fms/10.1.4.204:61616 has been detected: AMQ219014: Timed out after waiting 
4,000 ms for response when sending packet 71 [code=CONNECTION_TIMEDOUT]
{code}
 

 * Allow the client connection to the broker again before the 60 seconds run 
out (ideally after ~20 seconds)
 * The client will fail over to a new connection and transfer the sessions, as 
they still exist on the broker. This works in general, but when checking the 
broker console you can see that the new connection reports "0" sessions and the 
sessions still report the old connection ID. Although the sessions reference 
the old connection, they were successfully reattached to the new connection 
(the client is able to communicate flawlessly after failover). This is the 
first bug, as described in ARTEMIS-3174, which I fixed by setting the new 
connection on the ServerSessionImpl during session reattachment 
(https://github.com/apache/activemq-artemis/pull/3486).
 * If you now stop the client non-gracefully (don't close any 
consumers/producers/sessions/connection), the broker will close the connection 
about 60 seconds later, but none of the previously reattached sessions. This is 
exactly the issue described in this ticket.
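
For reference, the client settings named in the steps above could be expressed 
as a connection URL along the lines of the sketch below. This is only an 
illustration: the class and method names are made up, the 
confirmationWindowSize value of 1048576 is an arbitrary assumption (any value 
other than -1 should do), the host and port are taken from the log line above, 
and it assumes these settings are passed as URL query parameters in 
milliseconds.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Artemis: builds a CORE connection URL
// carrying the client settings used in the reproduction steps.
class ReproUrl {

    static String build(String host, int port, int confirmationWindowSize) {
        Map<String, String> params = new LinkedHashMap<>();
        // Must be != -1 for the issue to appear (value itself is an assumption).
        params.put("confirmationWindowSize", Integer.toString(confirmationWindowSize));
        // Retry forever, so the client reattaches once the network is back.
        params.put("reconnectAttempts", "-1");
        // Default connection TTL of 60 seconds, in milliseconds.
        params.put("connectionTTL", "60000");
        // callTimeout of 4 seconds, in milliseconds.
        params.put("callTimeout", "4000");

        String query = params.entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining("&"));
        return "tcp://" + host + ":" + port + "?" + query;
    }

    public static void main(String[] args) {
        System.out.println(build("10.1.4.204", 61616, 1048576));
    }
}
```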

I applied my fix for ARTEMIS-3174 (set new connection in ServerSessionImpl), 
built the broker locally myself and retried the scenario. The reattached 
sessions now report the new connection ID, BUT the original issue still exists: 
If you non-gracefully stop the client using sessions that were previously 
reattached, the broker ~60 seconds later will only close the connection, but 
not the linked sessions.

The issue here might be that the reattached sessions are not registered as 
"failure listeners" on the new connection during reattachment. I will 
investigate this further, but it would be great if you could also look into it 
and possibly provide a fix.
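
To make the suspected mechanism concrete, here is a toy model of it. It is not 
Artemis code: ToyConnection, ToySession and ReattachDemo are hypothetical 
classes, and the model only assumes that a session gets closed when the 
connection it is registered on as a failure listener fails.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, not the broker implementation.
class ToyConnection {
    final String id;
    private final List<Runnable> failureListeners = new ArrayList<>();

    ToyConnection(String id) {
        this.id = id;
    }

    void addFailureListener(Runnable listener) {
        failureListeners.add(listener);
    }

    // Stands in for the broker closing the connection after the TTL expires.
    void fail() {
        for (Runnable listener : new ArrayList<>(failureListeners)) {
            listener.run();
        }
    }
}

class ToySession {
    ToyConnection connection;
    boolean closed;
    private final Runnable onFailure = () -> closed = true;

    ToySession(ToyConnection connection) {
        this.connection = connection;
        connection.addFailureListener(onFailure);
    }

    // registerListener=false models the suspected bug: the session points at
    // the new connection but is never told when that connection fails.
    void reattach(ToyConnection newConnection, boolean registerListener) {
        this.connection = newConnection;
        if (registerListener) {
            newConnection.addFailureListener(onFailure);
        }
    }
}

class ReattachDemo {
    static boolean closedAfterFailure(boolean registerListener) {
        ToyConnection oldConnection = new ToyConnection("old");
        ToySession session = new ToySession(oldConnection);
        ToyConnection newConnection = new ToyConnection("new");
        session.reattach(newConnection, registerListener);
        newConnection.fail(); // client stopped non-gracefully
        return session.closed;
    }

    public static void main(String[] args) {
        System.out.println("without listener: closed=" + closedAfterFailure(false));
        System.out.println("with listener:    closed=" + closedAfterFailure(true));
    }
}
```

In this model the session survives the failure of the new connection unless it 
was registered on it during reattachment, which matches the "dead sessions" 
seen on the broker. Whether the real code path behaves this way still needs to 
be confirmed.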

 

Thanks and best regards,

Markus Meierhofer

> CORE connection failure sometimes doesn't cleanup sessions
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-2870
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2870
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.10.1, 2.14.0, 2.15.0
>            Reporter: Markus Meierhofer
>            Priority: Blocker
>             Fix For: 2.16.0
>
>         Attachments: all_connections_list.png, artemis.log, broker.log, 
> broker.xml, connection_nonexistent.png, consumer_list_for_one_queue.png, 
> duplicated consumers.png, multiple_consumers_per_queue.png, 
> session_with_connection_id.png, three consumers per queue.png
>
>
> h3. Summary
> Since the upgrade of our deployed artemis instances from version 2.6.4 to 
> 2.10.1 we have noticed the problem that sometimes, a connection failure 
> doesn't include the cleanup of its connected sessions, leading to "zombie" 
> consumers and producers on queues.
>  
> h3. The issue
> Our Artemis Clients are connected to the broker via the provided JMS 
> abstraction, using the default connection TTL of 60 seconds. We are using 
> both JMS Topics and JMS Queues.
> As most of our Clients are mobile and in a WiFi, connection losses may occur 
> frequently, depending on the quality of the network. When the client is 
> disconnected for 60 seconds, the broker usually closes the connection and 
> cleans up all the sessions connected to it. The mobile Clients then 
> reconnect when they are online again. What we have noticed is that after many 
> connection failures, messages may be sent twice to the mobile clients. 
> When analyzing the problem on the broker console, we found out that there 
> were two consumers connected to each of the queues one mobile client usually 
> consumes from. One of them belonged to the new connection of the mobile 
> Client, which is fine.
> The other consumer belonged to a session whose connection had already failed 
> and been closed at that time. When analyzing the logs, we saw that for these 
> connections, they contained a "Connection failure to ... has been detected" 
> line, but no subsequent "clearing up resources for session ..." log lines.
>  
> h3. Instance of the issue
>  
> The broken Session is the "7a9292cb-xxx" in the picture. In the logs you can 
> see that the connection failure was detected, but the session was never 
> cleared by the broker (mind the timestamp).
> !duplicated consumers.png!
> {code:java}
> [WARN 2020-07-27 14:33:29,794  Thread-13  
> org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to 
> /10.255.0.2:54812 has been detected: syscall:read(..) failed: Connection 
> reset by peer [code=GENERIC_EXCEPTION]
> [WARN 2020-07-29 09:31:30,828 Thread-20   
> org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to 
> /10.255.0.2:55994 has been detected: AMQ229014: Did not receive data from 
> /10.255.0.2:55994 within the 60,000ms connection TTL. The connection will now 
> be closed. [code=CONNECTION_TIMEDOUT]
> {code}
>  
> Attached you can find the full [^artemis.log] and our [^broker.xml]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
