[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300475#comment-15300475 ]

Paulo Motta commented on CASSANDRA-8621:

Closing this because the issue that originated this ticket was likely caused by CASSANDRA-11286 and stream sockets will no longer be idle after CASSANDRA-11841, so a closed/reset stream socket will generally mean the node is unreachable (see more details above).

> For streaming operations, when a socket is closed/reset, we should
> retry/reinitiate that stream
> ---
>
> Key: CASSANDRA-8621
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8621
> Project: Cassandra
> Issue Type: Improvement
> Components: Streaming and Messaging
> Reporter: Jeremy Hanna
> Assignee: Paulo Motta
>
> Currently we have a setting (streaming_socket_timeout_in_ms) that will
> timeout and retry the stream operation in the case where tcp is idle for a
> period of time. However in the case where the socket is closed or reset, we
> do not retry the operation. This can happen for a number of reasons,
> including when a firewall sends a reset message on a socket during a
> streaming operation, such as nodetool rebuild necessarily across DCs or
> repairs.
> Doing a retry would make the streaming operations more resilient. It would
> be good to log the retry clearly as well (with the stream session ID and node
> address).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
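For illustration only, here is a minimal Java sketch of the behaviour the ticket description asks for: a closed or reset socket triggers a bounded number of retries, and each retry is logged with the stream session ID and node address. The names `StreamRetrySketch`, `withRetry`, and `demo` are hypothetical and are not Cassandra APIs; the retry limit stands in for something like the max_streaming_retries property mentioned elsewhere in this thread. The ticket was closed without adding such a retry, so this does not describe how Cassandra behaves.

```java
import java.net.SocketException;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.logging.Logger;

/** Hypothetical sketch; none of these names exist in Cassandra. */
final class StreamRetrySketch {
    private static final Logger LOG = Logger.getLogger("StreamRetrySketch");

    /**
     * Runs one streaming attempt and retries it when the socket is closed or
     * reset, up to maxRetries times. Other exceptions are not retried.
     */
    static <T> T withRetry(UUID sessionId, String peer, int maxRetries,
                           Callable<T> attempt) throws Exception {
        for (int tries = 0; ; tries++) {
            try {
                return attempt.call();
            } catch (SocketException e) {
                if (tries >= maxRetries) {
                    throw e; // retries exhausted: fail the stream session
                }
                // Log the retry with the session ID and node address.
                LOG.warning(String.format(
                        "[Stream #%s] socket error from %s (%s); retry %d of %d",
                        sessionId, peer, e.getMessage(), tries + 1, maxRetries));
            }
        }
    }

    /** Simulates two connection resets followed by success; returns the attempts made. */
    static int demo() {
        int[] calls = {0};
        try {
            withRetry(UUID.randomUUID(), "10.0.0.2", 3, () -> {
                if (++calls[0] < 3) {
                    throw new SocketException("Connection reset");
                }
                return "done";
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return calls[0];
    }
}
```

Only `SocketException` is retried in this sketch, so unrelated failures still surface immediately.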
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191015#comment-15191015 ]

Paulo Motta commented on CASSANDRA-8621:

The stalled stream issue that originated this ticket was likely caused by CASSANDRA-11286. With that fix in place and a properly configured network (i.e. a smaller keepalive interval), connections won't die unless there is a network partition. I therefore think this feature loses relevance, as it would add more state/complexity to the streaming protocol without clear benefits. I propose we close this as Later and re-evaluate if there are still broken connections after CASSANDRA-11286. WDYT [~yukim]?
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191010#comment-15191010 ]

Paulo Motta commented on CASSANDRA-8621:

You probably want to check your TCP keepalive settings: https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html
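As a rough illustration of what "TCP keepalive" means at the socket level, the Java sketch below enables SO_KEEPALIVE on a socket and reads the option back. The class and method names are hypothetical and this is not Cassandra code. Note that `Socket.setKeepAlive` only switches keepalive probes on; how soon and how often probes are sent is decided by operating system settings (on Linux, kernel parameters such as `net.ipv4.tcp_keepalive_time`), which is the kind of setting the linked troubleshooting page deals with.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

/** Hypothetical sketch; not part of Cassandra. */
final class KeepAliveSketch {
    /**
     * Creates an unconnected socket, enables SO_KEEPALIVE on it, and returns
     * the value read back from the socket.
     */
    static boolean keepAliveEnabledOnNewSocket() {
        try (Socket socket = new Socket()) {
            // Ask the OS to send keepalive probes on this connection once it
            // is idle; probe timing comes from OS-level settings, not from Java.
            socket.setKeepAlive(true);
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```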
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982842#comment-14982842 ]

Nicholas Gaugler commented on CASSANDRA-8621:

I constantly suffer from Broken Pipe issues. Although I've attempted to tweak the value of streaming_socket_timeout_in_ms to work around it, rebuilds still completely fail. Is this related?
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635393#comment-14635393 ]

Paulo Motta commented on CASSANDRA-8621:

I'd like to discuss/validate a possible solution before diving into implementation.

Upon receiving a SocketException during an established StreamSession, the reconnection initiator will:
# Mark its view of the StreamSession as isReconnecting;
# Stop/close both incoming and outgoing message handlers and their respective sockets;
#* Since closing the sockets might generate additional SocketExceptions, we may ignore/log them while isReconnecting is set to true.
# Create new incoming and outgoing message handlers and sockets.
# Send a StreamInitMessage to the session peer with the isReconnecting flag set to true.
# After the initialization is complete, the StreamSession.isReconnecting flag is set to false and onInitializationComplete() is called to resume the streaming protocol.
# In case of failure during the process, the initiator will retry establishing the connection up to the limit set by the max_streaming_retries property, and fail the stream session if it is not able to reconnect.

Upon receiving a StreamInitMessage with isReconnecting=true, the reconnection follower will:
# Fetch the StreamSession object for that session:
#* If StreamSession.isReconnecting is already set to true on the reconnection follower, both peers are trying to act as reconnection initiator, so we have a conflict. We can use the node identifier or IP as a universal tie-breaker. Only the peer with the lowest IP/ID will have its StreamInitMessage accepted by the other peer in case of a conflict. The other peer will have its init socket closed.
#* Otherwise, it will set its StreamSession.isReconnecting flag to true.
# Stop/close both incoming and outgoing message handlers and their respective sockets;
#* Since closing the sockets might generate additional SocketExceptions, we may ignore them while isReconnecting is set to true.
# Create new incoming and outgoing message handlers and sockets.
# Attach the outgoing socket to the new outgoing message handler.
# After the incoming socket is attached to the incoming message handler, the session is re-established and StreamSession.isReconnecting is set to false.
# The session is re-established and everybody is happy.

What do you think of this approach [~yukim]?

For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
---

Key: CASSANDRA-8621
URL: https://issues.apache.org/jira/browse/CASSANDRA-8621
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Jeremy Hanna
Assignee: Paulo Motta
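The conflict handling in the proposal above (both peers flag isReconnecting, and the lowest IP wins) can be sketched roughly as follows. This is an illustrative Java sketch under stated assumptions, not the actual StreamSession code: `ReconnectSketch`, `Decision`, `onReconnectInit`, `compare`, and `ip` are hypothetical names, and comparing raw address bytes as unsigned values is just one possible reading of "lowest IP".

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical model of one peer's view of a stream session. */
final class ReconnectSketch {
    enum Decision { ACCEPT_PEER_INIT, REJECT_PEER_INIT }

    private final InetAddress local;
    private boolean isReconnecting;

    ReconnectSketch(InetAddress local) {
        this.local = local;
    }

    /** Initiator side: a SocketException was seen on an established session. */
    void onSocketException() {
        isReconnecting = true;
    }

    /** Follower side: a StreamInitMessage with isReconnecting=true arrived from peer. */
    Decision onReconnectInit(InetAddress peer) {
        if (isReconnecting && compare(local, peer) < 0) {
            // Conflict: both sides are initiating. Our address is lower, so our
            // own init message wins and the peer's init socket is closed.
            return Decision.REJECT_PEER_INIT;
        }
        // No conflict, or the peer's address is lower: follow the peer.
        isReconnecting = true;
        return Decision.ACCEPT_PEER_INIT;
    }

    /** Called once the new handlers and sockets are attached. */
    void onInitializationComplete() {
        isReconnecting = false;
    }

    /** Orders addresses by length, then by unsigned byte value. */
    static int compare(InetAddress a, InetAddress b) {
        byte[] x = a.getAddress();
        byte[] y = b.getAddress();
        if (x.length != y.length) {
            return Integer.compare(x.length, y.length);
        }
        for (int i = 0; i < x.length; i++) {
            int c = Integer.compare(x[i] & 0xff, y[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    /** Convenience helper to build an IPv4 address without a DNS lookup. */
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this rule, when both peers detect the failure at the same time, exactly one init message is accepted: the higher-addressed peer accepts the lower one's, and the lower-addressed peer rejects the higher one's.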
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279768#comment-14279768 ]

Jonathan Shook commented on CASSANDRA-8621:

For the scenario that prompted this ticket, it appeared that the streaming process was completely stalled. One side of the stream (the sender side) had an exception that appeared to be a connection reset. The receiving side appeared to think that the connection was still active, at least in terms of the netstats reported by nodetool. We were unable to verify whether this was specifically the case in terms of connected sockets due to the fact that there were multiple streams for those peers, and there is no simple way to correlate a specific stream to a tcp session.

[~yukim] If there is a diagnostic method that we can use to provide more information about specific stalled streams, please let us know so that we can approach the user to get more data.

For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
---

Key: CASSANDRA-8621
URL: https://issues.apache.org/jira/browse/CASSANDRA-8621
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Jeremy Hanna
Assignee: Yuki Morishita
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279774#comment-14279774 ]

Jonathan Shook commented on CASSANDRA-8621:

As well, there were no TCP level errors showing for the receiving side. So it is unclear whether exceptions are being omitted, or whether there was something really strange occurring with the network.