[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2016-05-25 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300475#comment-15300475
 ] 

Paulo Motta commented on CASSANDRA-8621:


Closing this because the issue that originated this ticket was likely caused by 
CASSANDRA-11286 and stream sockets will no longer be idle after 
CASSANDRA-11841, so a closed/reset stream socket will generally mean the node 
is unreachable (see more details above).

> For streaming operations, when a socket is closed/reset, we should 
> retry/reinitiate that stream
> ---
>
> Key: CASSANDRA-8621
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8621
> Project: Cassandra
> Issue Type: Improvement
> Components: Streaming and Messaging
> Reporter: Jeremy Hanna
> Assignee: Paulo Motta
>
> Currently we have a setting (streaming_socket_timeout_in_ms) that will 
> time out and retry the stream operation when the TCP connection is idle for a 
> period of time.  However, when the socket is closed or reset, we 
> do not retry the operation.  This can happen for a number of reasons, 
> including a firewall sending a reset on a socket during a 
> streaming operation such as a nodetool rebuild (which necessarily runs 
> across DCs) or a repair.
> Doing a retry would make the streaming operations more resilient.  It would 
> be good to log the retry clearly as well (with the stream session ID and node 
> address).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2016-03-11 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191015#comment-15191015
 ] 

Paulo Motta commented on CASSANDRA-8621:


Given that the stalled stream issue that originated this ticket was likely 
caused by CASSANDRA-11286, and that with that fix in place and a properly 
configured network (i.e. a smaller keepalive interval) connections won't die 
unless there is a network partition, I think this feature loses relevance: it 
would add more state/complexity to the streaming protocol without clear 
benefits. So I propose we close this as Later and re-evaluate if there are 
still broken connections after CASSANDRA-11286. WDYT [~yukim]?



[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2016-03-11 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191010#comment-15191010
 ] 

Paulo Motta commented on CASSANDRA-8621:


You probably want to check your TCP keepalive settings: 
https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html
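
For readers unfamiliar with keepalive tuning, here is a minimal sketch of the 
per-socket form of these settings, in Python. The helper name enable_keepalive 
and the values (60 s idle, 10 s probe interval, 3 probes) are illustrative 
assumptions, not values taken from this ticket. The TCP_KEEP* timing options 
are not defined on every platform, so the sketch skips any that are missing.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Turn on TCP keepalive for sock and tune timing where the platform allows.

    The defaults are illustrative assumptions, not recommended values.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timing options; skip any that this platform does not define.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

The idea is that probes sent more often than the firewall's idle timeout keep 
an otherwise quiet stream connection from being dropped.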



[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2015-10-30 Thread Nicholas Gaugler (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982842#comment-14982842
 ] 

Nicholas Gaugler commented on CASSANDRA-8621:
-

I constantly suffer from Broken Pipe issues.  Although I've attempted to tweak 
the value of streaming_socket_timeout_in_ms to work around it, rebuilds still 
completely fail.  Is this related?



[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2015-07-21 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635393#comment-14635393
 ] 

Paulo Motta commented on CASSANDRA-8621:


I'd like to discuss/validate a possible solution before diving into 
implementation.

Upon receiving a SocketException during an established StreamSession, the 
reconnection initiator will:
# Mark its view of the StreamSession as isReconnecting.
# Stop/close both incoming and outgoing message handlers and their respective 
sockets.
#* Since closing the sockets might generate additional SocketExceptions, we 
may ignore/log them while isReconnecting is set to true.
# Create new incoming and outgoing message handlers and sockets.
# Send a StreamInitMessage to the session peer with the isReconnecting flag set 
to true.
# After the initialization is complete, set the StreamSession.isReconnecting 
flag to false and call onInitializationComplete() to resume the streaming 
protocol.
# In case of failure during the process, retry establishing the connection up 
to the number of times given by the max_streaming_retries property, and fail 
the stream session if it is still unable to reconnect.
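
The initiator steps above can be sketched roughly as follows, in Python rather 
than Cassandra's Java. This is only an illustration of the proposed control 
flow under stated assumptions: the StreamSession class here is a hypothetical 
stand-in, open_connection is a hypothetical callback covering steps 2 to 5, 
and max_streaming_retries is treated as the maximum number of attempts.

```python
class StreamSession:
    """Hypothetical stand-in for the initiator's view of a stream session."""

    def __init__(self, max_streaming_retries):
        self.max_streaming_retries = max_streaming_retries
        self.is_reconnecting = False
        self.failed = False


def reconnect(session, open_connection):
    """Try to re-establish the session, at most max_streaming_retries times.

    Returns the attempt number that succeeded, or None if the session failed.
    """
    session.is_reconnecting = True
    for attempt in range(1, session.max_streaming_retries + 1):
        try:
            # Stands in for: close old handlers/sockets, create new ones,
            # send StreamInitMessage(isReconnecting=true), finish init.
            open_connection(is_reconnecting=True)
        except OSError:
            # Socket errors raised while reconnecting are ignored here;
            # a real implementation would log them before retrying.
            continue
        session.is_reconnecting = False
        return attempt
    session.failed = True
    return None


# Demo: a connector that is reset twice, then succeeds on the third attempt.
calls = []


def flaky_connect(is_reconnecting):
    calls.append(is_reconnecting)
    if len(calls) < 3:
        raise ConnectionResetError("simulated reset")


session = StreamSession(max_streaming_retries=5)
succeeded_on = reconnect(session, flaky_connect)
```

In the demo, succeeded_on is 3 and the session ends with is_reconnecting 
cleared; with a connector that always fails, reconnect returns None and marks 
the session as failed.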

Upon receiving a StreamInitMessage with isReconnecting=true, the reconnection 
follower will:
# Fetch the StreamSession object for that session:
#* If StreamSession.isReconnecting is already set to true on the reconnection 
follower, that peer is also trying to act as a reconnection initiator, so we 
have a conflict. We can use the node identifier or IP as a universal 
tie-breaker. Only the peer with the lowest IP/ID will have its 
StreamInitMessage accepted by the other peer in case of a conflict. The other 
peer will have its init socket closed.
#* Otherwise, it will set its StreamSession.isReconnecting flag to true.
# Stop/close both incoming and outgoing message handlers and their respective 
sockets.
#* Since closing the sockets might generate additional SocketExceptions, we 
may ignore them while isReconnecting is set to true.
# Create new incoming and outgoing message handlers and sockets.
# Attach the outgoing socket to the new outgoing message handler.
# After the incoming socket is attached to the incoming message handler, set 
StreamSession.isReconnecting to false.
# The session is re-established and everybody is happy.
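
The tie-breaker in step 1 can be illustrated with a short Python sketch. The 
function names and the "accept"/"reject" return values are hypothetical, and 
the sketch assumes both peers are identified by IP addresses of the same 
family. Addresses are compared numerically, because a plain string comparison 
would order "10.0.0.10" before "10.0.0.9".

```python
import ipaddress


def wins_reconnect_conflict(local_ip, peer_ip):
    """True if the local node has the lower IP and therefore wins the conflict."""
    return ipaddress.ip_address(local_ip) < ipaddress.ip_address(peer_ip)


def on_stream_init(local_ip, peer_ip, local_is_reconnecting):
    """Follower decision on receiving StreamInitMessage(isReconnecting=true)."""
    if local_is_reconnecting and wins_reconnect_conflict(local_ip, peer_ip):
        # Both sides initiated and we hold the lower IP: our own init message
        # wins, so the peer's init socket gets closed.
        return "reject"
    # No conflict, or the peer holds the lower IP: act as the follower.
    return "accept"


lower_side = on_stream_init("10.0.0.9", "10.0.0.10", local_is_reconnecting=True)
higher_side = on_stream_init("10.0.0.10", "10.0.0.9", local_is_reconnecting=True)
no_conflict = on_stream_init("10.0.0.9", "10.0.0.10", local_is_reconnecting=False)
```

Because both peers apply the same ordering, exactly one of the two competing 
StreamInitMessages is accepted: the node at 10.0.0.9 rejects its peer's 
message, while the node at 10.0.0.10 accepts.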

What do you think of this approach [~yukim]?

> For streaming operations, when a socket is closed/reset, we should 
> retry/reinitiate that stream
> ---
>
> Key: CASSANDRA-8621
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8621
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jeremy Hanna
> Assignee: Paulo Motta



[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2015-01-15 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279768#comment-14279768
 ] 

Jonathan Shook commented on CASSANDRA-8621:
---

For the scenario that prompted this ticket, the streaming process appeared to 
be completely stalled. One side of the stream (the sender) had an exception 
that appeared to be a connection reset. The receiving side appeared to think 
that the connection was still active, at least according to the netstats 
reported by nodetool. We were unable to verify this at the level of connected 
sockets because there were multiple streams between those peers, and there is 
no simple way to correlate a specific stream with a TCP session.

[~yukim]
If there is a diagnostic method that we can use to provide more information 
about specific stalled streams, please let us know so that we can approach the 
user to get more data.


> For streaming operations, when a socket is closed/reset, we should 
> retry/reinitiate that stream
> ---
>
> Key: CASSANDRA-8621
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8621
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jeremy Hanna
> Assignee: Yuki Morishita



[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream

2015-01-15 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279774#comment-14279774
 ] 

Jonathan Shook commented on CASSANDRA-8621:
---

Also, no TCP-level errors showed up on the receiving side, so it is unclear 
whether exceptions are being omitted or whether something really strange was 
occurring with the network.
