[ 
https://issues.apache.org/jira/browse/CASSANDRA-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Meichle updated CASSANDRA-13984:
-------------------------------------
    Description: 
Hello. For the last two months we've been fighting performance issues related 
to node restarts and hint playback, and last night, while debugging one of 
these restarts, we were able to gather solid evidence for the underlying issue.

The main issue we've been fighting on these clusters is very slow and unstable 
node restarts, which seem to be caused by hint playback, with logs indicating 
"Finished hinted handoff of file ... hints to endpoint: 1-2-3-4, partially". 
We've also posted about this on the mailing list, and this bug appears to be 
the cause of the behaviour described there: 
https://lists.apache.org/thread.html/41e9cb7b20696dd177fe925aba30d372614fcfe11d98275c7d9579cc@%3Cuser.cassandra.apache.org%3E

Our process for restarting nodes is to run nodetool drain and then restart the 
service via init. Watching the logs on the restarting node, we saw the standard 
Cassandra startup sequence of initializing keyspaces, loading sstables, and 
finally handshaking with the cluster, which takes about 5 minutes. After this, 
the logs are very quiet for roughly the next 15 minutes.

During this 15 minute period, some peers (anywhere from none up to half of the 
cluster) are reported with DN status by nodetool status.

When checking one of the nodes reported as DN, we see hint playback log lines 
such as:

{code}
INFO [HintsDispatcher:12] 2017-10-31 01:06:18,876 
HintsDispatchExecutor.java:289 - Finished hinted handoff of file 
8724f664-dff1-4c20-887b-6a26ae54a9b5-1509410768866-1.hints to endpoint 
/20.0.131.175: 8724f664-dff1-4c20-887b-6a26ae54a9b5, partially
{code}

We traced the code path from this log line, starting in HintsDispatcher. It 
appears the callback created for hint sending is timing out, which we verified 
via the JMX metrics for the HintsService:

{code}
org.apache.cassandra.metrics:name=HintsFailed,type=HintsService
org.apache.cassandra.metrics:name=HintsSucceeded,type=HintsService
org.apache.cassandra.metrics:name=HintsTimedOut,type=HintsService
{code}

Of these, HintsTimedOut was incrementing during this period. We suspected this 
might be due to the restarting node being overloaded, as we do see heavy IO on 
restart. However, that IO does not last for the full duration of the "partial" 
hint playback period. We also generated a flame graph (using the jvm-tools 
stcap and ssa commands) of the restarting node during this window, attached as 
3.11_node_restart_without_disablegossip.svg. It seems to confirm that the node 
isn't doing much during the idle period, as its threads are mostly parked.
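
For anyone who wants to watch these counters while a node restarts, here is a 
minimal JMX sketch (assuming an unauthenticated local JMX endpoint on the 
default port 7199, and that the metrics JMX reporter exposes these meters with 
a Count attribute; the class name is just illustrative):

{code}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HintsMetricsProbe
{
    public static void main(String[] args) throws Exception
    {
        // Cassandra's default JMX port is 7199; adjust host/port/credentials as needed.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            for (String name : new String[] { "HintsSucceeded", "HintsFailed", "HintsTimedOut" })
            {
                ObjectName mbean = new ObjectName(
                    "org.apache.cassandra.metrics:name=" + name + ",type=HintsService");
                // The metrics JMX reporter exposes a Count attribute on these meters.
                System.out.println(name + " = " + mbs.getAttribute(mbean, "Count"));
            }
        }
    }
}
{code}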

What is interesting is the debug log on one of the nodes that is marked as DN 
and is partially playing back hints. While the log lines for hint playback 
themselves are not helpful, what led us to this apparent bug are the debug logs 
for OutboundTcpConnection. Below is a pruned snippet of these logs:

{code}
 [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - 
Enqueuing socket close for /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:00,586 
OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,586 
OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,598 
OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
 [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 
OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,601 
OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,795 
OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,797 
OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,899 
OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:48,002 
OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
    ... repeats until the node restarts gossip transport ...
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,703 
OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,806 
OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,545 
OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
 [HANDSHAKE-/20.0.131.175] 2017-10-31 00:50:46,547 
OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,600 
OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,376 
OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,377 
OutboundTcpConnection.java:349 - Error writing to /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,382 
OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
 [HANDSHAKE-/20.0.131.175] 2017-10-31 01:06:25,383 
OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
 [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,385 
OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
{code}

The timeline of events here is:

{code}
T 00:00 - restarting node closes the small + gossip sockets to the hints host 
and restarts
T 00:00 - hints host immediately reattempts and handshakes the small connection
(node restarted)
T 00:47 - T 04:46 - hints host repeatedly attempts outbound gossip connections; 
during this range hints are still playing back "partially"
T 21:27 - hints host logs the previously reattempted small connection as 
errored (timeout)
T 21:27 - hints host retries the connection
T 21:27 - hints work, things are done
{code}

An unpruned debug log snippet from the restarting node around the time of the 
reconnection is:

{code}
debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 
StorageService.java:1439 - DRAINING: starting drain process
debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 
HintsService.java:220 - Paused hints dispatch
debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,578 
Server.java:176 - Stop listening for CQL clients
debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,581 
Gossiper.java:1530 - Announcing shutdown
debug.log:DEBUG [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 
StorageService.java:2252 - Node /20.0.131.175 state shutdown, token [ .... 
token range .... ]
debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 
StorageService.java:2255 - Node /20.0.131.175 state jump to shutdown
{code}

and on the hint-playing host:

{code}
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 
FailureDetector.java:457 - Ignoring interval time of 2408381657 for /20.0.132.6
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 
FailureDetector.java:457 - Ignoring interval time of 3000155164 for 
/20.0.131.252
debug.log.8.zip:INFO  [GossipStage:1] 2017-10-31 00:46:00,586 
Gossiper.java:1044 - InetAddress /20.0.131.175 is now DOWN
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Gossip] 
2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to 
/20.0.131.175 closed
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to 
/20.0.131.175 closed
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
FailureDetector.java:323 - Forcing conviction of /20.0.131.175
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
2017-10-31 00:46:00,598 OutboundTcpConnection.java:424 - Attempting to connect 
to /20.0.131.175
debug.log.8.zip:INFO  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 
OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
2017-10-31 00:46:00,601 OutboundTcpConnection.java:532 - Done connecting to 
/20.0.131.175
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
FailureDetector.java:457 - Ignoring interval time of 2001490285 for /20.0.131.6
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
FailureDetector.java:457 - Ignoring interval time of 2001552636 for /20.0.132.8
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
FailureDetector.java:457 - Ignoring interval time of 2001576461 for /20.0.131.8
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
FailureDetector.java:457 - Ignoring interval time of 2001599351 for /20.0.130.18
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
FailureDetector.java:457 - Ignoring interval time of 2930107432 for 
/20.0.131.148
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
FailureDetector.java:457 - Ignoring interval time of 2001646958 for 
/20.0.130.154
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
FailureDetector.java:457 - Ignoring interval time of 2001665876 for /20.0.132.26
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
FailureDetector.java:457 - Ignoring interval time of 2001689764 for /20.0.130.31
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
FailureDetector.java:457 - Ignoring interval time of 2001624719 for /20.0.130.42
{code}

Lastly, we generated a flame graph with the disablegossip step included, 
attached as 3.11_node_restart_with_disablegossip.svg.

Based on this, it would appear that nodetool drain correctly gossips to its 
peers that the node is shutting down, and the peers tear down their own gossip 
and small connections. The bug appears to be that peers almost instantly 
reconnect the small connection, which then lingers in a dead state until a 20 
minute timeout kills it. The OutboundTcpConnection class does not seem to 
detect this condition, and OutboundTcpConnectionPool reuses this dead 
connection for hint playback, which causes the "partially" log lines and the 
JMX timeout metrics for the hints service.
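
To make the suspected failure mode concrete, here is a purely illustrative 
sketch (our own, not Cassandra's actual OutboundTcpConnection / 
OutboundTcpConnectionPool code) of the kind of staleness check that would stop 
a pool from handing back a small connection established before the endpoint 
was last seen going down:

{code}
// Illustrative only, not Cassandra internals: a pooled connection created
// before the peer's last DOWN/shutdown event is treated as dead and replaced
// instead of being reused for hint dispatch.
final class PooledConnectionSketch
{
    final long establishedAtMillis;
    private volatile boolean closed;

    PooledConnectionSketch(long establishedAtMillis)
    {
        this.establishedAtMillis = establishedAtMillis;
    }

    boolean isClosed()
    {
        return closed;
    }

    void close()
    {
        closed = true;
    }
}

final class ConnectionPoolSketch
{
    private PooledConnectionSketch small;
    // Updated whenever gossip reports the endpoint DOWN (or shutting down).
    private volatile long lastEndpointDownMillis;

    synchronized PooledConnectionSketch smallConnection()
    {
        // Reuse only if the connection is open AND newer than the last DOWN
        // event for this endpoint; otherwise discard it and reconnect.
        if (small == null || small.isClosed()
            || small.establishedAtMillis <= lastEndpointDownMillis)
        {
            if (small != null)
                small.close();
            small = new PooledConnectionSketch(System.currentTimeMillis());
        }
        return small;
    }

    void onEndpointDown()
    {
        lastEndpointDownMillis = System.currentTimeMillis();
    }
}
{code}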

We have worked around this issue by issuing nodetool disablegossip after 
nodetool drain to make sure these lingering connections are killed, and we have 
not seen any issues in the 8 node restarts we've done since that process 
change. With this extra step the small connection is torn down properly, so it 
reconnects cleanly once the node restarts and hints play back correctly.
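
For completeness, the same workaround sequence can also be driven over JMX; as 
far as we can tell, nodetool drain and nodetool disablegossip invoke the drain 
and stopGossiping operations on the StorageService MBean. A minimal sketch, 
again assuming an unauthenticated local JMX endpoint on the default port 7199 
(the class name is just illustrative):

{code}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DrainThenDisableGossip
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");

            // Equivalent of `nodetool drain`: flush memtables and shut down
            // client/messaging transports.
            mbs.invoke(ss, "drain", new Object[0], new String[0]);

            // Equivalent of `nodetool disablegossip`: stop gossip on this node;
            // in our case this is what got peers to properly kill their small connections.
            mbs.invoke(ss, "stopGossiping", new Object[0], new String[0]);
        }
        // The service is then restarted via init as before.
    }
}
{code}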

I have also not been able to reproduce this issue on a smaller non-production 
cluster or locally via CCM. It appears to be a two-fold issue:

- On the peer nodes there seems to be some kind of race condition between the 
handling of the gossip message indicating that the restarting node is shutting 
down (resulting from nodetool drain) and whichever thread is using the small 
TCP connection.
- On the restarting node, a new small connection is accepted while nodetool 
drain is still in progress.

If there is anything else I should provide here, please let me know. 

We've been discussing this with [~rustyrazorblade] as well.

> dead Small OutboundTcpConnection to restarted nodes blocking hint delivery
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13984
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13984
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Lifecycle
>          Environment: - We run an entirely counter-based workload on the 
> clusters we've been seeing this issue on (although it might happen for 
> non-counter clusters as well).
> - We originally had a 48 node Cassandra 3.7 cluster (averaging ~500GB of load 
> on each node) running on c4.4xlarge instances in EC2.
> - We've since split this cluster into two Cassandra 3.11.0 clusters (for 
> other performance reasons), one with 1/3 of our dataset on 48 c4.4xlarge 
> instances, and the other with 2/3 of our dataset on a 96 node c4.4xlarge 
> cluster.
> - All of these clusters run on Ubuntu 14.04 with enhanced networking enabled.
>            Reporter: John Meichle
>         Attachments: 3.11_node_restart_with_disablegossip.svg, 
> 3.11_node_restart_without_disablegossip.svg
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
