[ https://issues.apache.org/jira/browse/CASSANDRA-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504266#comment-13504266 ]
Edward Capriolo commented on CASSANDRA-4740: -------------------------------------------- I have set the Xss to its default which is 256. The cost of trying to micro optimize this setting is outweighed by the trouble it causes imho. You saving a couple k a socket but give yourself many hard to debug problems. > Phantom TCP connections, failing hinted handoff > ----------------------------------------------- > > Key: CASSANDRA-4740 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4740 > Project: Cassandra > Issue Type: Bug > Affects Versions: 1.1.2 > Environment: Linux 3.4.9, java 1.6.0_35-b10 > Reporter: Mina Naguib > Priority: Minor > Labels: connection, handoff, hinted, orphan, phantom, tcp, zombie > Attachments: write_latency.png > > > IP addresses in report anonymized: > Had a server running cassandra (1.1.1.10) reboot ungracefully. Reboot and > startup was successful and uneventful. cassandra went back into service ok. > From that point onwards however, several (but not all) machines in the > cassandra cluster started having difficulty with hinted handoff to that > machine. This was despite nodetool ring showing Up across the board. > Here's an example of an attempt, every 10 minutes, by a node (1.1.1.11) to > replay hints to the node that was rebooted: > {code} > INFO [HintedHandoff:1] 2012-10-01 11:07:23,293 HintedHandOffManager.java > (line 294) Started hinted handoff for token: > 122879743610338889583996386017027409691 with IP: /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java > (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries > INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java > (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:17:23,312 HintedHandOffManager.java > (line 294) Started hinted handoff for token: > 122879743610338889583996386017027409691 with IP: /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java > (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries > INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java > (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:27:23,335 HintedHandOffManager.java > (line 294) Started hinted handoff for token: > 122879743610338889583996386017027409691 with IP: /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java > (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries > INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java > (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:37:23,357 HintedHandOffManager.java > (line 294) Started hinted handoff for token: > 122879743610338889583996386017027409691 with IP: /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:37:33,358 HintedHandOffManager.java > (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries > INFO [HintedHandoff:1] 2012-10-01 11:37:33,359 HintedHandOffManager.java > (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:47:23,412 HintedHandOffManager.java > (line 294) Started hinted handoff for token: > 122879743610338889583996386017027409691 with IP: /1.1.1.10 > INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java > (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries > INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java > (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 > {code} > I started poking around, and discovered that several nodes held ESTABLISHED > TCP connections that didn't have a live endpoint on the rebooted node. My > guess is they were live prior to the reboot, and after the reboot the nodes > still see them as live and unsuccessfully try to use them. > Example, on the node that was rebooted: > {code} > .10 ~ # netstat -tn | grep 1.1.1.11 > tcp 0 0 1.1.1.10:7000 1.1.1.11:40960 ESTABLISHED > tcp 0 0 1.1.1.10:34370 1.1.1.11:7000 ESTABLISHED > tcp 0 0 1.1.1.10:45518 1.1.1.11:7000 ESTABLISHED > {code} > While on the node that's failing to hint to it: > {code} > .11 ~ # netstat -tn | grep 1.1.1.10 > tcp 0 0 1.1.1.11:7000 1.1.1.10:34370 ESTABLISHED > tcp 0 0 1.1.1.11:7000 1.1.1.10:45518 ESTABLISHED > tcp 0 0 1.1.1.11:7000 1.1.1.10:53316 ESTABLISHED > tcp 0 0 1.1.1.11:7000 1.1.1.10:43239 ESTABLISHED > tcp 0 0 1.1.1.11:40960 1.1.1.10:7000 ESTABLISHED > {code} > Notice the phantom connections on :53316 and :43239 which do not appear on > the remote 1.1.1.10 > On .11 I tried disabling and enabling gossip, but that did not restart :7000 > nor clean up the 2 phantom connections. For good measure I also tried > disabling and enabling thrift (long shot), and that didn't help either. > The only thing that helped was to actually stop and start cassandra, in a > rolling fashion, on each node that was having trouble hinting to the machine > that was rebooted. The phantom connections naturally disappeared, write > volume on 1.1.1.10 rose for a while, and all the hints were sent successfully. > I'm unsure whether the phantom TCP connections are a cause, or just loosely > correlated, to the hinted handoff failure every 10 minutes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira