[ 
https://issues.apache.org/jira/browse/CASSANDRA-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476138#comment-13476138
 ] 

Mina Naguib commented on CASSANDRA-4740:
----------------------------------------

Unfortunately logging level TRACE is too verbose.  On this node it produces 
1.7MB of logs per second.  I switched it back to INFO.

I discovered however that you don't need to restart cassandra for log level 
changes in log4j-server.properties to take effect.  Next time I see the problem 
I'll switch to TRACE and post back anything interesting I find.
                
> Phantom TCP connections, failing hinted handoff
> -----------------------------------------------
>
>                 Key: CASSANDRA-4740
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4740
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.2
>         Environment: Linux 3.4.9, java 1.6.0_35-b10
>            Reporter: Mina Naguib
>            Priority: Minor
>              Labels: connection, handoff, hinted, orphan, phantom, tcp, zombie
>         Attachments: write_latency.png
>
>
> IP addresses in report anonymized:
> Had a server running cassandra (1.1.1.10) reboot ungracefully.  Reboot and 
> startup was successful and uneventful.  cassandra went back into service ok.
> From that point onwards however, several (but not all) machines in the 
> cassandra cluster started having difficulty with hinted handoff to that 
> machine.  This was despite nodetool ring showing Up across the board.
> Here's an example of an attempt, every 10 minutes, by a node (1.1.1.11) to 
> replay hints to the node that was rebooted:
> {code}
> INFO [HintedHandoff:1] 2012-10-01 11:07:23,293 HintedHandOffManager.java 
> (line 294) Started hinted handoff for token: 
> 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java 
> (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java 
> (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:17:23,312 HintedHandOffManager.java 
> (line 294) Started hinted handoff for token: 
> 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java 
> (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java 
> (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:27:23,335 HintedHandOffManager.java 
> (line 294) Started hinted handoff for token: 
> 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java 
> (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java 
> (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:37:23,357 HintedHandOffManager.java 
> (line 294) Started hinted handoff for token: 
> 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:37:33,358 HintedHandOffManager.java 
> (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:37:33,359 HintedHandOffManager.java 
> (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:47:23,412 HintedHandOffManager.java 
> (line 294) Started hinted handoff for token: 
> 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java 
> (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java 
> (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> {code}
> I started poking around, and discovered that several nodes held ESTABLISHED 
> TCP connections that didn't have a live endpoint on the rebooted node.  My 
> guess is they were live prior to the reboot, and after the reboot the nodes 
> still see them as live and unsuccessfully try to use them.
> Example, on the node that was rebooted:
> {code}
> .10 ~ # netstat -tn | grep 1.1.1.11
> tcp        0      0 1.1.1.10:7000        1.1.1.11:40960        ESTABLISHED
> tcp        0      0 1.1.1.10:34370       1.1.1.11:7000         ESTABLISHED
> tcp        0      0 1.1.1.10:45518       1.1.1.11:7000         ESTABLISHED
> {code}
> While on the node that's failing to hint to it:
> {code}
> .11 ~ # netstat -tn | grep 1.1.1.10
> tcp        0      0 1.1.1.11:7000         1.1.1.10:34370       ESTABLISHED
> tcp        0      0 1.1.1.11:7000         1.1.1.10:45518       ESTABLISHED
> tcp        0      0 1.1.1.11:7000         1.1.1.10:53316       ESTABLISHED
> tcp        0      0 1.1.1.11:7000         1.1.1.10:43239       ESTABLISHED
> tcp        0      0 1.1.1.11:40960        1.1.1.10:7000        ESTABLISHED
> {code}
> Notice the phantom connections on :53316 and :43239 which do not appear on 
> the remote 1.1.1.10
> On .11 I tried disabling and enabling gossip, but that did not restart :7000 
> nor clean up the 2 phantom connections.  For good measure I also tried 
> disabling and enabling thrift (long shot), and that didn't help either.
> The only thing that helped was to actually stop and start cassandra, in a 
> rolling fashion, on each node that was having trouble hinting to the machine 
> that was rebooted.  The phantom connections naturally disappeared, write 
> volume on 1.1.1.10 rose for a while, and all the hints were sent successfully.
> I'm unsure whether the phantom TCP connections are a cause, or just loosely 
> correlated, to the hinted handoff failure every 10 minutes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to