[ https://issues.apache.org/jira/browse/CASSANDRA-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505013#comment-14505013 ]

Brandon Williams commented on CASSANDRA-9218:
---------------------------------------------

bq. I have also observed while this was happening that other nodes were trying to establish connections (SYN packets sent) but the trouble node (A) was not picking up the line (no accept()).

If you mean a TCP SYN (vs a gossip SYN) then that is pretty strange, since it would seem to indicate a network problem. Do you have netstat output from both sides when this happens?

> Node thinks other nodes are down after heavy GC
> -----------------------------------------------
>
>                 Key: CASSANDRA-9218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9218
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Erik Forsberg
>
> I have a few troublesome nodes which often end up doing very long GC pauses.
> The root cause of this is yet to be found, but it's causing another problem -
> the affected node(s) mark other nodes as down, and they never recover.
> Here's how it goes:
> 1. Node goes into troublesome mode, doing heavy GC with long (10+ seconds) GC
> pauses.
> 2. While this happens, node will mark other nodes as down.
> 3. Once the overload situation resolves, the node still thinks the other
> nodes are down (they are not). It's also quite common that other nodes think
> the affected node is down.
> So we often end up with node A thinking there's some 30 nodes down, then a
> bunch of other nodes believing node A is down. This in a cluster with 56
> nodes.
> The only way to get out of the situation is to restart node A, and sometimes
> a few other nodes. And while node A is in this state, any queries that use
> node A as coordinator have a high risk of getting errors about not enough
> replicas being available.
> I have enabled TRACE level gossip debugging while this happens, and on node
> A, there will be multiple messages about, "has already a pending echo,
> skipping it" - i.e. the debug line in Gossiper.java line 882.
> I have also observed while this was happening that other nodes were trying to
> establish connections (SYN packets sent) but the trouble node (A) was not
> picking up the line (no accept()).
> I don't know exactly how Gossiper works here, but it looks like node A is
> sending out some gossiper echo messages, but then is too busy to get the
> replies, and never retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)