To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems to stay that way for a very long time (hours). I'm not even sure it will recover without a restart.
* I've tried to stop and then start gossip with nodetool on the node that thinks several other nodes are down. It did not help.
* nodetool gossipinfo, when run on an affected node, claims STATUS:NORMAL for all nodes (including the ones marked as down in the status output); a JMX cross-check of the failure detector is sketched below.
* It is quite possible that the problem starts at the time of day when we have a lot of bulkloading going on. But why does it then persist for several hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12 about a month ago, but I have no hard data to back that up.
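For comparison, the failure detector's own view can be read over JMX, independently of gossipinfo. A rough sketch, assuming jmxterm is available (adjust the jar name/path), JMX is on the default port 7199, and this 2.0.x build exposes SimpleStates and PhiConvictThreshold on the FailureDetector MBean:

# what the failure detector thinks about each peer (UP/DOWN)
echo "get -b org.apache.cassandra.net:type=FailureDetector SimpleStates" | java -jar jmxterm.jar -l localhost:7199 -n
# current phi conviction threshold on this node
echo "get -b org.apache.cassandra.net:type=FailureDetector PhiConvictThreshold" | java -jar jmxterm.jar -l localhost:7199 -n

If SimpleStates reports DOWN for peers that gossipinfo shows as NORMAL, the disagreement sits in the failure detector (phi) rather than in the gossip state itself.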
Regarding region/snitch: this is not an AWS deployment; we run in our own datacenter with GossipingPropertyFileSnitch.

Right now I have this situation on one node (04-05), which thinks that 4 nodes are down. The rest of the cluster (56 nodes in total) thinks all nodes are up. Load on the cluster right now is minimal and there is no GC going on. Heap usage is approximately 3.5/6 GB.

root@cssa04-05:~# nodetool status | grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB    256  1.8%  114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB    256  1.8%  b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256  1.6%  4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB    256  1.8%  95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3

Excerpt from nodetool gossipinfo showing one node that status thinks is down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made 04-05 mark it as up, with no success.

Please let me know what other debug information I can provide.

Regards,
\EF

On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle <daeme...@gmail.com> wrote:

> Do you happen to be using a tool like Nagios or Ganglia that is able to
> report utilization (CPU, load, disk I/O, network)? There are plugins for
> both that will also notify you (depending on whether you enabled the
> intermediate GC logging) about what is happening.
>
> On Thu, Apr 2, 2015 at 8:35 AM, Jan <cne...@yahoo.com> wrote:
>
>> Marcin;
>>
>> Are all your nodes within the same region?
>> If not in the same region, what is the snitch type that you are using?
>>
>> Jan/
>>
>> On Thursday, April 2, 2015 3:28 AM, Michal Michalski <
>> michal.michal...@boxever.com> wrote:
>>
>> Hey Marcin,
>>
>> Are they actually going up and down repeatedly (flapping), or are they
>> just down and never come back?
>> There might be different reasons for flapping nodes, but to list what I
>> have at the top of my head right now:
>>
>> 1. Network issues. I don't think it's your case, but you can read about
>> the issues some people are having when deploying C* on AWS EC2 (keyword
>> to look for: phi_convict_threshold).
>>
>> 2. Heavy load. The node is under heavy load because of a massive number
>> of reads / writes / bulkloads or e.g. unthrottled compaction etc., which
>> may result in extensive GC.
>>
>> Could any of these be a problem in your case? I'd start by investigating
>> GC logs, e.g. to see how long the "stop the world" full GC takes
>> (GC logs should be on by default, from what I can see [1]).
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-5319
>>
>> Michał
>>
>> Kind regards,
>> Michał Michalski,
>> michal.michal...@boxever.com
>>
>> On 2 April 2015 at 11:05, Marcin Pietraszek <mpietras...@opera.com>
>> wrote:
>>
>> Hi!
>>
>> We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
>> installed. Assume we have nodes A, B, C, D, E. On an irregular basis,
>> one of those nodes starts to report that a subset of the other nodes is
>> in DN state, although the C* daemon on all nodes is running:
>>
>> A$ nodetool status
>> UN B
>> DN C
>> DN D
>> UN E
>>
>> B$ nodetool status
>> UN A
>> UN C
>> UN D
>> UN E
>>
>> C$ nodetool status
>> DN A
>> UN B
>> UN D
>> UN E
>>
>> After a restart of node A, C and D report that A is in UN, and A also
>> claims that the whole cluster is in UN state. Right now I don't have any
>> clear steps to reproduce that situation. Do you guys have any idea what
>> could be causing such behaviour? How could this be prevented?
>>
>> It seems like when node A is the coordinator and gets a request for some
>> data replicated on C and D, it responds with an Unavailable exception;
>> after restarting A that problem disappears.
>>
>> --
>> mp
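To follow up on the GC angle above: a quick way to scan for long stop-the-world pauses is to grep the GC log. A rough sketch, assuming GC logging is enabled via cassandra-env.sh (as [1] suggests it should be) and writes somewhere like /var/log/cassandra/gc*.log; adjust the path to your setup:

# safepoint pauses of one second or more (needs -XX:+PrintGCApplicationStoppedTime)
grep 'Total time for which application threads were stopped' /var/log/cassandra/gc*.log | grep -E 'stopped: [1-9]'

# how many full collections were logged
grep -c 'Full GC' /var/log/cassandra/gc*.log

Repeated multi-second pauses around the bulkload window would support the heavy-load theory; their absence would point back at the failure detector / gossip side.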