Is this using GPFS (GossipingPropertyFileSnitch)? If so, can you open a JIRA? It looks like GPFS may not be persisting the rack/DC info into system.peers, so the DC is lost on restart. This is somewhat understandable, but it definitely deserves a JIRA.
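One quick way to check (a sketch, assuming cqlsh is available on one of the restarted nodes): look at what the node has persisted about its peers. The data_center/rack values there come from the snitch, so a missing or stale value would support the theory above:

    cqlsh -e "SELECT peer, data_center, rack FROM system.peers;"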
On Thu, Mar 14, 2019 at 11:44 PM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:

> Hi Fd,
>
> I tried this on a 3-node cluster. I killed node2; both node1 and node3
> reported node2 as DN. Then I killed node1 and node3, restarted them, and
> node2 was reported like this:
>
> [root@spark-master-1 /]# nodetool status
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
> DN  172.19.0.8  ?           256     64.0%             bd75a5e2-2890-44c5-8f7a-fca1b4ce94ab  r1
>
> Datacenter: dc1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  172.19.0.5  382.75 KiB  256     64.4%             2a062140-2428-4092-b48b-7495d083d7f9  rack1
> UN  172.19.0.9  171.41 KiB  256     71.6%             9590b791-ad53-4b5a-b4c7-b00408ed02dd  rack3
>
> Prior to killing node1 and node3, node2 was indeed marked as DN, but it
> was listed under "Datacenter: dc1" together with node1 and node3.
>
> After killing both node1 and node3 (so the cluster was completely down)
> and restarting them, node2 was reported as above.
>
> I do not know what the difference is here. Is gossip data stored
> somewhere on disk? I would say so; otherwise there is no way node1 /
> node3 could report node2 as down. But at the same time I don't get why
> it is "out of the list" where node1 and node3 are.
>
> On Fri, 15 Mar 2019 at 02:42, Fd Habash <fmhab...@gmail.com> wrote:
>
>> I can conclusively say none of these commands were run. However, I
>> think this is the likely scenario ...
>>
>> If you have a cluster of three nodes 1, 2, 3 ...
>>
>> - If 3 shows as DN
>> - Restart C* on 1 & 2
>> - nodetool status should NOT show node 3's IP at all
>>
>> Restarting the cluster while a node is down resets gossip state.
>>
>> There is a good chance this is what happened.
>>
>> Plausible?
>>
>> ----------------
>> Thank you
>>
>> From: Jeff Jirsa <jji...@gmail.com>
>> Sent: Thursday, March 14, 2019 11:06 AM
>> To: cassandra <user@cassandra.apache.org>
>> Subject: Re: Cannot replace_address /10.xx.xx.xx because it doesn't
>> exist in gossip
>>
>> Two things that wouldn't be a bug:
>>
>> You could have run removenode.
>> You could have run assassinate.
>>
>> It could also be some new bug, but that's much less likely.
>>
>> On Thu, Mar 14, 2019 at 2:50 PM Fd Habash <fmhab...@gmail.com> wrote:
>>
>> I have a node which I know for certain was a cluster member last week.
>> It showed in nodetool status as DN. When I attempted to replace it
>> today, I got this message:
>>
>> ERROR [main] 2019-03-14 14:40:49,208 CassandraDaemon.java:654 - Exception encountered during startup
>> java.lang.RuntimeException: Cannot replace_address /10.xx.xx.xxx.xx because it doesn't exist in gossip
>>         at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:449) ~[apache-cassandra-2.2.8.jar:2.2.8]
>>
>> DN  10.xx.xx.xx  388.43 KB  256  6.9%  bdbd632a-bf5d-44d4-b220-f17f258c4701  1e
>>
>> Under what conditions does this happen?
>>
>> ----------------
>> Thank you
>
> Stefan Miklosovic
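For reference, the failed replacement in the thread above is normally started by setting the replace_address flag on the new node, and gossip's view of the dead node can be checked first from any live node. A sketch (the elided IP is left as a placeholder):

    # Does gossip still know about the down node? Run on a live node:
    nodetool gossipinfo | grep -A 5 '10.xx.xx.xx'

    # If it does, start the replacement node with the flag set,
    # e.g. in cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.xx.xx.xx"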