Re: Assassinate fails

Jeff Jirsa Thu, 04 Apr 2019 08:22:56 -0700

How long ago did you remove this host from the cluster?



-- 
Jeff Jirsa


> On Apr 4, 2019, at 8:09 AM, Nick Hatfield <nick.hatfi...@metricly.com> wrote:
> 
> This will sound a little silly, but have you tried rolling the cluster?
>  
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep 'cassandra'
> 
> # Make sure the process actually dies. First check whether nodetool can still
> # connect (nodetool gossipinfo). If the connection is live and listening on the
> # port, just re-run service cassandra stop. Use kill -9 <pid> only as a last
> # resort.
> 
> $> service cassandra start
> # Wait for the next command to return a match before moving on to the next node.
> $> nodetool netstats | grep 'NORMAL'
>  
> Restart them all using this method, then run nodetool status again and see if 
> it is listed.
>  
> One other thing: I recall you said something about having to terminate a 
> node and then replace it. Make sure that whichever node you started with the 
> -Dcassandra.replace_address flag no longer has it set when you start Cassandra 
> on it again!
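The per-node restart described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names and the DRY_RUN switch are made up for illustration, the waits have no timeout, and there is no kill -9 fallback. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of one node's restart step. Run it on each node in turn.
# DRY_RUN and the function names are illustrative, not part of any Cassandra tooling.
set -u

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_node() {
    run nodetool flush
    run nodetool drain
    run service cassandra stop
    if [ "$DRY_RUN" != "1" ]; then
        # Wait for the JVM to exit (the manual `ps aux | grep` check).
        while pgrep -f CassandraDaemon >/dev/null; do sleep 1; done
    fi
    run service cassandra start
    if [ "$DRY_RUN" != "1" ]; then
        # Wait until netstats reports Mode: NORMAL before moving on.
        until nodetool netstats 2>/dev/null | grep -q 'NORMAL'; do sleep 5; done
    fi
}

restart_node
```

Running it with DRY_RUN=0 on a node performs the actual restart. Review the printed plan first.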
>  
> From: Alex [mailto:m...@aca-o.com] 
> Sent: Thursday, April 04, 2019 4:58 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>  
> Hi Anthony,
> 
> Thanks for your help.
> 
> I tried running it multiple times in quick succession, but it fails with:
> 
> -- StackTrace --
> java.lang.RuntimeException: Endpoint still alive: /192.168.1.18 generation 
> changed while trying to assassinate it
>         at 
> org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)
> 
> I can see that the generation number for this node increases by 1 every time 
> I call nodetool assassinate, and the command itself waits for 30 seconds 
> before assassinating the node. When run multiple times in quick succession, the 
> command fails because the generation number has been changed by the previous 
> instance.
> 
>  
> 
> In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.
> 
> However, in 'nodetool describecluster', this node is marked as "unreachable" 
> on 3 nodes out of 5.
> 
>  
> 
> Alex
> 
>  
> 
> On 04.04.2019 00:56, Anthony Grasso wrote:
> 
> Hi Alex,
>  
> We wrote a blog post on this topic late last year: 
> http://thelastpickle.com/blog/2018/09/18/assassinate.html.
>  
> In short, you will need to run the assassinate command on each node 
> simultaneously a number of times in quick succession. This will generate a 
> number of messages requesting all nodes completely forget there used to be an 
> entry within the gossip state for the given IP address.
>  
> Regards,
> Anthony
>  
> On Thu, 4 Apr 2019 at 03:32, Alex <m...@aca-o.com> wrote:
> Same result it seems:
> Welcome to JMX terminal. Type "help" for available commands.
> $>open localhost:7199
> #Connection to localhost:7199 is opened
> $>bean org.apache.cassandra.net:type=Gossiper
> #bean is set to org.apache.cassandra.net:type=Gossiper
> $>run unsafeAssassinateEndpoint 192.168.1.18
> #calling operation unsafeAssassinateEndpoint of mbean 
> org.apache.cassandra.net:type=Gossiper
> #RuntimeMBeanException: java.lang.NullPointerException
> 
> 
> There is not much more to see in the log files:
> WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626 
> Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627 
> Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does 
> not change
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628 
> Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631 
> StorageService.java:2324 - Removing tokens [..] for /192.168.1.18
> 
> 
> 
> 
> On 03.04.2019 17:10, Nick Hatfield wrote:
> > Run assassinate the old way. It works very well...
> > 
> > wget -q -O jmxterm.jar \
> >   http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
> > 
> > java -jar ./jmxterm.jar
> > 
> > $>open localhost:7199
> > 
> > $>bean org.apache.cassandra.net:type=Gossiper
> > 
> > $>run unsafeAssassinateEndpoint 192.168.1.18
> > 
> > $>quit
> > 
> > 
> > Happy deleting
> > 
> > -----Original Message-----
> > From: Alex [mailto:m...@aca-o.com]
> > Sent: Wednesday, April 03, 2019 10:42 AM
> > To: user@cassandra.apache.org
> > Subject: Assassinate fails
> > 
> > Hello,
> > 
> > Short story:
> > - I had to replace a dead node in my cluster
> > - 1 week after, dead node is still seen as DN by 3 out of 5 nodes
> > - dead node has null host_id
> > - assassinate on dead node fails with error
> > 
> > How can I get rid of this dead node?
> > 
> > 
> > Long story:
> > I had a 3-node cluster (Cassandra 3.9); one node died. I built
> > a new node from scratch and "replaced" the dead node using the
> > information from this page
> > https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html.
> > It looked like the replacement went ok.
> > 
> > I added two more nodes to strengthen the cluster.
> > 
> > A few days have passed and the dead node is still visible and marked
> > as "down" on 3 of 5 nodes in nodetool status:
> > 
> > --  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
> > UN  192.168.1.9   16 GiB     256     35.0%             76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
> > UN  192.168.1.12  16.09 GiB  256     34.0%             719601e2-54a6-440e-a379-c9cf2dc20564  rack1
> > UN  192.168.1.14  14.16 GiB  256     32.6%             d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
> > UN  192.168.1.17  15.4 GiB   256     34.1%             fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
> > DN  192.168.1.18  24.3 GiB   256     33.7%             null                                  rack1
> > UN  192.168.1.22  19.06 GiB  256     30.7%             09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1
> > 
> > Its host ID is null, so I cannot use nodetool removenode. Moreover
> > nodetool assassinate 192.168.1.18 fails with:
> > 
> > error: null
> > -- StackTrace --
> > java.lang.NullPointerException
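To spot entries like this one across several nodes, the status output can be filtered for down nodes whose host ID column reads "null". A minimal sketch, assuming the column layout shown above (host ID second to last, rack last); the function name is hypothetical and the sample rows are copied from this thread.

```shell
#!/usr/bin/env bash
# Print the address of every down node whose Host ID column is "null".
# Assumes the host ID is the second-to-last field and the rack is the last one.
find_null_host_ids() {
    awk '$1 ~ /^D[NLJM]$/ && $(NF-1) == "null" { print $2 }'
}

# Sample rows taken from the nodetool status output quoted in this thread.
find_null_host_ids <<'EOF'
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.17  15.4 GiB   256     34.1%             fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
DN  192.168.1.18  24.3 GiB   256     33.7%             null                                  rack1
UN  192.168.1.22  19.06 GiB  256     30.7%             09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1
EOF
# prints: 192.168.1.18
```

On a live node the same filter would be fed with `nodetool status | find_null_host_ids`.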
> > 
> > And in system.log:
> > 
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595
> > Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does
> > not change
> > INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669
> > AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms
> > INFO  [IndexSummaryManager:1] 2019-03-27 17:40:03,620
> > IndexSummaryRedistribution.java:75 - Redistributing index summaries
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597
> > Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,599
> > StorageService.java:2324 - Removing tokens [-1061369577393671924,...]
> > ERROR [GossipStage:1] 2019-03-27 17:40:08,600 CassandraDaemon.java:226
> > - Exception in thread Thread[GossipStage:1,5,main]
> > java.lang.NullPointerException: null
> > 
> > 
> > In system.peers, the dead node still appears and has the same host ID as
> > the replacing node:
> > 
> > cqlsh> select peer, host_id from system.peers;
> > 
> >   peer         | host_id
> > --------------+--------------------------------------
> >   192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
> >   192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
> >    192.168.1.9 | 76223d4c-9d9f-417f-be27-cebb791cddcc
> >   192.168.1.14 | d8017a03-7e4e-47b7-89b9-cd9ec472d74f
> >   192.168.1.12 | 719601e2-54a6-440e-a379-c9cf2dc20564
> > 
> > Dead node and replacing node have different tokens in system.peers.
> > 
> > I should add that I also tried decommission on a node that still has
> > 192.168.1.18 in its peers; it is still marked as "leaving" 5 days
> > later. Nothing shows in nodetool netstats or nodetool compactionstats.
> > 
> > 
> > Thank you for taking the time to read this. Hope you can help.
> > 
> > Alex
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
>
