Hi Alex,

About the previous advice:

> You might have inconsistent data in your system tables. Try setting the
> consistency level to ALL, then do a read query of the system tables to force
> a repair.

System tables use 'LocalStrategy', so I don't think any repair would
happen for the system.* tables, regardless of the consistency level you
use. It should not harm, but I really think it won't help.
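
You can double check that with cqlsh, for example (just a quick sanity check,
assuming cqlsh can connect to the local node):

    cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'system';"

You should see 'LocalStrategy' in the replication map, meaning this data is
purely local to each node and is never repaired across nodes.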

> This will sound a little silly but, have you tried rolling the cluster?


On the contrary, the rolling restart does not sound that silly to me.
I would try it before touching any 'deeper' systems. It has indeed
sometimes worked some magic for me as well. It's hard to guess at this
kind of ghost node issue without working on the machine (and sometimes
even with access to the machine I have had trouble =)). Also, a rolling
restart is an operation that should be easy to perform and with low
risk (if everything is well configured).
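
If you want to script it, a rough sketch could be something like this
(hypothetical host names, assuming SSH access and passwordless sudo on each
node; Nick's more detailed per-node steps are quoted below):

    for node in node1 node2 node3 node4 node5; do
        ssh "$node" 'nodetool drain && sudo service cassandra restart'
        # wait for the node to report Mode: NORMAL again before moving on
        until ssh "$node" 'nodetool netstats | grep -q "Mode: NORMAL"'; do
            sleep 10
        done
    done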

Another idea to explore:

You can actually select from the 'system.peers' table to check that all
(other) nodes are referenced on each node. There should not be any dead
nodes in there. By the way, you will see that different nodes have
slightly different data in system.peers; they are not in sync, so there
is no real way to 'repair' that table.
A 'SELECT' is safe. Deleting leftover 'peers' entries, if there are any,
shouldn't hurt since the node is dead anyway, but make sure you are doing
the right thing: you can easily break your cluster from there. I have not
seen an issue (a bug) of that kind for a while though. Normally you
should not have to go that deep and touch system tables.
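
For example, run something like this on each node in turn and compare the
results (assuming cqlsh can connect to each node; <node_ip> is a placeholder):

    cqlsh <node_ip> -e "SELECT peer, host_id, rpc_address FROM system.peers;"

And only if you are certain an entry is a leftover for a node that is gone
for good, the 'deep' and risky part would look like:

    cqlsh <node_ip> -e "DELETE FROM system.peers WHERE peer = '192.168.1.18';"

Again, double check before deleting anything in there.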

Also, removed nodes should disappear from 'peers' immediately but persist
for some time (7 days maybe?) in the gossip information (normally as
'LEFT'). This should not create the issue seen in 'nodetool
describecluster' though.
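
To check what gossip still remembers about that IP, something like this
should do (the -A value may need adjusting to see the whole entry):

    nodetool gossipinfo | grep -A 5 '/192.168.1.18'

and check the STATUS field (you mentioned it shows as 'LEFT' everywhere).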

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, 4 Apr 2019 at 16:09, Nick Hatfield <nick.hatfi...@metricly.com>
wrote:

> This will sound a little silly but, have you tried rolling the cluster?
>
>
>
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep ‘cassandra’
>
> # Make sure the process actually dies. If not, you may need to kill -9
> <pid>. First check whether nodetool can still connect (nodetool
> gossipinfo). If the connection is live and listening on the port, then just
> try re-running service cassandra stop again. Use kill -9 as a last resort.
>
> $> service cassandra start
> $> nodetool netstats | grep ‘NORMAL’  # wait for this to return before
> moving on to the next node.
>
>
>
> Restart them all using this method, then run nodetool status again and see
> if it is listed.
>
>
>
> One other thing: I recall you said something about having to terminate a
> node and then replace it. Make sure that whichever node you used the
> -Dreplace flag on does not still have it set when you start Cassandra on
> it again!
>
>
>
> *From:* Alex [mailto:m...@aca-o.com]
> *Sent:* Thursday, April 04, 2019 4:58 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Assassinate fails
>
>
>
> Hi Anthony,
>
> Thanks for your help.
>
> I tried to run it multiple times in quick succession but it fails with:
>
> -- StackTrace --
> java.lang.RuntimeException: Endpoint still alive: /192.168.1.18
> generation changed while trying to assassinate it
>         at org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)
>
> I can see that the generation number for this node increases by 1 every
> time I call nodetool assassinate; and the command itself waits for 30
> seconds before assassinating the node. When run multiple times in quick
> succession, the command fails because the generation number has been
> changed by the previous instance.
>
>
>
> In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.
>
> However, in 'nodetool describecluster', this node is marked as
> "unreachable" on 3 nodes out of 5.
>
>
>
> Alex
>
>
>
> On 04.04.2019 at 00:56, Anthony Grasso wrote:
>
> Hi Alex,
>
>
>
> We wrote a blog post on this topic late last year:
> http://thelastpickle.com/blog/2018/09/18/assassinate.html.
>
>
>
> In short, you will need to run the assassinate command on each node
> simultaneously a number of times in quick succession. This will generate a
> number of messages requesting all nodes completely forget there used to be
> an entry within the gossip state for the given IP address.
>
>
>
> Regards,
>
> Anthony
>
>
>
> On Thu, 4 Apr 2019 at 03:32, Alex <m...@aca-o.com> wrote:
>
> Same result it seems:
> Welcome to JMX terminal. Type "help" for available commands.
> $>open localhost:7199
> #Connection to localhost:7199 is opened
> $>bean org.apache.cassandra.net:type=Gossiper
> #bean is set to org.apache.cassandra.net:type=Gossiper
> $>run unsafeAssassinateEndpoint 192.168.1.18
> #calling operation unsafeAssassinateEndpoint of mbean
> org.apache.cassandra.net:type=Gossiper
> #RuntimeMBeanException: java.lang.NullPointerException
>
>
> There is not much more to see in the log files:
> WARN  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,626
> Gossiper.java:575 - Assassinating /192.168.1.18 via gossip
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:13,627
> Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does
> not change
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,628
> Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2019-04-03 16:25:43,631
> StorageService.java:2324 - Removing tokens [..] for /192.168.1.18
>
>
>
>
> On 03.04.2019 at 17:10, Nick Hatfield wrote:
> > Run assassinate the old way. It works very well...
> >
> > wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
> >
> > java -jar ./jmxterm.jar
> >
> > $>open localhost:7199
> >
> > $>bean org.apache.cassandra.net:type=Gossiper
> >
> > $>run unsafeAssassinateEndpoint 192.168.1.18
> >
> > $>quit
> >
> >
> > Happy deleting
> >
> > -----Original Message-----
> > From: Alex [mailto:m...@aca-o.com]
> > Sent: Wednesday, April 03, 2019 10:42 AM
> > To: user@cassandra.apache.org
> > Subject: Assassinate fails
> >
> > Hello,
> >
> > Short story:
> > - I had to replace a dead node in my cluster
> > - 1 week after, dead node is still seen as DN by 3 out of 5 nodes
> > - dead node has null host_id
> > - assassinate on dead node fails with error
> >
> > How can I get rid of this dead node ?
> >
> >
> > Long story:
> > I had a 3-node cluster (Cassandra 3.9); one node went dead. I built
> > a new node from scratch and "replaced" the dead node using the
> > information from this page:
> > https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html
> > It looked like the replacement went OK.
> >
> > I added two more nodes to strengthen the cluster.
> >
> > A few days have passed and the dead node is still visible and marked
> > as "down" on 3 of 5 nodes in nodetool status:
> >
> > --  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
> > UN  192.168.1.9   16 GiB     256          35.0%             76223d4c-9d9f-417f-be27-cebb791cddcc  rack1
> > UN  192.168.1.12  16.09 GiB  256          34.0%             719601e2-54a6-440e-a379-c9cf2dc20564  rack1
> > UN  192.168.1.14  14.16 GiB  256          32.6%             d8017a03-7e4e-47b7-89b9-cd9ec472d74f  rack1
> > UN  192.168.1.17  15.4 GiB   256          34.1%             fa238b21-1db1-47dc-bfb7-beedc6c9967a  rack1
> > DN  192.168.1.18  24.3 GiB   256          33.7%             null                                  rack1
> > UN  192.168.1.22  19.06 GiB  256          30.7%             09d24557-4e98-44c3-8c9d-53c4c31066e1  rack1
> >
> > Its host ID is null, so I cannot use nodetool removenode. Moreover,
> > nodetool assassinate 192.168.1.18 fails with:
> >
> > error: null
> > -- StackTrace --
> > java.lang.NullPointerException
> >
> > And in system.log:
> >
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:39:38,595 Gossiper.java:585 - Sleeping for 30000ms to ensure /192.168.1.18 does not change
> > INFO  [CompactionExecutor:547] 2019-03-27 17:39:38,669 AutoSavingCache.java:393 - Saved KeyCache (27316 items) in 163 ms
> > INFO  [IndexSummaryManager:1] 2019-03-27 17:40:03,620 IndexSummaryRedistribution.java:75 - Redistributing index summaries
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,597 Gossiper.java:1029 - InetAddress /192.168.1.18 is now DOWN
> > INFO  [RMI TCP Connection(16)-127.0.0.1] 2019-03-27 17:40:08,599 StorageService.java:2324 - Removing tokens [-1061369577393671924,...]
> > ERROR [GossipStage:1] 2019-03-27 17:40:08,600 CassandraDaemon.java:226 - Exception in thread Thread[GossipStage:1,5,main]
> > java.lang.NullPointerException: null
> >
> >
> > In system.peers, the dead node shows up and has the same host ID as the
> > replacing node:
> >
> > cqlsh> select peer, host_id from system.peers;
> >
> >   peer         | host_id
> > --------------+--------------------------------------
> >   192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
> >   192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
> >    192.168.1.9 | 76223d4c-9d9f-417f-be27-cebb791cddcc
> >   192.168.1.14 | d8017a03-7e4e-47b7-89b9-cd9ec472d74f
> >   192.168.1.12 | 719601e2-54a6-440e-a379-c9cf2dc20564
> >
> > Dead node and replacing node have different tokens in system.peers.
> >
> > I should add that I also tried decommission on a node that still had
> > 192.168.1.18 in its peers - it is still marked as "leaving" 5 days
> > later. Nothing in nodetool netstats or nodetool compactionstats.
> >
> >
> > Thank you for taking the time to read this. Hope you can help.
> >
> > Alex
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
>
