Hello Alain,
it's been a long time - I had to wait for a quiet week to try this. I
finally did, and I thought I'd give you some feedback.
Short reminder: one of the nodes of my 3.9 cluster died and I replaced
it. But it still appeared in nodetool status, on one node with a "null"
host_id and on another with the same host_id as its replacement.
nodetool assassinate failed, and I could not decommission or remove any
other node in the cluster.
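For the record, the assassinate attempt was the standard form, taking the
dead node's IP, something like:

nodetool assassinate 192.168.1.18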
Basically, after taking a backup and preparing another cluster in case
anything went wrong, I ran:
DELETE FROM system.peers WHERE peer = '192.168.1.18';
and restarted cassandra on the two nodes still seeing the zombie node.
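Note that system.peers is local to each node, so the DELETE had to go
through cqlsh on each node still showing the zombie, something like
(<node_ip> standing in for each affected node):

cqlsh <node_ip> -e "DELETE FROM system.peers WHERE peer = '192.168.1.18';"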
After the first restart, the Cassandra system.log was filled with:

WARN [MutationStage-2] 2019-08-15 15:31:44,735
AbstractLocalAwareExecutorService.java:169 - Uncaught exception on
thread Thread[MutationStage-2,5,main]:
java.lang.NullPointerException: null
So... I restarted again. The error disappeared. I ran a full repair and
everything seems to be back in order. I could decommission a node
without any problem.
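For reference, the cleanup commands were the standard ones, along the
lines of:

nodetool repair --full
nodetool decommission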
Thanks for your help!
Alex
On 05.04.2019 10:55, Alain RODRIGUEZ wrote:
> Alex,
>
>> Well, I tried: rolling restart did not work its magic.
>
> Sorry to hear that, and sorry for misleading you. My faith in the rolling
> restart's magical power went down a bit, but I still think it was worth a try :D.
>
>> @ Alain: In system.peers I see both the dead node and its replacement with
>> the same ID:
>>
>>  peer         | host_id
>> --------------+--------------------------------------
>>  192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>>  192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>>
>> Is it expected?
>>
>> If I cannot fix this, I think I will add new nodes and remove, one by one,
>> the nodes that show the dead node in nodetool status.
>
> Well, no. This is clearly not good or expected, I would say.
>
> TL;DR - SUGGESTED FIX:
> What I would try, to fix this, is removing this row. It *should* be safe,
> but that's only my opinion, and on the condition that you remove *only* the
> 'ghost/dead' nodes. Any mistake here would probably be costly. Again, be
> aware you're touching a sensitive part when messing with system tables.
> Think twice, check twice, take a copy of the SSTables/a snapshot. Then
> I would go for it and observe changes on one node first. If no harm is done,
> continue to the next node.
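>
> For instance, to snapshot just the system keyspace beforehand (the tag
> name here is only an example):
>
> nodetool snapshot -t before-peers-delete -- system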
>
> Considering the old node is '192.168.1.18', I would run this on all nodes
> (maybe after testing on one node first) to keep it simple, or just on the
> nodes that show the ghost node(s):
>
> "DELETE FROM SYSTEM.PEERS WHERE PEER = '192.168.1.18';"
>
> Maybe you will need to restart, but I think you won't even need that. I have
> good hope that this should finally fix your issue with no harm.
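>
> To check the result, you can read the table back on each node (system.peers
> is node-local, every node keeps its own copy):
>
> SELECT peer, host_id FROM system.peers;
>
> The '192.168.1.18' row should be gone and the remaining host_id values unique.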
>
> MORE CONTEXT - IDEA OF THE PROBLEM:
> The above is clearly an issue, I would say, and most probably the source of
> your troubles here. The problem is that I lack understanding: from where I
> stand, this kind of bug should not happen anymore in Cassandra (I have not
> seen anything similar for a while).
>
> I would blame:
> - A corner case scenario (unlikely, system tables have been rather solid for a
> while). Or maybe you are using an old C* version. It *might* be related to
> this (or similar): https://issues.apache.org/jira/browse/CASSANDRA-7122
> - A really weird operation (a succession of actions might have put you in this
> state, but it's hard for me to say what)
> - KairosDB? I don't know it or what it does. Might it be less reliable than
> Cassandra is, and have led to this issue? Maybe, I have no clue once again.
>
> RISK OF THIS OPERATION AND CURRENT SITUATION:
> Also, I *think* the current situation is relatively 'stable' (maybe just some
> hints being stored for nothing, and possibly not being able to add more nodes
> or change the schema?). This is the kind of situation where 'rushing' a
> solution without understanding the impacts and risks can make things go
> terribly wrong. Take the time to analyse my suggested fix, maybe read the
> ticket above, etc. When you're ready, back up the data, prepare the DELETE
> command carefully, and observe how 1 node reacts to the fix first.
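>
> For instance, keep an eye on the logs while applying it on that first node
> (the path depends on your installation):
>
> tail -f /var/log/cassandra/system.log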
>
> As you can see, I think it's the 'right' fix, but I'm not comfortable with
> this operation. And you should not be either :).
> To share my feeling about this operation with arbitrary numbers: I would say
> there is a 95% chance this does not hurt and a 90% chance it fixes the issue,
> but if something goes wrong, if we are in the 5% where it does not go well,
> there is a non-negligible probability that you will destroy your cluster in a
> very bad way. I guess what I am trying to say is: be careful, watch your step,
> make sure you remove the right row, and check it works on one node with no harm.
> I shared my feeling and I would try this fix. But it's ultimately your
> responsibility, and I won't be behind the machine when you fix it.