[ https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068186#comment-15068186 ]

Didier edited comment on CASSANDRA-10371 at 12/22/15 2:43 PM:
--------------------------------------------------------------

Hi Stefania,

You are perfectly right! I fixed my issue just as you wrote your answer. My 
problem is that there are in fact a lot of nodes impacted in this mess, not 
just one (multi-DC: Europe / US).

I set up these entries in log4j-server.properties on one node:

{code}
log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
{code}
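
(As a quick sanity check that the TRACE entries are actually being written, 
something like the line below works; the log path is an assumption, adjust it 
to your install:)

{code}
# Hypothetical check, not part of the original steps: count the gossip handler
# TRACE lines after enabling the log4j entries above (default log location assumed).
grep -c "GossipDigestSynVerbHandler" /var/log/cassandra/system.log
{code}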

With this trick I found the culprit nodes with a simple tail of the system.log:

I just ran: tail -f system.log | grep "TRACE" | grep -A 10 -B 10 
"192.168.136.28"

{code}
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /10.0.2.110
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /10.10.102.97:1448271725:7650177 /10.10.2.23:1450793863:1377 /10.0.102.190:1448275278:7636527 /10.0.2.36:1450792729:4816 /192.168.136.28:1449485228:258388
{code}
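
(For reference, a small sketch that automates the same grep and lists every 
sender node still advertising the phantom IP; it assumes the two TRACE lines 
land back to back in the log, as in the sample above, and uses the phantom IP 
from this example:)

{code}
# Sketch only (assumptions: TRACE logging enabled as above, the "Received ..." and
# "Gossip syn digests are ..." lines are adjacent, phantom IP taken from the example).
PHANTOM="192.168.136.28"
grep "GossipDigestSynVerbHandler" system.log \
  | grep -B 1 "$PHANTOM" \
  | grep -o "Received a GossipDigestSynMessage from /[0-9.]*" \
  | awk '{print $NF}' \
  | sort -u
{code}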

Every time I found a match with a phantom node IP in the gossip syn digests, I 
ran this on the affected node (in this example 10.0.2.110):

{code}
nodetool drain && /etc/init.d/cassandra restart
{code}
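
(With many affected nodes this can be looped over SSH; a rough sketch, assuming 
passwordless SSH, the init script path above, and a hypothetical list of 
affected IPs:)

{code}
# Rough sketch, not from the original comment: roll the drain + restart over every
# node found advertising the phantom IP (the IP list and SSH access are assumptions).
for node in 10.0.2.110 10.0.102.190; do
    echo "Draining and restarting $node"
    ssh "$node" "nodetool drain && /etc/init.d/cassandra restart"
done
{code}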

After going through some nodes (15 in total), I checked whether any entries 
with the phantom nodes were still showing up in my system.log... and voila! 
No more phantom nodes.
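
(One extra way to double-check, since nodetool gossipinfo prints every endpoint 
state a node still knows about; the phantom IP below is the one from this 
example:)

{code}
# Verification sketch: no output means the node no longer carries the phantom endpoint state.
nodetool gossipinfo | grep "192.168.136.28"
{code}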

Thanks for your help ;)

Didier



> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
>                 Key: CASSANDRA-10371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Brandon Williams
>            Assignee: Stefania
>            Priority: Minor
>
> This may apply to other dead states as well.  Dead states should be expired 
> after 3 days.  In the case of decom we attach a timestamp to let the other 
> nodes know when it should be expired.  It has been observed that sometimes a 
> subset of nodes in the cluster never expire the state, and through heap 
> analysis of these nodes it is revealed that the epstate.isAlive check returns 
> true when it should return false, which would allow the state to be evicted.  
> This may have been affected by CASSANDRA-8336.


