[ 
https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066681#comment-15066681
 ] 

Didier commented on CASSANDRA-10371:
------------------------------------

Hi Stefania,

Thanks for your quick answer.

I attach the TRACE log for the phantom node 192.168.128.28:

3614313:TRACE [GossipStage:2] 2015-12-21 17:21:19,984 Gossiper.java (line 1155) 
requestAll for /192.168.128.28
3616877:TRACE [GossipStage:2] 2015-12-21 17:21:20,123 FailureDetector.java 
(line 205) reporting /192.168.128.28
3616881:TRACE [GossipStage:2] 2015-12-21 17:21:20,124 Gossiper.java (line 986) 
Adding endpoint state for /192.168.128.28
3616892:DEBUG [GossipStage:2] 2015-12-21 17:21:20,124 Gossiper.java (line 999) 
Not marking /192.168.128.28 alive due to dead state
3616897:TRACE [GossipStage:2] 2015-12-21 17:21:20,125 Gossiper.java (line 958) 
marking as down /192.168.128.28
3616908: INFO [GossipStage:2] 2015-12-21 17:21:20,125 Gossiper.java (line 962) 
InetAddress /192.168.128.28 is now DOWN
3616912:DEBUG [GossipStage:2] 2015-12-21 17:21:20,126 MessagingService.java 
(line 397) Resetting pool for /192.168.128.28
3616937:DEBUG [GossipStage:2] 2015-12-21 17:21:20,128 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616955:DEBUG [GossipStage:2] 2015-12-21 17:21:20,128 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616956:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616958:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616976:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616977:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616979:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616992:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616993:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616995:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3617008:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 
1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3617317:DEBUG [GossipStage:2] 2015-12-21 17:21:20,143 StorageService.java (line 
1699) Node /192.168.128.28 state left, tokens 
[100310405581336885248896672411729131592, ....... , 
99937615223192795414082780446763257757, 99975703478103230193804512094895677044]
3617321:DEBUG [GossipStage:2] 2015-12-21 17:21:20,144 Gossiper.java (line 1463) 
adding expire time for endpoint : /192.168.128.28 (1449830784335)
3617337: INFO [GossipStage:2] 2015-12-21 17:21:20,145 StorageService.java (line 
1781) Removing tokens [100310405581336885248896672411729131592, 
100598580285540169800869916837708042668, ....., 
99743016911284542884064313061048682083, 99937615223192795414082780446763257757, 
99975703478103230193804512094895677044] for /192.168.128.28
3617362:DEBUG [GossipStage:2] 2015-12-21 17:21:20,146 MessagingService.java 
(line 795) Resetting version for /192.168.128.28
3617367:DEBUG [GossipStage:2] 2015-12-21 17:21:20,147 Gossiper.java (line 410) 
removing endpoint /192.168.128.28
3631829:TRACE [GossipTasks:1] 2015-12-21 17:21:20,964 Gossiper.java (line 492) 
Gossip Digests are : /10.10.102.96:1448271659:7409547 
/10.0.102.190:1448275278:7395730 /10.10.102.94:1448271818:7409091 
/192.168.128.23:1450707984:20939 /10.10.102.8:1448271443:7409972 
/10.0.2.97:1448276012:7395072 /10.0.102.93:1448274183:7401036 
/192.168.136.26:1450708061:20700 /192.168.136.23:1450708062:20695 
/10.10.2.239:1448533274:6614346 /10.0.102.206:1448273613:7402527 
/10.0.102.92:1448274024:7401356 /10.0.2.143:1448275597:7396779 
/10.10.2.11:1448270678:7412474 /10.10.2.145:1448271264:7410576 
/192.168.128.32:1449151772:4740947 /10.0.2.5:1449149504:4746745 
/192.168.128.26:1450707983:20947 /192.168.136.22:1450708061:20700 
/10.0.102.94:1448274372:7400487 /10.0.2.109:1448276688:7393112 
/10.10.2.18:1448271203:7410982 /10.10.102.49:1448271974:7408616 
/10.10.102.192:1448271561:7409839 /192.168.128.31:1449151700:4741174 
/10.0.102.90:1448273911:7401771 /192.168.128.21:1450714541:1013 
/10.0.102.138:1448273504:7402737 /10.0.2.107:1448276554:7393892 
/10.0.2.105:1448276464:7393834 /10.10.2.10:1448270541:7412796 
/10.10.2.13:1448270948:7411786 /10.10.102.95:1448271895:7408758 
/192.168.128.30:1450427261:872385 /10.0.2.142:1448275345:7397252 
/10.0.102.113:1448274816:7398949 /10.10.102.97:1448271725:7409279 
/10.10.2.23:1448271352:7410212 /192.168.136.21:1450708063:20699 
/192.168.136.25:1450708061:20699 /192.168.136.24:1450708064:20688 
/10.0.2.110:1448276759:7393030 /192.168.128.25:1450707984:20942 
/10.0.102.125:1448275195:7397877 /10.0.2.36:1448276280:7394606 
/10.10.2.4:1448271033:7410975 /10.0.2.4:1448275709:7396295 
/192.168.128.28:1449485330:259526 /10.10.102.66:1448271505:7409736 
/192.168.128.22:1450707985:20936 /10.10.102.29:1448951289:5348480 
/10.10.2.121:1448271104:7410985 /10.0.2.108:1448276619:7393387 
/10.0.102.247:1448275119:7398016 /10.0.2.226:1448276163:7394860 
/10.0.102.95:1448274450:7400161 /192.168.128.29:1449151797:4740847 
/10.0.102.32:1448274522:7398608 /10.0.102.88:1448273810:7402146 
/10.0.2.166:1448276372:7394409 /10.10.102.38:1448961691:5316954 
/192.168.128.24:1450707985:20932
3632204:DEBUG [GossipTasks:1] 2015-12-21 17:21:20,983 Gossiper.java (line 741) 
time is expiring for endpoint : /192.168.128.28 (1449830784335)
3632208:DEBUG [GossipTasks:1] 2015-12-21 17:21:20,985 Gossiper.java (line 383) 
evicting /192.168.128.28 from gossip
3832305:TRACE [ReadStage:319] 2015-12-21 17:21:08,855 ColumnFamilyStore.java 
(line 1652) scanned 192.168.128.28
3853098:TRACE [ReadStage:322] 2015-12-21 17:21:09,978 ColumnFamilyStore.java 
(line 1652) scanned 192.168.128.28
3973963:DEBUG [GossipTasks:1] 2015-12-21 17:21:05,096 Gossiper.java (line 755) 
60000 elapsed, /192.168.128.28 gossip quarantine over


I can see the culprit IPs in the GossipDigestSynVerbHandler: 192.168.128.28 and 
192.168.136.28 (the 2 others, 192.168.128.27 and 192.168.136.27, are missing).
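
To make it easier to spot phantom endpoints in a "Gossip Digests are" line, here is a minimal sketch that splits such a line into its entries. It assumes each entry has the form /endpoint:generation:maxVersion, as the TRACE output above suggests; the names GossipDigest, parse_digests and suspects are hypothetical helpers for illustration, not part of Cassandra.

```python
from typing import List, NamedTuple


class GossipDigest(NamedTuple):
    endpoint: str
    generation: int   # assumed: epoch seconds at which the node last started
    max_version: int  # assumed: highest gossip version seen for the node


def parse_digests(text: str) -> List[GossipDigest]:
    """Parse whitespace-separated '/ip:generation:version' entries."""
    digests = []
    for token in text.split():
        if not token.startswith("/"):
            continue  # skip anything that is not a digest entry
        endpoint, generation, version = token[1:].rsplit(":", 2)
        digests.append(GossipDigest(endpoint, int(generation), int(version)))
    return digests


# Three entries copied from the digest line in the log above.
sample = (
    "/192.168.128.23:1450707984:20939 "
    "/192.168.128.28:1449485330:259526 "
    "/192.168.128.24:1450707985:20932"
)

# The four decommissioned IPs mentioned in this comment.
suspects = {"192.168.128.27", "192.168.128.28",
            "192.168.136.27", "192.168.136.28"}

found = [d for d in parse_digests(sample) if d.endpoint in suspects]
for digest in found:
    print(digest)
```

Run against the full digest line, this would list only the entries that belong to nodes that should already be gone.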

I have checked system.peers on every node in every DC of our cluster, and 
none of these IPs is still present. NTP seems to be OK and we don't have 
any clock desynchronisation.

The node 192.168.128.28 is in gossip quarantine, and every n seconds 
something tries to remove it without success. The node seems to have reached a 
time limit (time is expiring for endpoint : /192.168.128.28 (1449830784335)).
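
For reference, the expire time in that message can be converted to a readable date. The sketch below assumes the value is milliseconds since the Unix epoch and, for the comparison only, that the log timestamps are in UTC (the log does not state its time zone); epoch_millis_to_utc is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_utc(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    # timedelta arithmetic avoids float rounding of the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)


# Expire time taken from the "time is expiring for endpoint" log line.
expire_at = epoch_millis_to_utc(1449830784335)

# Timestamp of that log line; treating it as UTC is an assumption.
logged_at = datetime(2015, 12, 21, 17, 21, 20, 983000, tzinfo=timezone.utc)

print(expire_at.isoformat())  # 2015-12-11T10:46:24.335000+00:00
print((logged_at - expire_at).days, "days past the expire time")
```

Under these assumptions the expire time lies about ten days before the log entry, so the endpoint is already past its expiry each time it reappears.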

I have tried to assassinate it via JMX and to do a rolling restart of one DC 
(we have 4 DCs in this cluster). I also tried JVM_OPTS="$JVM_OPTS 
-Dcassandra.load_ring_state=false", but nothing worked.

If you have any advice, I'm all ears!

Best regards,

Didier



> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
>                 Key: CASSANDRA-10371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Brandon Williams
>            Assignee: Stefania
>            Priority: Minor
>
> This may apply to other dead states as well.  Dead states should be expired 
> after 3 days.  In the case of decom we attach a timestamp to let the other 
> nodes know when it should be expired.  It has been observed that sometimes a 
> subset of nodes in the cluster never expire the state, and through heap 
> analysis of these nodes it is revealed that the epstate.isAlive check returns 
> true when it should return false, which would allow the state to be evicted.  
> This may have been affected by CASSANDRA-8336.
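
The eviction rule described in the issue can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra's actual code: should_evict and DEAD_STATE_EXPIRY_MS are hypothetical names, and the real check may involve further conditions.

```python
# Three days in milliseconds, the expiry window stated in the description.
DEAD_STATE_EXPIRY_MS = 3 * 24 * 60 * 60 * 1000


def should_evict(is_alive: bool, now_ms: int, expire_ms: int) -> bool:
    """Evict a dead state only once it is not alive and past its expire time."""
    return (not is_alive) and now_ms > expire_ms


expire_ms = 1449830784335  # expire time from the log in this comment
now_ms = expire_ms + 1     # any moment after the expire time

# Expected behaviour: the state is expired and gets evicted.
print(should_evict(False, now_ms, expire_ms))

# Reported bug: the alive check wrongly returns true, so eviction never happens.
print(should_evict(True, now_ms, expire_ms))
```

The second call models the failure in the description: as long as the alive check returns true, the expire time is never acted on, no matter how far in the past it is.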



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
