[jira] [Comment Edited] (CASSANDRA-10371) Decommissioned nodes can remain in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068186#comment-15068186 ] Didier edited comment on CASSANDRA-10371 at 12/22/15 2:43 PM:
--
Hi Stefania,

You are perfectly right! I had just fixed my issue when you wrote your answer. My problem was that many nodes were caught up in this mess (not just one: multi-DC, Europe / US).

I set up these entries in log4j-server.properties on one node:
{code}
log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
{code}
With this trick I found the culprit nodes with a simple tail on system.log; I just ran:
{code}
tail -f system.log | grep "TRACE" | grep -A 10 -B 10 "192.168.136.28"
{code}
{code}
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /10.0.2.110
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are : /10.10.102.97:1448271725:7650177 /10.10.2.23:1450793863:1377 /10.0.102.190:1448275278:7636527 /10.0.2.36:1450792729:4816 /192.168.136.28:1449485228:258388
{code}
Every time I found a phantom node IP in the gossip syn digests, I ran this on the node that sent them (in this example 10.0.2.110):
{code}
nodetool drain && /etc/init.d/cassandra restart
{code}
After restarting some nodes (15 nodes), I checked whether any entries with the phantom nodes remained in my system.log... and voila! No more phantom nodes.

Thanks for your help ;)

Didier

> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
>                 Key: CASSANDRA-10371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Brandon Williams
>            Assignee: Stefania
>            Priority: Minor
>
> This may apply to other dead states as well. Dead states should be expired after 3 days. In the case of decom we attach a timestamp to let the other nodes know when it should be expired. It has been observed that sometimes a subset of nodes in the cluster never expire the state, and through heap analysis of these nodes it is revealed that the epstate.isAlive check returns true when it should return false, which would allow the state to be evicted. This may have been affected by CASSANDRA-8336.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
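The tail-and-grep workflow from the comment above can be wrapped in a small shell helper. This is only a sketch: the function name is invented, and it assumes the log format matches the two TRACE lines quoted in the comment.

```shell
# find_syn_senders LOGFILE PHANTOM_IP
# Prints the unique nodes whose gossip syn digests still advertise the
# phantom IP; each one is a candidate for the drain-and-restart treatment.
find_syn_senders() {
  awk -v ip="$2" '
    # Remember the last node a GossipDigestSynMessage was received from.
    /Received a GossipDigestSynMessage from/ { sender = $NF; sub("^/", "", sender) }
    # If the digest list that follows mentions the phantom IP, report the sender.
    /Gossip syn digests are/ && index($0, "/" ip ":") { print sender }
  ' "$1" | sort -u
}
```

For example, `find_syn_senders /var/log/cassandra/system.log 192.168.136.28` would print `10.0.2.110` for the log excerpt above.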
[jira] [Comment Edited] (CASSANDRA-10371) Decommissioned nodes can remain in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066681#comment-15066681 ] Didier edited comment on CASSANDRA-10371 at 12/21/15 4:43 PM:
--
Hi Stefania,

Thanks for your quick answer. I attach the TRACE log for phantom node 192.168.128.28:
{code}
3614313:TRACE [GossipStage:2] 2015-12-21 17:21:19,984 Gossiper.java (line 1155) requestAll for /192.168.128.28
3616877:TRACE [GossipStage:2] 2015-12-21 17:21:20,123 FailureDetector.java (line 205) reporting /192.168.128.28
3616881:TRACE [GossipStage:2] 2015-12-21 17:21:20,124 Gossiper.java (line 986) Adding endpoint state for /192.168.128.28
3616892:DEBUG [GossipStage:2] 2015-12-21 17:21:20,124 Gossiper.java (line 999) Not marking /192.168.128.28 alive due to dead state
3616897:TRACE [GossipStage:2] 2015-12-21 17:21:20,125 Gossiper.java (line 958) marking as down /192.168.128.28
3616908: INFO [GossipStage:2] 2015-12-21 17:21:20,125 Gossiper.java (line 962) InetAddress /192.168.128.28 is now DOWN
3616912:DEBUG [GossipStage:2] 2015-12-21 17:21:20,126 MessagingService.java (line 397) Resetting pool for /192.168.128.28
3616937:DEBUG [GossipStage:2] 2015-12-21 17:21:20,128 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616955:DEBUG [GossipStage:2] 2015-12-21 17:21:20,128 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616956:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616958:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616976:DEBUG [GossipStage:2] 2015-12-21 17:21:20,129 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616977:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616979:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616992:DEBUG [GossipStage:2] 2015-12-21 17:21:20,130 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616993:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3616995:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3617008:DEBUG [GossipStage:2] 2015-12-21 17:21:20,131 StorageService.java (line 1370) Ignoring state change for dead or unknown endpoint: /192.168.128.28
3617317:DEBUG [GossipStage:2] 2015-12-21 17:21:20,143 StorageService.java (line 1699) Node /192.168.128.28 state left, tokens [100310405581336885248896672411729131592, ..., 99937615223192795414082780446763257757, 99975703478103230193804512094895677044]
3617321:DEBUG [GossipStage:2] 2015-12-21 17:21:20,144 Gossiper.java (line 1463) adding expire time for endpoint : /192.168.128.28 (1449830784335)
3617337: INFO [GossipStage:2] 2015-12-21 17:21:20,145 StorageService.java (line 1781) Removing tokens [100310405581336885248896672411729131592, 100598580285540169800869916837708042668, ..., 99743016911284542884064313061048682083, 99937615223192795414082780446763257757, 99975703478103230193804512094895677044] for /192.168.128.28
3617362:DEBUG [GossipStage:2] 2015-12-21 17:21:20,146 MessagingService.java (line 795) Resetting version for /192.168.128.28
3617367:DEBUG [GossipStage:2] 2015-12-21 17:21:20,147 Gossiper.java (line 410) removing endpoint /192.168.128.28
3631829:TRACE [GossipTasks:1] 2015-12-21 17:21:20,964 Gossiper.java (line 492) Gossip Digests are : /10.10.102.96:1448271659:7409547 /10.0.102.190:1448275278:7395730 /10.10.102.94:1448271818:7409091 /192.168.128.23:1450707984:20939 /10.10.102.8:1448271443:7409972 /10.0.2.97:1448276012:7395072 /10.0.102.93:1448274183:7401036 /192.168.136.26:1450708061:20700 /192.168.136.23:1450708062:20695 /10.10.2.239:1448533274:6614346 /10.0.102.206:1448273613:7402527 /10.0.102.92:1448274024:7401356 /10.0.2.143:1448275597:7396779 /10.10.2.11:1448270678:7412474 /10.10.2.145:1448271264:7410576 /192.168.128.32:1449151772:4740947 /10.0.2.5:1449149504:4746745 /192.168.128.26:1450707983:20947 /192.168.136.22:1450708061:20700 /10.0.102.94:1448274372:7400487 /10.0.2.109:1448276688:7393112 /10.10.2.18:1448271203:7410982 /10.10.102.49:1448271974:7408616 /10.10.102.192:1448271561:7409839 /192.168.128.31:1449151700:4741174 /10.0.102.90:1448273911:7401771 /192.168.128.21:1450714541:1013 /10.0.102.138:1448273504:7402737 /10.0.2.107:1448276554:7393892 /10.0.2.105:1448276464:7393834 /10.10.2.10:1448270541:7412796 /10.10.
{code}
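The burst of "Ignoring state change for dead or unknown endpoint" lines in the log above can be summarised per endpoint with a short pipeline, which makes the noisiest phantom nodes obvious. A sketch only: the helper name is invented, and the message text is assumed to match StorageService's log line quoted above.

```shell
# count_ignored LOGFILE
# Tallies "Ignoring state change for dead or unknown endpoint" messages
# per endpoint and lists the endpoints, noisiest first.
count_ignored() {
  grep -o 'Ignoring state change for dead or unknown endpoint: /[0-9.]*' "$1" \
    | awk -F': /' '{ count[$2]++ } END { for (e in count) print count[e], e }' \
    | sort -rn
}
```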
[jira] [Comment Edited] (CASSANDRA-10371) Decommissioned nodes can remain in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063764#comment-15063764 ] Didier edited comment on CASSANDRA-10371 at 12/18/15 9:46 AM:
--
Is a fix for this issue planned for the 2.0.x branch? I have this problem in production with C* 2.0.16; is it fixed in C* 2.0.17?

Every n minutes we get a gossip flood like this:
{code}
INFO [GossipStage:2] 2015-12-18 10:29:05,082 Gossiper.java (line 962) InetAddress /192.168.128.27 is now DOWN
INFO [GossipStage:2] 2015-12-18 10:29:05,083 StorageService.java (line 1781) Removing tokens [100029758220565479311893935069170672938, ..., 99324782484008101117663863086419168046] for /192.168.128.27
INFO [GossipStage:2] 2015-12-18 10:40:44,253 Gossiper.java (line 962) InetAddress /192.168.128.27 is now DOWN
INFO [GossipStage:2] 2015-12-18 10:40:44,254 StorageService.java (line 1781) Removing tokens [100029758220565479311893935069170672938, ..., 99324782484008101117663863086419168046] for /192.168.128.27
{code}
The impacted nodes aren't in system.peers or in nodetool ring/status, and they were decommissioned properly from the DC. Do you plan to publish a 2.0.18 release with a fix, or do you recommend upgrading to C* 2.1 or later? We also tried to assassinate the impacted nodes via JMX, but without any success.

Best regards,

Didier
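The "every n minutes" claim in the comment above can be checked by pulling the timestamps of each DOWN event for the phantom node. A sketch only: the function name is invented, and the field positions assume the log4j layout of the INFO lines quoted above.

```shell
# down_times LOGFILE PHANTOM_IP
# Prints the date and time of every "is now DOWN" event for the phantom,
# making the recurrence interval of the gossip flood easy to read off.
down_times() {
  grep "InetAddress /$2 is now DOWN" "$1" | awk '{ print $3, $4 }'
}
```

Running it against the excerpt above would show two events roughly eleven minutes apart (10:29:05 and 10:40:44).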