Re: Re : Decommissioned nodes show as DOWN in Cassandra versions 2.1.12 - 2.1.16

sai krishnam raju potturi Fri, 27 Jan 2017 10:19:04 -0800

FYI: This issue is related to the CASSANDRA-10205
<https://issues.apache.org/jira/browse/CASSANDRA-10205> patch to the
Gossiper class, introduced in 2.1.11. When we roll back the CASSANDRA-10205
changes in 2.1.12 and 2.1.15, everything works as expected. We still need
to run further tests on our end, though.


One more observation: after 72 hours, the decommissioned nodes no longer
show up as "UNREACHABLE" in "nodetool describecluster", and the cluster
looks normal.

Thanks, Pillai, but the IP address does not exist in the system.peers table
on any of the nodes. Unsafe assassinate is not our preferred option when we
decommission a datacenter consisting of more than 100 nodes.

Kurt, we have not tested versions 2.1.7 and 2.1.8 yet.

Pratik, I'm not sure your issue relates to this one: we observe the node
as UNREACHABLE in "nodetool describecluster", whereas "nodetool gossipinfo"
generally keeps showing information about decommissioned nodes for a
while, which is expected behaviour.
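
For anyone who wants to check a large cluster without reading the output by
eye, here is a minimal sketch that pulls the addresses off the UNREACHABLE
line of "nodetool describecluster". The function name and the sample text are
made up for illustration, and the sample layout is only my assumption of what
the 2.1 output looks like, so adjust the pattern if yours differs.

```python
import re

# Assumed layout of the relevant part of "nodetool describecluster" output.
# The addresses and schema id are placeholders, not taken from a real cluster.
SAMPLE_OUTPUT = """Cluster Information:
    Name: Test Cluster
    Schema versions:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]

        UNREACHABLE: [10.0.0.3, 10.0.0.4]
"""


def unreachable_nodes(describecluster_output):
    """Return the addresses listed on any UNREACHABLE line (hypothetical helper)."""
    nodes = []
    for line in describecluster_output.splitlines():
        match = re.match(r"\s*UNREACHABLE:\s*\[(.*)\]\s*$", line)
        if match:
            # Split the bracketed, comma-separated address list.
            nodes.extend(a.strip() for a in match.group(1).split(",") if a.strip())
    return nodes


if __name__ == "__main__":
    print(unreachable_nodes(SAMPLE_OUTPUT))
```

Feeding it the captured output from each node and comparing the results
against the list of decommissioned IPs should show which nodes still carry
the stale state.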

thanks
Sai




On Fri, Jan 27, 2017 at 12:54 PM, Harikrishnan Pillai <
hpil...@walmartlabs.com> wrote:

> Please remove the IPs from the system.peers table on all nodes, or you can
> use unsafeAssassinate from JMX.
>
>
> ------------------------------
> *From:* Agrawal, Pratik <paagr...@amazon.com>
> *Sent:* Friday, January 27, 2017 9:05:43 AM
> *To:* user@cassandra.apache.org; k...@instaclustr.com; pskraj...@gmail.com
> *Cc:* Sun, Guan
>
> *Subject:* Re: Re : Decommissioned nodes show as DOWN in Cassandra
> versions 2.1.12 - 2.1.16
>
> We are seeing the same issue with Cassandra 2.0.8. The nodetool gossipinfo
> reports a node being down even after we decommission the node from the
> cluster.
>
> Thanks,
> Pratik
>
> From: kurt greaves <k...@instaclustr.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, January 27, 2017 at 5:54 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Re : Decommissioned nodes show as DOWN in Cassandra versions
> 2.1.12 - 2.1.16
>
> We've seen this issue on a few clusters, including on 2.1.7 and 2.1.8.
> Pretty sure it is a known issue in gossip. It seems to be fixed in later
> versions.
>
> On 24 Jan 2017 06:09, "sai krishnam raju potturi" <pskraj...@gmail.com>
> wrote:
>
>> In Cassandra versions 2.1.11 - 2.1.16, after we decommission a node
>> or datacenter, we observe the decommissioned nodes marked as DOWN in the
>> cluster when we run "nodetool describecluster". The nodes, however, do not
>> show up in the "nodetool status" output.
>> The decommissioned node also does not show up in the "system.peers" table
>> on the nodes.
>>
>> The workaround we follow is a rolling restart of the cluster, which removes
>> the decommissioned nodes from the "UNREACHABLE" state and shows the actual
>> state of the cluster. The workaround is tedious for huge clusters.
>>
>> We also verified the decommission process with the CCM tool, and observed
>> the same issue for clusters with versions from 2.1.12 to 2.1.16. The issue
>> was not observed in versions prior to or later than the ones mentioned above.
>>
>>
>> Has anybody in the community observed a similar issue? We've also raised a
>> JIRA issue regarding this:
>> https://issues.apache.org/jira/browse/CASSANDRA-13144
>>
>>
>> Below are the observed logs from the versions without the bug, and with
>> the bug.  The ones highlighted in yellow show the expected logs. The ones
>> highlighted in red are the ones where the node is recognized as down and
>> shows as UNREACHABLE.
>>
>>
>>
>> Cassandra 2.1.1 Logs showing the decommissioned node :  (Without the bug)
>>
>> 2017-01-19 20:18:56,415 [GossipStage:1] DEBUG ArrivalWindow Ignoring
>> interval time of 2049943233 for /X.X.X.X
>> 2017-01-19 20:18:56,416 [GossipStage:1] DEBUG StorageService Node
>> /X.X.X.X state left, tokens [ 59353109817657926242901533144729725259,
>> 60254520910109313597677907197875221475, 
>> 75698727618038614819889933974570742305,
>> 84508739091270910297310401957975430578]
>> 2017-01-19 20:18:56,416 [GossipStage:1] DEBUG Gossiper adding expire
>> time for endpoint : /X.X.X.X (1485116334088)
>> 2017-01-19 20:18:56,417 [GossipStage:1] INFO StorageService Removing
>> tokens [100434964734820719895982857900842892337,
>> 114144647582686041354301802358217767299, 
>> 132090888860517964702932350041942412177,
>> 138409460913927199437556572481804704749] for /X.X.X.X
>> 2017-01-19 20:18:56,418 [HintedHandoff:3] INFO HintedHandOffManager
>> Deleting any stored hints for /X.X.X.X
>> 2017-01-19 20:18:56,424 [GossipStage:1] DEBUG MessagingService Resetting
>> version for /X.X.X.X
>> 2017-01-19 20:18:56,424 [GossipStage:1] DEBUG Gossiper removing endpoint
>> /X.X.X.X
>> 2017-01-19 20:18:56,437 [GossipStage:1] DEBUG StorageService Ignoring
>> state change for dead or unknown endpoint: /X.X.X.X
>> 2017-01-19 20:19:02,022 [WRITE-/X.X.X.X] DEBUG OutboundTcpConnection
>> attempting to connect to /X.X.X.X
>> 2017-01-19 20:19:02,023 [HANDSHAKE-/X.X.X.X] INFO OutboundTcpConnection
>> Handshaking version with /X.X.X.X
>> 2017-01-19 20:19:02,023 [WRITE-/X.X.X.X] DEBUG MessagingService Setting
>> version 7 for /X.X.X.X
>> 2017-01-19 20:19:08,096 [GossipStage:1] DEBUG ArrivalWindow Ignoring
>> interval time of 2074454222 for /X.X.X.X
>> 2017-01-19 20:19:54,407 [GossipStage:1] DEBUG ArrivalWindow Ignoring
>> interval time of 4302985797 for /X.X.X.X
>> 2017-01-19 20:19:57,405 [GossipTasks:1] DEBUG Gossiper 60000 elapsed,
>> /X.X.X.X gossip quarantine over
>> 2017-01-19 20:19:57,455 [GossipStage:1] DEBUG ArrivalWindow Ignoring
>> interval time of 3047826501 for /X.X.X.X
>> 2017-01-19 20:19:57,455 [GossipStage:1] DEBUG StorageService Ignoring
>> state change for dead or unknown endpoint: /X.X.X.X
>>
>>
>> Cassandra 2.1.16 Logs showing the decommissioned node :   (The logs in
>> 2.1.16 match those of 2.1.1 up to "DEBUG Gossiper 60000 elapsed, /X.X.X.X
>> gossip quarantine over", which is then followed by "NODE is now DOWN")
>>
>> 2017-01-19 19:52:23,687 [GossipStage:1] DEBUG StorageService.java:1883 -
>> Node /X.X.X.X state left, tokens [-1112888759032625467,
>> -228773855963737699, -3114550423754381391, -4848625944949064281,
>> -6920961603460018610, -8566729719076824066, 1611098831406674636,
>> 7278843689020594771, 7565410054791352413, 9166885764,
>> 8654747784805453046]
>> 2017-01-19 19:52:23,688 [GossipStage:1] DEBUG Gossiper.java:1520 -
>> adding expire time for endpoint : /X.X.X.X (1485114743567)
>> 2017-01-19 19:52:23,688 [GossipStage:1] INFO StorageService.java:1965 -
>> Removing tokens [-1112888759032625467, -228773855963737699,
>> -3114550423754381391, -4848625944949064281, -6920961603460018610,
>> 5690722015779071557, 6202373691525063547, 7191120402564284381,
>> 7278843689020594771, 7565410054791352413, 8524200089166885764,
>> 8654747784805453046] for /X.X.X.X
>> 2017-01-19 19:52:23,689 [HintedHandoffManager:1] INFO
>> HintedHandOffManager.java:230 - Deleting any stored hints for /X.X.X.X
>> 2017-01-19 19:52:23,689 [GossipStage:1] DEBUG MessagingService.java:840
>> - Resetting version for /X.X.X.X
>> 2017-01-19 19:52:23,690 [GossipStage:1] DEBUG Gossiper.java:417 -
>> removing endpoint /X.X.X.X
>> 2017-01-19 19:52:23,691 [GossipStage:1] DEBUG StorageService.java:1552 -
>> Ignoring state change for dead or unknown endpoint: /X.X.X.X
>> 2017-01-19 19:52:31,617 [MessagingService-Outgoing-/X.X.X.X] DEBUG
>> OutboundTcpConnection.java:372 - attempting to connect to /X.X.X.X
>> 2017-01-19 19:52:31,618 [HANDSHAKE-/X.X.X.X] INFO
>> OutboundTcpConnection.java:488 - Handshaking version with /X.X.X.X
>> 2017-01-19 19:52:31,619 [MessagingService-Outgoing-/X.X.X.X] DEBUG
>> MessagingService.java:826 - Setting version 8 for /X.X.X.X
>> 2017-01-19 19:53:19,914 [GossipStage:1] DEBUG FailureDetector.java:423 -
>> Ignoring interval time of 6004119075 for /X.X.X.X
>> 2017-01-19 19:53:23,702 [GossipTasks:1] DEBUG Gossiper.java:795 - 60000
>> elapsed, /X.X.X.X gossip quarantine over
>> 2017-01-19 19:53:23,985 [GossipStage:1] DEBUG StorageService.java:1552 -
>> Ignoring state change for dead or unknown endpoint: /X.X.X.X
>> 2017-01-19 19:53:26,223 [GossipStage:1] DEBUG FailureDetector.java:423 -
>> Ignoring interval time of 6309159352 for /X.X.X.X
>> 2017-01-19 19:53:50,709 [GossipTasks:1] DEBUG Gossiper.java:336 -
>> Convicting /X.X.X.X with status LEFT - alive true
>> 2017-01-19 19:53:50,709 [GossipTasks:1] INFO Gossiper.java:1008 -
>> InetAddress /X.X.X.X is now DOWN
>> 2017-01-19 19:53:50,709 [GossipTasks:1] DEBUG MessagingService.java:429
>> - Resetting pool for /X.X.X.X
>> 2017-01-19 19:53:51,710 [GossipTasks:1] DEBUG Gossiper.java:336 -
>> Convicting /X.X.X.X with status LEFT - alive false
>> 2017-01-19 19:53:53,711 [MessagingService-Outgoing-/X.X.X.X] DEBUG
>> OutboundTcpConnection.java:372 - attempting to connect to /X.X.X.X
>> 2017-01-19 19:53:53,711 [GossipTasks:1] DEBUG Gossiper.java:336 -
>> Convicting /X.X.X.X with status LEFT - alive false
>> 2017-01-19 19:53:54,711 [GossipTasks:1] DEBUG Gossiper.java:336 -
>> Convicting /X.X.X.X with status LEFT - alive false
>>
>>
>>
>> thanks
>>
>> Sai
>>
>
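
The difference between the two log excerpts quoted above can be checked
mechanically: in the 2.1.16 log the endpoint is first reported with "state
left" and later with "is now DOWN". Below is a rough sketch that scans log
lines for that sequence. The function name and the sample address are made up
for illustration, and it assumes each log entry sits on a single line (unlike
the wrapped text in this mail), so treat it as a starting point only.

```python
import re

# Message fragments taken from the log excerpts quoted in this thread.
LEFT_RE = re.compile(r"Node (/\S+) state left")
DOWN_RE = re.compile(r"InetAddress (/\S+) is now DOWN")


def down_after_left(lines):
    """Return endpoints marked DOWN after they were seen in state left.

    Hypothetical helper: 'lines' is any iterable of unwrapped log lines.
    """
    left = set()
    flagged = []
    for line in lines:
        match = LEFT_RE.search(line)
        if match:
            left.add(match.group(1))
            continue
        match = DOWN_RE.search(line)
        if match and match.group(1) in left and match.group(1) not in flagged:
            flagged.append(match.group(1))
    return flagged


if __name__ == "__main__":
    # 10.0.0.5 is a placeholder address standing in for /X.X.X.X.
    sample = [
        "2017-01-19 19:52:23,687 [GossipStage:1] DEBUG StorageService.java:1883 - "
        "Node /10.0.0.5 state left, tokens [-1112888759032625467]",
        "2017-01-19 19:53:23,702 [GossipTasks:1] DEBUG Gossiper.java:795 - "
        "60000 elapsed, /10.0.0.5 gossip quarantine over",
        "2017-01-19 19:53:50,709 [GossipTasks:1] INFO Gossiper.java:1008 - "
        "InetAddress /10.0.0.5 is now DOWN",
    ]
    print(down_after_left(sample))
```

A log from a version without the bug should produce an empty list, since the
"is now DOWN" line never follows the "state left" line there.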
