[jira] [Updated] (CASSANDRA-13308) Hint files not being deleted on nodetool decommission

Arijit (JIRA) Wed, 08 Mar 2017 02:40:18 -0800

     [ https://issues.apache.org/jira/browse/CASSANDRA-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Arijit updated CASSANDRA-13308:
-------------------------------
    Description: 
How to reproduce the issue I'm seeing:
Shut down Cassandra on one node of the cluster and wait until we accumulate a 
ton of hints. Start Cassandra on the node and immediately run "nodetool 
decommission" on it.

The node streams its replicas and marks itself as DECOMMISSIONED, but other 
nodes do not seem to see this message. "nodetool status" shows the 
decommissioned node in state "UL" on all other nodes (it is also present in 
system.peers), and Cassandra logs show that gossip tasks on nodes are not 
proceeding (number of pending tasks keeps increasing). Jstack suggests that a 
gossip task is blocked on hints dispatch (I can provide traces if this is not 
obvious). Because the cluster is large and there are a lot of hints, hint 
dispatch is taking a long time to complete.

On inspecting "/var/lib/cassandra/hints" on the nodes, I see a bunch of hint 
files for the decommissioned node. Documentation seems to suggest that these 
hints should be deleted during "nodetool decommission", but it does not seem to 
be the case here. This is the bug being reported.
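
A minimal sketch for checking this on each node. It assumes hint files are 
named "<host-id>-<timestamp>-<version>.hints" (believed to be the 3.0 layout, 
not verified here); the helper names are hypothetical. It only reads the 
directory listing and never deletes anything.

```python
import os
import re
from collections import Counter

# Assumption: hint files are named <host-id>-<timestamp>-<version>.hints,
# where <host-id> is the UUID shown in the "Host ID" column of
# "nodetool status". Any file that does not match is ignored.
HINT_FILE = re.compile(
    r"^([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-(\d+)-(\d+)\.hints$",
    re.IGNORECASE,
)


def count_hints_by_host(names):
    """Count hint files per target host ID from a list of file names."""
    counts = Counter()
    for name in names:
        match = HINT_FILE.match(name)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def scan_hints_dir(path="/var/lib/cassandra/hints"):
    """Read-only scan of a hints directory (the path is the one from this report)."""
    return count_hints_by_host(os.listdir(path))
```

Calling scan_hints_dir() on a node and looking up the decommissioned node's 
host ID in the result shows whether hint files for it are still on disk.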

When I try to recover from this scenario by manually deleting the hint files 
on the nodes, the hints dispatcher threads throw a bunch of exceptions and the 
decommissioned node moves to state "DL" (perhaps it missed some gossip 
messages?). The node is still in my "system.peers" table.

Restarting Cassandra on all nodes after this step does not fix the issue (the 
node remains in the peers table). In fact, after this point the decommissioned 
node is in state "DN".

  was:
How to reproduce the issue I'm seeing:
Shut down Cassandra on one node of the cluster and wait until we accumulate a 
ton of hints. Start Cassandra on the node and immediately run "nodetool 
decommission" on it.

The node streams its replicas and marks itself has DECOMMISSIONED, but other 
nodes do not seem to see this message. "nodetool status" shows the 
decommissioned node in state "UL", and Cassandra logs show that this is because 
gossip tasks on nodes are blocked. Jstack shows that the tasks are blocked on 
hints dispatch (I can provide traces if this is not obvious). Because the 
cluster is large and there are a lot of hints, this is taking a while. 

On inspecting "/var/lib/cassandra/hints" on the nodes, I see a bunch of hint 
files for the decommissioned node. Documentation seems to suggest that these 
hints should be deleted during "nodetool decommission".

When I manually delete hint files on the nodes, the hints dispatcher threads 
throw a bunch of exceptions and the decommissioned node is now in state "DL" 
(perhaps it missed some gossip messages?). The node is still in my 
"system.peers" table

Restarting Cassandra on all nodes after this step does not fix the issue (the 
node remains in the peers table). In fact, after this point the decommissioned 
node is in state "DN"


> Hint files not being deleted on nodetool decommission
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13308
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13308
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Using Cassandra version 3.0.9
>            Reporter: Arijit
>            Priority: Minor
>
> How to reproduce the issue I'm seeing:
> Shut down Cassandra on one node of the cluster and wait until we accumulate a 
> ton of hints. Start Cassandra on the node and immediately run "nodetool 
> decommission" on it.
> The node streams its replicas and marks itself as DECOMMISSIONED, but other 
> nodes do not seem to see this message. "nodetool status" shows the 
> decommissioned node in state "UL" on all other nodes (it is also present in 
> system.peers), and Cassandra logs show that gossip tasks on nodes are not 
> proceeding (number of pending tasks keeps increasing). Jstack suggests that a 
> gossip task is blocked on hints dispatch (I can provide traces if this is not 
> obvious). Because the cluster is large and there are a lot of hints, hint 
> dispatch is taking a long time to complete.
> On inspecting "/var/lib/cassandra/hints" on the nodes, I see a bunch of hint 
> files for the decommissioned node. Documentation seems to suggest that these 
> hints should be deleted during "nodetool decommission", but it does not seem 
> to be the case here. This is the bug being reported.
> When I try to recover from this scenario by manually deleting the hint files 
> on the nodes, the hints dispatcher threads throw a bunch of exceptions and 
> the decommissioned node moves to state "DL" (perhaps it missed some gossip 
> messages?). The node is still in my "system.peers" table.
> Restarting Cassandra on all nodes after this step does not fix the issue (the 
> node remains in the peers table). In fact, after this point the 
> decommissioned node is in state "DN".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
