Jaydeepkumar Chovatia created CASSANDRA-13740:
-------------------------------------------------

             Summary: Orphan hint file gets created while node is being removed 
from cluster
                 Key: CASSANDRA-13740
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13740
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Jaydeepkumar Chovatia
             Fix For: 3.0.15
         Attachments: gossip_hang_test.py

While testing, I found a new issue: whenever a node is being removed from the 
cluster, a hint file for that node gets written and stays inside the hint 
directory forever. I debugged the code and found that this is due to a race 
condition between [HintsWriteExecutor.java::flush | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsWriteExecutor.java#L195]
 and [HintsWriteExecutor.java::closeWriter | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsWriteExecutor.java#L106].
 
*Time t1* The node is down; as a result, hints are being written by 
[HintsWriteExecutor.java::flush | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsWriteExecutor.java#L195]
*Time t2* The node is removed from the cluster; as a result, 
[HintsService.java::exciseStore | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsService.java#L327]
 is called, which removes the hint files for the node being removed
*Time t3* The mutation stage keeps pumping hints through [HintsService.java::write | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsService.java#L145],
 which again calls [HintsWriteExecutor.java::flush | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsWriteExecutor.java#L215],
 and a new orphan file gets created
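
The t1/t2/t3 ordering above can be replayed with a minimal sketch. This is not 
the Cassandra code: the class {{OrphanHintRaceSketch}}, its in-memory map 
standing in for the hint directory, and its {{flush}}/{{excise}} methods are 
all hypothetical stand-ins that only mimic "flush creates the file if it is 
missing" and "excise deletes the file".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class OrphanHintRaceSketch {
    // Hypothetical stand-in for the hint directory: host ID -> hints stored in that host's file.
    private final Map<UUID, Integer> hintsDirectory = new HashMap<>();

    // Stand-in for flush: creates the host's hint file if it is missing, then appends to it.
    void flush(UUID hostId, int hintCount) {
        hintsDirectory.merge(hostId, hintCount, Integer::sum);
    }

    // Stand-in for excise: deletes the host's hint file.
    void excise(UUID hostId) {
        hintsDirectory.remove(hostId);
    }

    boolean hasHintFile(UUID hostId) {
        return hintsDirectory.containsKey(hostId);
    }

    public static void main(String[] args) {
        OrphanHintRaceSketch node = new OrphanHintRaceSketch();
        UUID removedHost = UUID.randomUUID();

        node.flush(removedHost, 10); // t1: node is down, hints are flushed for it
        node.excise(removedHost);    // t2: node is removed, its hint files are deleted
        node.flush(removedHost, 5);  // t3: a late flush recreates a file nobody will clean up

        System.out.println("orphan file present: " + node.hasHintFile(removedHost));
    }
}
```

Nothing in the sketch remembers that the host was excised, so the late flush at 
t3 has no reason to skip it.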

I was writing a new dtest for {CASSANDRA-13562, CASSANDRA-13308}, and that 
helped me reproduce this new bug. I will submit a patch for this new dtest later.

I also tried the following to check how this orphan hint file responds:
1. I tried {{nodetool truncatehints <node>}}, but it fails because the node is 
no longer part of the ring.
2. I then tried {{nodetool truncatehints}}, but that still doesn’t remove the 
hint file because it is not yet included in the [dispatchDequeue | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsStore.java#L53].


Steps to reproduce:
The attached dtest Python file {{gossip_hang_test.py}} reproduces this bug.

Solution:
This is due to the race condition described above. Since 
{{HintsWriteExecutor.java}} creates a thread pool with only one worker, the 
solution becomes fairly simple. Whenever we [HintsService.java::excise | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsService.java#L303]
 a host, just store it in memory, and check for an already evicted host inside 
[HintsWriteExecutor.java::flush | 
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/hints/HintsWriteExecutor.java#L215].
 If an already evicted host is found, ignore its hints.
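
A rough sketch of that idea, under stated assumptions: 
{{ExcisedHostFilterSketch}}, its {{excisedHosts}} set, and the in-memory map 
used in place of the hint directory are hypothetical and are not the actual 
patch. It relies only on the single-worker property mentioned above: every task 
runs on one thread in submission order, so plain collections are enough.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExcisedHostFilterSketch {
    // One worker thread (daemon, so a forgotten shutdown cannot keep the JVM alive).
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "hints-writer-sketch");
        t.setDaemon(true);
        return t;
    });

    // Both collections are touched only from the single worker thread.
    private final Map<UUID, Integer> hintsDirectory = new HashMap<>(); // stand-in for hint files
    private final Set<UUID> excisedHosts = new HashSet<>();            // hosts already evicted

    void flush(UUID hostId, int hintCount) {
        executor.submit(() -> {
            if (excisedHosts.contains(hostId)) {
                return; // host was already evicted: ignore its hints instead of creating a file
            }
            hintsDirectory.merge(hostId, hintCount, Integer::sum);
        });
    }

    void excise(UUID hostId) {
        executor.submit(() -> {
            excisedHosts.add(hostId);      // remember the eviction first
            hintsDirectory.remove(hostId); // then delete the host's hint file
        });
    }

    // Blocks until every previously submitted task has run, then reports whether a file exists.
    boolean hasHintFile(UUID hostId) {
        Callable<Boolean> check = () -> hintsDirectory.containsKey(hostId);
        try {
            return executor.submit(check).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

With this filter, replaying flush, excise, flush for the same host leaves no 
file behind. The sketch never clears {{excisedHosts}}; how long an entry should 
be kept is left open here.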

Jaydeep



