[jira] [Commented] (CASSANDRA-13562) nodes in cluster gets into split-brain mode

2017-07-11 Thread Jaydeepkumar Chovatia (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083476#comment-16083476
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-13562:
---

I analyzed a stack trace taken while Cassandra was in split-brain mode. The 
Gossiper thread is stuck forever at the point shown below, waiting for 
HintsDispatchExecutor.java to complete, while the HintsDispatchExecutor.java 
executor thread is blocked delivering hints to the node being removed. The two 
end up in a deadlock, and that is the reason behind the split brain. 

{quote}
"GossipStage:1" #310
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xab000720> (a java.util.concurrent.FutureTask)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
    at java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at org.apache.cassandra.hints.HintsDispatchExecutor.completeDispatchBlockingly(HintsDispatchExecutor.java:112)
    at org.apache.cassandra.hints.HintsService.excise(HintsService.java:323)
    at org.apache.cassandra.service.StorageService.excise(StorageService.java:2265)
    at org.apache.cassandra.service.StorageService.excise(StorageService.java:2278)
    at org.apache.cassandra.service.StorageService.handleStateRemoving(StorageService.java:2234)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1690)
    at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2474)
    at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1060)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1143)
    at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:76)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
    at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/1527007086.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:745)
{quote}


Here are the steps to reproduce:
1. Create a Cassandra 3.0.13 cluster with a few nodes (say 5 nodes).
2. Set {{hinted_handoff_throttle_in_kb}} to 1 so that hint delivery takes a 
long time; {{removenode}} must be issued while hints are still in progress to 
reproduce this issue.
3. Start a load on the cluster, specifically write traffic.
4. Deliberately shut down one node and let hints build up.
5. Restart that node briefly and make sure all nodes are in UN state. Wait 
30 seconds to 1 minute so that {{HintsDispatchExecutor.java}} starts 
dispatching hints to the node.
6. Kill Cassandra on that node again.
7. Try removing the down node using {{nodetool removenode force}} or 
{{nodetool assassinate}}. At this point, check {{nodetool status}} on each node 
and you will see they are in split-brain mode because the Gossip thread is 
stuck. The only way out of this situation is to restart Cassandra.

The fix for this problem is to call {{future.cancel}}. Upon further 
investigation I found that this has already been fixed as part of 
CASSANDRA-13308. I tried reproducing the issue with 3.0.14 and it no longer 
reproduces there.
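
To illustrate the blocking pattern described above, here is a minimal, self-contained sketch. It is not Cassandra code and not the CASSANDRA-13308 patch; the class and method names ({{ExciseDeadlockSketch}}, {{startStuckDispatch}}, {{exciseCompletes}}) are hypothetical. It assumes a single-threaded "gossip" executor that waits on the future of a "dispatch" task which never finishes, and shows how cancelling that future before waiting lets the gossip thread move on.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExciseDeadlockSketch {

    // Hypothetical stand-in for a hint dispatch that cannot make progress
    // because its target node is down: it blocks until interrupted.
    static Future<?> startStuckDispatch(ExecutorService dispatch, CountDownLatch started) {
        return dispatch.submit(() -> {
            started.countDown();
            try {
                new CountDownLatch(1).await(); // never counted down
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Runs a toy "excise" on a single-threaded gossip executor and reports
    // whether it finished within timeoutMillis.
    static boolean exciseCompletes(boolean cancelFirst, long timeoutMillis) {
        ExecutorService dispatch = Executors.newSingleThreadExecutor();
        ExecutorService gossip = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch started = new CountDownLatch(1);
            Future<?> dispatchFuture = startStuckDispatch(dispatch, started);
            started.await(); // make sure the dispatch is already running

            Future<?> excise = gossip.submit(() -> {
                if (cancelFirst) {
                    dispatchFuture.cancel(true); // interrupt the stuck dispatch
                }
                try {
                    dispatchFuture.get(); // the blocking wait seen in the stack trace
                } catch (CancellationException | ExecutionException e) {
                    // dispatch was cancelled or failed; excise can proceed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            try {
                excise.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // gossip thread is still parked on the future
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            dispatch.shutdownNow();
            gossip.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cancel: completed=" + exciseCompletes(false, 500));
        System.out.println("with cancel:    completed=" + exciseCompletes(true, 500));
    }
}
```

Without the cancel, the toy gossip thread stays parked in {{FutureTask.get}} until the executors are shut down, which mirrors the stack trace above. With the cancel, {{get}} returns immediately via {{CancellationException}}.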


> nodes in cluster gets into split-brain mode
> ---
>
> Key: CASSANDRA-13562
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13562
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Jaydeepkumar Chovatia
> Fix For: 3.0.x
>
>
> We have seen nodes in a Cassandra (3.0.11) ring get into split-brain somehow. 
> We don't know the exact steps to reproduce, but here is our observation:
> Let's assume we have a 5-node cluster n1,n2,n3,n4,n5. When this bug occurs and 
> we run nodetool status on each node, each one has a different view of which 
> node is DN, e.g.
> n1 sees n3 as DN and other nodes are UN
> n3 sees n4 as DN and other nodes are UN
> n4 sees n5 as DN and other nodes are UN and so on...
> One thing we have observed is that once a network link is broken and restored, 
> nodes sometimes go into this split-brain mode, but we still don't have exact 
> steps to reproduce.
> Please let us know if I am missing anything specific here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (CASSANDRA-13562) nodes in cluster gets into split-brain mode

2017-07-04 Thread ZhaoYang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074128#comment-16074128
 ] 

ZhaoYang commented on CASSANDRA-13562:
--

You may want to check `phi_convict_threshold` in cassandra.yaml and turn on 
logging of the phi value in gossip. That should give you a better idea of why 
one C* node considers another node to be down.
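
As background for the suggestion above, here is a simplified sketch of how a phi accrual failure detector turns heartbeat silence into a phi value that is compared with a threshold such as `phi_convict_threshold`. It assumes an exponential model of heartbeat inter-arrival times and a fixed mean interval; the class and method names (`PhiSketch`, `phi`, `shouldConvict`) are hypothetical, and Cassandra's real failure detector differs in its details (for example, it tracks a window of observed intervals).

```java
public class PhiSketch {

    // Under an exponential model, the probability that a heartbeat arrives
    // later than t is exp(-t / mean). Phi is -log10 of that probability,
    // which simplifies to t / (mean * ln 10).
    static double phi(double millisSinceLastHeartbeat, double meanIntervalMillis) {
        return millisSinceLastHeartbeat / (meanIntervalMillis * Math.log(10.0));
    }

    // A peer is marked down once phi exceeds the configured threshold.
    static boolean shouldConvict(double millisSinceLastHeartbeat,
                                 double meanIntervalMillis,
                                 double threshold) {
        return phi(millisSinceLastHeartbeat, meanIntervalMillis) > threshold;
    }

    public static void main(String[] args) {
        double mean = 1000.0;   // assumed mean heartbeat interval in ms
        double threshold = 8.0; // value shipped as the default in cassandra.yaml
        for (double silence : new double[] {1000.0, 10000.0, 20000.0}) {
            System.out.printf("silence=%.0f ms phi=%.3f convict=%b%n",
                    silence, phi(silence, mean), shouldConvict(silence, mean, threshold));
        }
    }
}
```

In this model a higher threshold tolerates longer silence before a peer is marked down, so logged phi values show how close each peer comes to conviction.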
