[jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

Jackson Chung (JIRA) Thu, 29 Sep 2016 23:56:47 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535240#comment-15535240
 ]


Jackson Chung commented on CASSANDRA-11724:
-------------------------------------------

I believe we ran into this as well (along with CASSANDRA-10371).

No, I don't have a test case, sorry.

To "fix" CASSANDRA-10371, a rolling restart appeared to work (will monitor for a 
couple more days). But for this issue, the FailureDetector JMX attribute shows an 
IP as DOWN even though it was properly decommissioned (no hang, didn't need to do 
removenode). 

getEndpointState (or gossipinfo) shows:
{noformat}
/10.30.10.146
  generation:1459192362
  heartbeat:48032911
  RACK:10:r1
  NET_VERSION:1:7
  LOAD:48032807:8.68526498837E11
  SEVERITY:48032913:0.0
  HOST_ID:2:e96fdd2b-73a0-4579-bc04-3b60a557c2d3
  STATUS:24603149:LEFT,13028853640594434189771438209987024084,1475453013098
  DC:8:DC_OREGON_OFFLINE
  SCHEMA:46105309:db7592b0-5047-3595-bfea-e3efce1aa75f
  RELEASE_VERSION:4:2.0.17
  INTERNAL_IP:6:10.30.10.146
  RPC_ADDRESS:3:10.30.10.146
  TOKENS:15:<hidden>
{noformat}
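
For anyone comparing dumps like the one above across nodes, here is a minimal Python sketch that parses the text and decodes the LEFT status. It assumes each application-state line has the form KEY:version:value, as the lines above appear to, and it assumes the trailing number in the LEFT status is an expiry time in milliseconds since the epoch; that reading is an interpretation, not something confirmed in this thread. The names parse_gossipinfo and left_expiry are hypothetical helpers, not part of Cassandra or nodetool.

```python
from datetime import datetime, timedelta, timezone

# A subset of the endpoint state shown above, copied verbatim.
GOSSIPINFO = """\
/10.30.10.146
  generation:1459192362
  heartbeat:48032911
  STATUS:24603149:LEFT,13028853640594434189771438209987024084,1475453013098
  DC:8:DC_OREGON_OFFLINE
"""

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_gossipinfo(text):
    """Return {endpoint: {key: value}}, dropping each state's version number."""
    endpoints = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("/"):
            # A line such as "/10.30.10.146" starts a new endpoint section.
            current = endpoints.setdefault(line.strip().lstrip("/"), {})
            continue
        if current is None:
            continue
        key, _, rest = line.strip().partition(":")
        if key in ("generation", "heartbeat"):
            # These two lines carry a bare integer, with no version field.
            current[key] = int(rest)
        else:
            # Assumed layout KEY:version:value; the value may contain colons.
            _version, _, value = rest.partition(":")
            current[key] = value
    return endpoints


def left_expiry(status):
    """Return the assumed expiry time of a LEFT status as a UTC datetime.

    Returns None when the status is not LEFT or has no trailing field.
    """
    parts = status.split(",")
    if parts[0] != "LEFT" or len(parts) < 3:
        return None
    return EPOCH + timedelta(milliseconds=int(parts[-1]))


if __name__ == "__main__":
    for endpoint, state in parse_gossipinfo(GOSSIPINFO).items():
        print(endpoint, state["STATUS"].split(",")[0], left_expiry(state["STATUS"]))
```

If the expiry reading is right, an entry that is still listed as LEFT after that time has passed would be worth a closer look than one whose expiry is still in the future.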

> False Failure Detection in Big Cassandra Cluster
> ------------------------------------------------
>
>                 Key: CASSANDRA-11724
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jeffrey F. Lukman
>              Labels: gossip, node-failure
>         Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, 
> Workload4.jpg, experiment-result.txt
>
>
> We are running some tests on Cassandra v2.2.5 stable in a big cluster. In our 
> setup, each machine has 16 cores and runs 8 Cassandra instances, and we test 
> with 32, 64, 128, 256, and 512 Cassandra instances. We use the default number 
> of vnodes for each instance, which is 256. The data and log directories are on 
> an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it reaches a stable 
> condition, and start the other half of the cluster
> Workload3: Start half of the cluster, wait until it reaches a stable 
> condition, load some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it reaches a stable condition, 
> load some data, and decommission one node
> For this testing, we measure the total number of false failure detections 
> inside the cluster. By false failure detection, we mean that, for example, 
> instance-1 marks instance-2 as down, but instance-2 is not down. We dug 
> deeper into the root cause and found that instance-1 had not received any 
> heartbeat from instance-2 for some time because instance-2 was running a long 
> computation.
> Here I attach the graphs of each workload result.
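
To illustrate the false detection described in the quoted report, here is a simplified Python sketch of a phi-style detector. It is not Cassandra's FailureDetector code: the class name SimpleDetector, the exponential inter-arrival model, and the threshold of 8 are all assumptions chosen for illustration. It shows how a peer that is alive but stalled in a long computation can cross the threshold and be marked down.

```python
import math


class SimpleDetector:
    """Toy phi-style detector for a single peer (illustrative only)."""

    def __init__(self, threshold=8.0):
        self.threshold = threshold  # assumed conviction threshold
        self.last = None            # arrival time of the latest heartbeat
        self.intervals = []         # observed gaps between heartbeats

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        """Suspicion level: -log10 of P(silence this long) under an
        exponential model with the observed mean interval."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        return (now - self.last) / mean / math.log(10)

    def is_down(self, now):
        return self.phi(now) > self.threshold


if __name__ == "__main__":
    detector = SimpleDetector()
    # instance-2 sends one heartbeat per second for ten seconds...
    for t in range(11):
        detector.heartbeat(float(t))
    # ...then stalls in a long computation without crashing.
    print(detector.is_down(11.0))  # one second of silence
    print(detector.is_down(40.0))  # thirty seconds of silence
```

In this toy model, instance-2 is never actually down; the detector on instance-1 only sees the silence, so a long enough stall is indistinguishable from a crash.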



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
