[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278685#comment-14278685 ]
Anuj commented on CASSANDRA-8479:
---------------------------------

You are correct, these logs are from 2.0.3. As suggested in CASSANDRA-8352, we upgraded to 2.0.11 and retested; the same issue was observed there. We are using read_repair_chance=1 and dclocal_read_repair_chance=0. When we set read_repair_chance=0, killing a node in either the local or the remote DC no longer led to any read failures :)

We need your help in understanding the following points:

1. We are using strong consistency, i.e. LOCAL_QUORUM for both reads and writes. So even if one of the replicas has an obsolete value, we will read the latest value the next time we read the data. Does that mean read_repair_chance=1 is not required when LOCAL_QUORUM is used for both reads and writes? Should we set read_repair_chance=0, which would give us better performance without sacrificing consistency? What is your recommendation?

2. We are writing to Cassandra at high rates. Is that the reason we are getting digest mismatches during read repair? And is that when Cassandra goes to CL.ALL, irrespective of the fact that we are using CL.LOCAL_QUORUM?

3. I think read repair is also comparing digests from replicas in the remote DC? Isn't that a performance hit? We are currently using Cassandra in active-passive mode, so updating the remote DC quickly is not a priority. What is recommended? I tried setting dclocal_read_repair_chance=1 and read_repair_chance=0 in order to make sure that read repairs are only executed within the DC, and I noticed that killing a local node didn't cause any read failures. Does that mean the digest mismatch occurs against a node in the remote DC rather than against the third local node that didn't participate in the LOCAL_QUORUM read?

4. The documentation at http://www.datastax.com/docs/1.1/configuration/storage_configuration says that read_repair_chance specifies the probability with which read repairs should be invoked on "non-quorum reads". What is the significance of "non-quorum" here?
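For context, both probabilities discussed above are per-table options that can be changed with a CQL ALTER TABLE; a minimal sketch of dialing them down, assuming a hypothetical keyspace and table name:

```sql
-- Hypothetical keyspace/table names; substitute your own schema.
-- Disable cross-DC (global) read repair entirely, but keep a small
-- chance of read repair confined to the local data center.
ALTER TABLE my_keyspace.my_table
  WITH read_repair_chance = 0.0
  AND dclocal_read_repair_chance = 0.1;
```

Run via cqlsh; with these settings, any background repair triggered by a read only contacts replicas in the coordinator's own DC rather than blocking on remote-DC digests.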
We are using LOCAL_QUORUM, yet read repair still comes into the picture.

Yes, we misunderstood Tracing. Now that you have identified the issue, do you still need the tracing output?

> Timeout Exception on Node Failure in Remote Data Center
> -------------------------------------------------------
>
>                 Key: CASSANDRA-8479
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8479
>             Project: Cassandra
>          Issue Type: Bug
>          Components: API, Core, Tools
>         Environment: Unix, Cassandra 2.0.11
>            Reporter: Amit Singh Chowdhery
>            Assignee: Sam Tunnicliffe
>            Priority: Minor
>         Attachments: TRACE_LOGS.zip
>
>
> Issue Faced:
> We have a geo-redundant setup with 2 data centers having 3 nodes each. When we bring a single Cassandra node down in DC2 by kill -9 <Cassandra-pid>, reads fail on DC1 with TimedOutException for a brief amount of time (~15-20 sec).
>
> Reference:
> A ticket has already been opened and resolved; the link is provided below:
> https://issues.apache.org/jira/browse/CASSANDRA-8352
>
> Activity done as per the resolution provided:
> Upgraded to Cassandra 2.0.11.
> We have two 3-node clusters in two different DCs, and if one or more of the nodes go down in one data center, ~5-10% traffic failure is observed on the other.
>
> CL: LOCAL_QUORUM
> RF: 3

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)