[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278685#comment-14278685 ]
Anuj commented on CASSANDRA-8479:
---------------------------------

You are correct, these logs are from 2.0.3. As suggested in CASSANDRA-8352, we upgraded to 2.0.11 and retested; the same issue was observed there. We are using read_repair_chance=1 and dclocal_read_repair_chance=0. When we set read_repair_chance=0, killing a node in either the local or the remote DC no longer led to any read failures :)

We need your help in understanding the following points:

1. We are using strong consistency, i.e. LOCAL_QUORUM for both reads and writes. So even if one of the replicas has an obsolete value, we will read the latest value the next time we read the data. Does that mean read_repair_chance=1 is not required when LOCAL_QUORUM is used for both reads and writes? Should we set read_repair_chance=0, which would give us better performance without sacrificing consistency? What is your recommendation?

2. We are writing to Cassandra at high rates. Is that the reason we are getting digest mismatches during read repair? And is that when Cassandra goes to CL.ALL, irrespective of the fact that we are using CL.LOCAL_QUORUM?

3. I think read repair is also comparing digests from replicas in the remote DC? Isn't that a performance hit? We are currently using Cassandra in active-passive mode, so updating the remote DC quickly is not a priority. What is recommended? I tried setting dclocal_read_repair_chance=1 and read_repair_chance=0 in order to make sure that read repairs are only executed within the DC, and I noticed that killing a local node didn't cause any read failures. Does that mean the digest mismatch occurs against a node in the remote DC rather than against the third local node that didn't participate in the LOCAL_QUORUM read?

4. The documentation at http://www.datastax.com/docs/1.1/configuration/storage_configuration says that read_repair_chance specifies the probability with which read repairs should be invoked on "non-quorum reads". What is the significance of "non-quorum" here?
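For context, both probabilities discussed above are per-table options that can be changed with a CQL ALTER TABLE; a minimal sketch of dialing them down, assuming a hypothetical keyspace and table name:

```sql
-- Hypothetical keyspace/table names; substitute your own schema.
-- Disable cross-DC (global) read repair entirely, but keep a small
-- chance of read repair confined to the local data center.
ALTER TABLE my_keyspace.my_table
  WITH read_repair_chance = 0.0
  AND dclocal_read_repair_chance = 0.1;
```

Run via cqlsh; with these settings, any background repair triggered by a read only contacts replicas in the coordinator's own DC rather than blocking on remote-DC digests.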
We are using LOCAL_QUORUM, yet read repair still comes into the picture.

Yes, we misunderstood Tracing. Now that you have identified the issue, do you still need the tracing output?

> Timeout Exception on Node Failure in Remote Data Center
> -------------------------------------------------------
>
>                 Key: CASSANDRA-8479
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8479
>             Project: Cassandra
>          Issue Type: Bug
>          Components: API, Core, Tools
>         Environment: Unix, Cassandra 2.0.11
>            Reporter: Amit Singh Chowdhery
>            Assignee: Sam Tunnicliffe
>            Priority: Minor
>         Attachments: TRACE_LOGS.zip
>
>
> Issue Faced:
> We have a geo-redundant setup with 2 data centers having 3 nodes each. When we bring a single Cassandra node down in DC2 by kill -9 <Cassandra-pid>, reads fail on DC1 with TimedOutException for a brief amount of time (~15-20 sec).
>
> Reference:
> A ticket has already been opened and resolved; the link is provided below:
> https://issues.apache.org/jira/browse/CASSANDRA-8352
>
> Activity done as per the resolution provided:
> Upgraded to Cassandra 2.0.11.
> We have two 3-node clusters in two different DCs, and if one or more of the nodes go down in one data center, ~5-10% traffic failure is observed on the other.
>
> CL: LOCAL_QUORUM
> RF: 3

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)