[ https://issues.apache.org/jira/browse/CASSANDRA-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Akhtar Hussain reopened CASSANDRA-8352:
---------------------------------------

An immediate upgrade to 2.0.11 is not currently possible for us. Moreover, we are not certain whether this is an issue with the Cassandra version or a problem with our setup. I would appreciate it if you could try to reproduce the issue on Cassandra 2.0.3. We would also like you to recheck our configuration. We are using the private IP for rpc_address and the public IP for seeds and listen_address. Is this configuration OK?

It is very strange that, in spite of using LOCAL_QUORUM for reads, we are getting org.apache.cassandra.thrift.TimedOutException: null in our application logs. We are also getting a read timeout exception in the Cassandra logs, as only 5 out of 6 nodes responded when we killed one node. The exception in the Cassandra logs would be acceptable if we did not also get an exception in Thrift. Please analyse the stack trace we shared.

Steps to Reproduce:
1. Set up two DCs with 3 nodes each.
2. cassandra.yaml:
   a. seeds = public host names of the 6 nodes (as configured in /etc/hosts)
   b. listen_address = public host name of the node
   c. rpc_address = private host name as configured in /etc/hosts
   d. Using vnodes
3. cassandra-topology.properties:
   host2_pub=DC1:RAC1
   host3_pub=DC1:RAC1
   host1_pub=DC1:RAC1
   geo1_host=DC2:RAC1
   geo2_host=DC2:RAC1
   geo3_host=DC2:RAC1
   default=DC1:RAC1 (for DC1 nodes) / default=DC2:RAC1 (for DC2 nodes)
   host<n>_pub = public hostname
   geo<n>_host = public hostname of nodes in the remote DC
4. Keyspace configuration:
   CREATE KEYSPACE vs WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC2': '3', 'DC1': '3' };
5. Run traffic of 200 read requests/sec on DC1.
6. Go to one node of DC2 and run kill -9 <cassandra pid>.
7. Read requests on DC1 fail temporarily.
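For reference, the expectation behind the LOCAL_QUORUM question can be written out as arithmetic. The following is only a minimal sketch of the quorum formula (floor(RF / 2) + 1) applied to the keyspace above. The function names are made up for illustration and are not part of Cassandra or Hector.

```python
# Illustrative only: quorum arithmetic for the 'vs' keyspace described above.
# These helpers are hypothetical and are not a Cassandra or Hector API.

def quorum(replication_factor: int) -> int:
    """Quorum size for a given replication factor: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_replies(level: str, local_dc: str, replication: dict) -> int:
    """Number of replica replies a coordinator waits for at this level."""
    if level == "LOCAL_QUORUM":
        # Only replicas in the coordinator's own data center are counted.
        return quorum(replication[local_dc])
    if level == "QUORUM":
        # Replicas in all data centers are counted together.
        return quorum(sum(replication.values()))
    raise ValueError(f"level not covered by this sketch: {level}")


# Replication settings from the CREATE KEYSPACE statement in step 4.
replication = {"DC1": 3, "DC2": 3}

local = required_replies("LOCAL_QUORUM", "DC1", replication)
overall = required_replies("QUORUM", "DC1", replication)
print(f"LOCAL_QUORUM in DC1 waits for {local} replies")
print(f"QUORUM waits for {overall} replies across both DCs")
```

With RF 3 in each DC, a LOCAL_QUORUM read in DC1 should need only 2 replies from DC1 replicas, which is why a failed node in DC2 is not expected to cause timeouts on its own.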
> Timeout Exception on Node Failure in Remote Data Center
> -------------------------------------------------------
>
>                 Key: CASSANDRA-8352
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8352
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Unix, Cassandra 2.0.3
>            Reporter: Akhtar Hussain
>              Labels: DataCenter, GEO-Red
>
> We have a Geo-red setup with 2 data centers having 3 nodes each. When we
> bring down a single Cassandra node in DC2 with kill -9 <Cassandra-pid>,
> reads fail on DC1 with TimedOutException for a brief amount of time
> (about 15-20 sec).
> Questions:
> 1. We need to understand why reads fail on DC1 when a node in another DC,
> i.e. DC2, fails. As we are using LOCAL_QUORUM for both reads and writes in
> DC1, a request should return once 2 nodes in the local DC have replied,
> instead of timing out because of a node in the remote DC.
> 2. We want to make sure that no Cassandra requests fail in case of node
> failures. We tried rapid read protection with ALWAYS/99percentile/10ms as
> described in
> http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2,
> but nothing worked. How can we ensure zero request failures when a node fails?
> 3. What is the right way of handling HTimedOutException exceptions in
> Hector?
> 4. Please confirm whether we are using the public and private hostnames as
> expected.
> We are using Cassandra 2.0.3.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)