[jira] [Reopened] (CASSANDRA-8352) Timeout Exception on Node Failure in Remote Data Center

Akhtar Hussain (JIRA) Tue, 25 Nov 2014 03:17:38 -0800

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Akhtar Hussain reopened CASSANDRA-8352:
---------------------------------------

Currently, it’s not possible for us to upgrade to 2.0.11 immediately. 
Moreover, we are not certain whether this is an issue with the Cassandra 
version or a problem with our setup. 

I would appreciate it if you could try to reproduce the issue on Cassandra 
2.0.3. We would also like you to recheck our configuration. We are using the 
private IP for rpc_address and the public IP for seeds and listen_address. Is 
this configuration OK? 

It’s very strange that, in spite of using LOCAL_QUORUM for reads, we are getting 
org.apache.cassandra.thrift.TimedOutException: null in our application logs. We 
are also getting read timeout exceptions in the Cassandra logs, as only 5 out of 
6 nodes responded when we killed one node. The exception in the Cassandra logs 
would be acceptable if it did not also surface as an exception in Thrift. 
Please analyse the stack trace we shared.

Steps to Reproduce:
1.      Set up two DCs with 3 nodes each
2.      cassandra.yaml:
a.      seeds = public host names of all 6 nodes (as configured in /etc/hosts)
b.      listen_address = public host name of the node
c.      rpc_address = private host name as configured in /etc/hosts
d.      Using vnodes
3.      cassandra-topology.properties:
host2_pub=DC1:RAC1
host3_pub=DC1:RAC1
host1_pub=DC1:RAC1
geo1_host=DC2:RAC1
geo2_host=DC2:RAC1
geo3_host=DC2:RAC1
default=DC1:RAC1 (for DC1 nodes) / default=DC2:RAC1 (for DC2 nodes)

host<n>_pub= public hostname
geo<n>_host= public hostname of nodes in remote DC

4.      Keyspace configuration
CREATE KEYSPACE vs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC2': '3',
  'DC1': '3'
};
5.      Run traffic of 200 read requests/sec on DC1. 

6.      Go to one node of DC2 and do kill -9 <cassandra pid>

7.      Read requests on DC1 fail temporarily.
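
For reference, the replica counts implied by the keyspace in step 4 can be 
worked out as below. This is only a minimal sketch, assuming the usual quorum 
formula floor(RF / 2) + 1; the function and variable names are illustrative and 
are not part of any Cassandra or Hector API.

```python
# Replication factors taken from the keyspace definition in step 4.
replication = {"DC1": 3, "DC2": 3}


def quorum(rf):
    """Replicas that must respond for a quorum: floor(rf / 2) + 1."""
    return rf // 2 + 1


def local_quorum(replication, local_dc):
    """LOCAL_QUORUM counts only replicas in the coordinator's own DC."""
    return quorum(replication[local_dc])


def global_quorum(replication):
    """QUORUM counts replicas across all DCs."""
    return quorum(sum(replication.values()))


def satisfiable(required, live_replicas):
    """True if enough live replicas remain to meet the requirement."""
    return live_replicas >= required


# Step 6: one DC2 node is killed, all DC1 nodes stay up.
# With RF=3 on 3 nodes per DC, every node is a replica for every key,
# so live replicas per DC equal live nodes per DC.
live = {"DC1": 3, "DC2": 2}

need_local = local_quorum(replication, "DC1")
need_global = global_quorum(replication)

print("LOCAL_QUORUM in DC1 needs", need_local, "of", live["DC1"], "live local replicas")
print("QUORUM needs", need_global, "of", sum(live.values()), "live replicas")
```

By this arithmetic a LOCAL_QUORUM read coordinated in DC1 needs only 2 of the 3 
local replicas, so the loss of one DC2 node should not, by itself, make the 
read unsatisfiable.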


> Timeout Exception on Node Failure in Remote Data Center
> -------------------------------------------------------
>
>                 Key: CASSANDRA-8352
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8352
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Unix, Cassandra 2.0.3
>            Reporter: Akhtar Hussain
>              Labels: DataCenter, GEO-Red
>
> We have a Geo-red setup with 2 data centers having 3 nodes each. When we 
> bring down a single Cassandra node in DC2 by kill -9 <Cassandra-pid>, 
> reads fail on DC1 with TimedOutException for a brief amount of time 
> (approximately 15-20 sec). 
> Questions:
> 1.    We need to understand why reads fail on DC1 when a node in another DC, 
> i.e. DC2, fails. As we are using LOCAL_QUORUM for both reads and writes in 
> DC1, a request should return once 2 nodes in the local DC have replied, 
> instead of timing out because of a node in the remote DC.
> 2.    We want to make sure that no Cassandra requests fail in case of node 
> failures. We tried rapid read protection settings of ALWAYS, 99percentile 
> and 10ms, as mentioned in 
> http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2, 
> but none of them helped. How can we ensure zero request failures when a 
> node fails?
> 3.    What is the right way of handling HTimedOutException in Hector?
> 4.    Please confirm whether we are using public and private hostnames as 
> expected.
> We are using Cassandra 2.0.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
