[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2015-06-29 Thread Sam Tunnicliffe (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605231#comment-14605231 ]

Sam Tunnicliffe commented on CASSANDRA-8479:


[~eanujwa] the digest requests were sent to nodes in the remote DC because of 
the {{read_repair_chance}} setting on the table. Read repair is orthogonal to 
the consistency level specified for the request, so the fact that the client 
request was using {{LOCAL_QUORUM}} has no bearing here. The CL determines which 
and how many replica responses the coordinator will wait for before returning 
to the client; it doesn't have any effect on which replicas are sent digest 
requests when a global read repair is triggered (and it cannot: by definition, 
*global* read repair implies *all* replicas). There is ongoing discussion on 
CASSANDRA-6887 about whether LOCAL CLs should influence the replica set for 
global read repair. Rather than re-opening this, perhaps you could add your 
voice to that conversation.
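The distinction above can be sketched in a few lines of Python. This is a hypothetical, simplified model of coordinator behaviour for illustration only (the function and node names are invented, not Cassandra internals):

```python
import random

def plan_read(local_replicas, remote_replicas, cl_block_for, read_repair_chance):
    """Simplified model: the CL only fixes how many responses the
    coordinator blocks for, while read_repair_chance independently
    decides whether ALL replicas (in every DC) get digest requests."""
    if random.random() < read_repair_chance:
        # Global read repair triggered: digests go to every replica,
        # including the nodes in the remote DC.
        contacted = local_replicas + remote_replicas
    else:
        # Normal read: only as many local replicas as the CL requires.
        contacted = local_replicas[:cl_block_for]
    return contacted

# LOCAL_QUORUM with RF=3 blocks for 2 responses, yet with
# read_repair_chance=1.0 every read still contacts the remote DC.
contacted = plan_read(["dc1-a", "dc1-b", "dc1-c"],
                      ["dc2-a", "dc2-b", "dc2-c"],
                      cl_block_for=2, read_repair_chance=1.0)
print(contacted)  # all six replicas, despite LOCAL_QUORUM
```

With read_repair_chance=0 the remote DC drops out of the read path entirely, which is why setting it to 0 made the failures disappear.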

 Timeout Exception on Node Failure in Remote Data Center
 ---

 Key: CASSANDRA-8479
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8479
 Project: Cassandra
  Issue Type: Bug
  Components: API, Core, Tools
 Environment: Unix, Cassandra 2.0.11
Reporter: Amit Singh Chowdhery
Assignee: Sam Tunnicliffe
Priority: Minor
 Attachments: TRACE_LOGS.zip


 Issue Faced :
 We have a geo-redundant setup with 2 data centers having 3 nodes each. When we 
 bring down a single Cassandra node in DC2 by kill -9 on the Cassandra pid, 
 reads fail on DC1 with TimedOutException for a brief period (~15-20 sec).
 Reference :
 A ticket has already been opened and resolved; the link is provided below:
 https://issues.apache.org/jira/browse/CASSANDRA-8352
 Activity Done as per Resolution Provided :
 Upgraded to Cassandra 2.0.11.
 We have two 3-node clusters in two different DCs, and if one or more of the 
 nodes go down in one data center, ~5-10% traffic failure is observed on the 
 other.
 CL: LOCAL_QUORUM
 RF=3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2015-06-28 Thread Anuj Wadehra (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604787#comment-14604787 ]

Anuj Wadehra commented on CASSANDRA-8479:
-

@Sam I think it's an issue. As you mentioned in your comment: "At least one of 
the digests doesn't match, triggering a blocking full read against all the 
replicas that were sent digest requests - which includes the down node in the 
remote DC."

A blocking full read must be triggered ONLY against the replicas that were sent 
digest requests. Why were digest requests sent to the remote DC when the read 
CL was LOCAL_QUORUM? This seems to be a major problem.

Should I reopen the JIRA?



[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2015-01-15 Thread Sam Tunnicliffe (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278846#comment-14278846 ]

Sam Tunnicliffe commented on CASSANDRA-8479:


1. read_repair_chance=1 means that on every request, digest requests will be 
sent to all nodes, local and remote, so this is almost certainly not what you 
want. As you're reading and writing at LOCAL_QUORUM, you don't need to worry 
about read repair. A full read will be done by 1 node in the local DC, and a 
digest request sent to 1 other local node. This is enough to provide the 
consistency you're looking for: should the digest not match the data, a full 
read will be done on both nodes and the data resolved according to timestamp. 
Since you wrote at LOCAL_QUORUM, at least one of those nodes has the latest 
value.

You should note that this is not exactly strong consistency as you describe 
it - that would only be the case with a single client performing both reads and 
writes, with no failures. But terminology aside, I believe you can just set 
read_repair_chance to 0.

2. What's more relevant than the simple rate is the workload and the delay 
between writing a row and reading it back. It's certainly possible for a write 
to be ack'd by 2 nodes, satisfying LOCAL_QUORUM, but not yet be processed by the 
third node. If a read request targets that third node, either for the full data 
or a digest, you'll get a mismatch.

3. Yes, read_repair_chance=1 means perform a global read repair for 100% of 
read requests, which is usually overkill (see CASSANDRA-7320). The logs don't 
show which node(s) returned a non-matching digest, only that the coordinator 
received 3 responses. Those are likely to have been from the nodes in the same 
DC, and your experiment with dclocal_read_repair_chance seems to bear that out, 
but it isn't guaranteed.

4. Whether read repair happens is only determined by 
read_repair_chance/dclocal_read_repair_chance in 2.0, so CL isn't relevant (and 
has been removed from the documentation - the version you linked is for 
Cassandra 1.1).

Note, this all disregards any additional data reads being performed for [Rapid 
Read 
Protection|http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2]
 because from the logs it seems as though you have that set to NONE for this 
table.
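The timestamp-based resolution described in point 1 can be sketched as follows. This is a simplified illustration only; the function, replica names, and values are invented, not Cassandra internals:

```python
def resolve_by_timestamp(full_reads):
    """On a digest mismatch, full reads are fetched from the replicas
    that were sent digest requests, and the value with the highest
    write timestamp wins (simplified sketch of read resolution)."""
    return max(full_reads, key=lambda r: r["timestamp"])

# A LOCAL_QUORUM write reached dc1-b but not yet dc1-a, so the digests
# mismatched; the resulting full reads are resolved by timestamp.
full_reads = [
    {"replica": "dc1-a", "value": "stale", "timestamp": 100},
    {"replica": "dc1-b", "value": "fresh", "timestamp": 200},
]
print(resolve_by_timestamp(full_reads)["value"])  # fresh
```

Because the write was ack'd at LOCAL_QUORUM, at least one of the local replicas read always holds the winning (latest) timestamp.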



[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2015-01-15 Thread Anuj (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278685#comment-14278685 ]

Anuj commented on CASSANDRA-8479:
-

You are correct. These logs are from 2.0.3. As suggested in CASSANDRA-8352, we 
upgraded to 2.0.11 and tested again; the same issue was observed there.

We are using read_repair_chance=1, dclocal_read_repair_chance=0. When we set 
read_repair_chance=0, killing a node in the local or remote DC didn't lead to 
any read failures :) We need your help in understanding the following points:

1. We are using strong consistency, i.e. LOCAL_QUORUM for reads & writes. So, 
even if one of the replicas has an obsolete value, we will read the latest 
value the next time we read the data. Does that mean read_repair_chance=1 is 
not required when LOCAL_QUORUM is used for both reads and writes? Should we 
set read_repair_chance=0, which would give us better performance without 
sacrificing consistency? What is your recommendation?

2. We are writing to Cassandra at high speed. Is that the reason we are 
getting digest mismatches during read repair? And is that when Cassandra goes 
for CL.ALL, irrespective of the fact that we are using CL.LOCAL_QUORUM?

3. I think read repair is comparing digests from replicas in the remote DC 
too? Isn't that a performance hit? We are currently using Cassandra in 
Active-Passive mode, so updating the remote DC fast is not our priority. 
What's recommended? I tried setting dclocal_read_repair_chance=1 and 
read_repair_chance=0 in order to make sure that read repairs are only executed 
within the DC, and I noticed that killing a local node didn't cause any read 
failures. Does that mean the digest mismatch occurs with the node in the 
remote DC rather than with the digest of the third local node which didn't 
participate in the LOCAL_QUORUM read?

4. Documentation at 
http://www.datastax.com/docs/1.1/configuration/storage_configuration says that 
read_repair_chance specifies the probability with which read repairs should be 
invoked on non-quorum reads. What is the significance of non-quorum here? 
We are using LOCAL_QUORUM and read repair still comes into the picture.

Yes, we misunderstood tracing. Now that you have identified the issue, do you 
still need the traces?

 




[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2015-01-14 Thread Sam Tunnicliffe (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276863#comment-14276863 ]

Sam Tunnicliffe commented on CASSANDRA-8479:


I don't think you did actually upgrade to 2.0.11, as the line numbers in the 
logs you attached don't line up with the source (but they do match 2.0.3, 
which is the version you originally reported CASSANDRA-8352 against).

That aside, what are the read repair settings for the keyspace? It seems that 
global read repair is being triggered, causing digest requests to be sent to 
all endpoints (the default is 0.1, which tallies with your report of ~10% error 
rate). At least one of the digests doesn't match, triggering a blocking full 
read against all the replicas that were sent digest requests - which includes 
the down node in the remote DC. Once failure detection kicks in, the down node 
is no longer included in the digest round and so the errors stop.

CASSANDRA-7947 (2.0.12) doesn't prevent this happening as it's the correct 
behaviour, but it does add additional debug logging & tracing, as well as 
reporting the original CL (in this case LOCAL_QUORUM) in the thrown exception.
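The back-of-envelope arithmetic behind "the default is 0.1, which tallies with ~10% error rate" is worth spelling out. This assumes (simplification) that every global read repair blocks on the one dead remote replica until failure detection marks it down:

```python
read_repair_chance = 0.1     # table default in Cassandra 2.0
p_blocks_on_dead_node = 1.0  # a global read repair contacts all replicas,
                             # so each such read waits on the dead node

# Fraction of client reads expected to time out while the killed remote
# node is still considered alive by the failure detector.
expected_error_rate = read_repair_chance * p_blocks_on_dead_node
print(f"{expected_error_rate:.0%}")  # 10%, in line with the reported ~5-10%
```

Once the failure detector marks the node down, the dead replica drops out of the digest round and the rate returns to zero, matching the 15-20 second window of failures observed.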

Also, by 'enable tracing', Ryan was referring to request tracing as described 
in the blog post he linked, not setting the logging level to TRACE. A request 
trace for one of the failing queries would show the initial round of data and 
digest requests sent by the coordinator, as well as the digest mismatch & 
subsequent read repair requests.




[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2015-01-09 Thread Anuj (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270893#comment-14270893 ]

Anuj commented on CASSANDRA-8479:
-

I have attached TRACE level logs. You can find multiple ReadTimeoutExceptions 
in System.log.3. Once we killed Cassandra on one of the nodes in DC2, around 7 
read requests failed over roughly 17 seconds on DC1 and then everything was 
back to normal. We need to understand why these reads failed when we are using 
LOCAL_QUORUM in our application. Also, in another Cassandra log file, 
System.log.2, we saw java.nio.file.NoSuchFileException.

We got Hector's HTimedOutException in our application logs during these 17 
seconds.
Stack trace from application logs:
com.ericsson.rm.service.voucher.InternalServerException: Internal server error, me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
	at com.ericsson.rm.voucher.traffic.reservation.cassandra.CassandraReservation.getReservationSlice(CassandraReservation.java:552) ~[na:na]
	at com.ericsson.rm.voucher.traffic.reservation.cassandra.CassandraReservation.lookup(CassandraReservation.java:499) ~[na:na]
	at com.ericsson.rm.voucher.traffic.VoucherTraffic.getReservedOrPendingVoucher(VoucherTraffic.java:764) ~[na:na]
	at com.ericsson.rm.voucher.traffic.VoucherTraffic.commit(VoucherTraffic.java:686) ~[na:na]
	... 6 common frames omitted
Caused by: com.ericsson.rm.service.cassandra.xa.ConnectionException: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
	at com.ericsson.rm.cassandra.xa.keyspace.row.KeyedRowQuery.execute(KeyedRowQuery.java:93) ~[na:na]
	at com.ericsson.rm.voucher.traffic.reservation.cassandra.CassandraReservation.getReservationSlice(CassandraReservation.java:548) ~[na:na]
	... 9 common frames omitted
Caused by: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
	at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[na:na]
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:286) ~[na:na]
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:269) ~[na:na]
	at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[na:na]
	at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258) ~[na:na]
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:132) ~[na:na]
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:290) ~[na:na]
	at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53) ~[na:na]
	at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49) ~[na:na]
	at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) ~[na:na]
	at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:101) ~[na:na]
	at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48) ~[na:na]
	at com.ericsson.rm.cassandra.xa.keyspace.row.KeyedRowQuery.execute(KeyedRowQuery.java:77) ~[na:na]
	... 10 common frames omitted
Caused by: org.apache.cassandra.thrift.TimedOutException: null
	at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11504) ~[na:na]
	at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11453) ~[na:na]
	at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:11379) ~[na:na]
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[na:na]
	at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:653) ~[na:na]
	at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:637) ~[na:na]
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:274) ~[na:na]
	... 21 common frames omitted

Please have a look at https://issues.apache.org/jira/browse/CASSANDRA-8352 for 
more details about the issue.





[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2014-12-29 Thread Parth Setya (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260042#comment-14260042 ]

Parth Setya commented on CASSANDRA-8479:


We are working on getting the trace level logs. Meanwhile, can you comment on 
the following?
We are currently using the Hector (1.1.0.E001) API to query data from C*. Do 
you think this could be a Hector-related issue?
Which client did you use when you tried to reproduce the issue?





[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2014-12-29 Thread Ryan McGuire (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260149#comment-14260149 ]

Ryan McGuire commented on CASSANDRA-8479:
-

I was using the DataStax Python driver (CQL). I'll be sure to try it over 
Thrift once I can see the trace.



[jira] [Commented] (CASSANDRA-8479) Timeout Exception on Node Failure in Remote Data Center

2014-12-19 Thread Ryan McGuire (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254060#comment-14254060 ]

Ryan McGuire commented on CASSANDRA-8479:
-

I have not been able to reproduce this.

I would suggest you reproduce this again and enable tracing: 
http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2. Perhaps the trace 
log will help illuminate what is going on here.

When I kill a node in the other datacenter the query trace looks identical and 
is the same speed as otherwise.
