[jira] [Commented] (CASSANDRA-7886) Coordinator should not wait for read timeouts when replicas hit Exceptions

Christian Spriegel (JIRA) Tue, 23 Dec 2014 04:53:46 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256948#comment-14256948 ]


Christian Spriegel commented on CASSANDRA-7886:
-----------------------------------------------

Hi @thobbs!
I have a Christmas present for you, in the form of a patch file ;-) I attached a 
v5 patch that contains the fixes.

Regarding TOE (TombstoneOverwhelmingException): Currently I throw TOEs as 
exceptions and they get logged just like any other exception. I am not sure if 
this is desirable and would like to hear your feedback. I think we have the 
following options:
- Leave it as it is in v5, meaning TOEs get logged with stacktraces.
- Add catch blocks where necessary and log TOEs in a user-friendly way. But this 
might be needed in many places. Also, in this case I would prefer making TOE a 
checked exception. Imho TOE should not be unchecked.
- Add TOE logging to the C* default exception handler. (I did not investigate 
yet, but I assume there is an exception handler.)
- Leave it as it was before.
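
To make the second option more concrete, here is a minimal sketch of a catch block that logs a TOE as a single line instead of a stack trace. It is not taken from the v5 patch: the exception class below is a hypothetical stand-in declared as a checked exception, and `ReadCommand`, `describe` and `runRead` are names invented for illustration only.

```java
// Hypothetical stand-in for Cassandra's TombstoneOverwhelmingException.
// Assumption: declared as a checked exception, as preferred in the option above.
class TombstoneOverwhelmingException extends Exception {
    TombstoneOverwhelmingException(String message) {
        super(message);
    }
}

class ToeLoggingSketch {
    // Hypothetical read operation that may abort when it scans too many tombstones.
    interface ReadCommand {
        String execute() throws TombstoneOverwhelmingException;
    }

    // Builds a one-line, user-friendly description without a stack trace.
    static String describe(TombstoneOverwhelmingException e) {
        return "Read aborted, too many tombstones: " + e.getMessage();
    }

    // Runs the read; returns null when the read was aborted by a TOE.
    static String runRead(ReadCommand command) {
        try {
            return command.execute();
        } catch (TombstoneOverwhelmingException e) {
            // Log only the short description instead of the full stack trace.
            System.err.println(describe(e));
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(runRead(() -> "row"));
        System.out.println(runRead(() -> {
            throw new TombstoneOverwhelmingException("tombstone threshold exceeded in test.test");
        }));
    }
}
```

Because the stand-in is checked, the compiler forces every caller to either catch it or declare it, which is what makes the "many places" concern visible at compile time.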


Here are a few examples of how TOEs now look to the user:

TOE using a 3.0 CQLSH (still on CQL-protocol 3):
{code}
cqlsh:test> select * from test;
code=1200 [Coordinator node timed out waiting for replica nodes' responses] 
message="Operation timed out - received only 0 responses." 
info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:test> 
{code}


TOE using a 2.0 CQLSH:
{code}
cqlsh:test> select * from test;
Request did not complete within rpc_timeout.
{code}


TOE with cassandra-cli:
{code}
[default@unknown] use test;
Authenticated to keyspace: test
[default@test] list test;
Using default limit of 100
Using default cell limit of 100
null
TimedOutException()
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result$get_range_slices_resultStandardScheme.read(Cassandra.java:17448)
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result$get_range_slices_resultStandardScheme.read(Cassandra.java:17397)
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:17323)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:802)
        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:786)
        at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1520)
        at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:285)
        at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:201)
        at org.apache.cassandra.cli.CliMain.main(CliMain.java:331)
[default@test] 
{code}




> Coordinator should not wait for read timeouts when replicas hit Exceptions
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7886
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7886
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Tested with Cassandra 2.0.8
>            Reporter: Christian Spriegel
>            Assignee: Christian Spriegel
>            Priority: Minor
>              Labels: protocolv4
>             Fix For: 3.0
>
>         Attachments: 7886_v1.txt, 7886_v2_trunk.txt, 7886_v3_trunk.txt, 
> 7886_v4_trunk.txt, 7886_v5_trunk.txt
>
>
> *Issue*
> When you have TombstoneOverwhelmingExceptions occurring in queries, this will 
> cause the query to be simply dropped on every data-node, but no response is 
> sent back to the coordinator. Instead the coordinator waits for the specified 
> read_request_timeout_in_ms.
> On the application side this can cause memory issues, since the application 
> is waiting for the timeout interval for every request. Therefore, if our 
> application runs into TombstoneOverwhelmingExceptions, then (sooner or later) 
> our entire application cluster goes down :-(
> *Proposed solution*
> I think the data nodes should send an error message to the coordinator when 
> they run into a TombstoneOverwhelmingException. Then the coordinator does not 
> have to wait for the timeout-interval.
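
The proposed behaviour in the description above can be sketched roughly as follows. This is an illustration only, not the code from any of the attached patches: `FailFastReadCallback`, `Outcome`, `onResponse`, `onFailure` and `await` are hypothetical names, and the rule "stop waiting once the required number of responses can no longer be reached" is an assumption about how a coordinator could use the error messages.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical coordinator-side callback; not Cassandra's actual class.
class FailFastReadCallback {
    enum Outcome { SUCCESS, REPLICA_FAILURE, TIMEOUT }

    private final int blockFor;       // responses required by the consistency level
    private final int totalReplicas;  // replicas the request was sent to
    private final AtomicInteger successes = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private final CountDownLatch done = new CountDownLatch(1);

    FailFastReadCallback(int blockFor, int totalReplicas) {
        this.blockFor = blockFor;
        this.totalReplicas = totalReplicas;
    }

    // Called when a replica returns data.
    void onResponse() {
        if (successes.incrementAndGet() >= blockFor) {
            done.countDown();
        }
    }

    // Called when a replica reports an error (for example a TOE) instead of staying silent.
    void onFailure() {
        // Stop waiting as soon as the remaining replicas cannot satisfy blockFor.
        if (totalReplicas - failures.incrementAndGet() < blockFor) {
            done.countDown();
        }
    }

    // Waits until the outcome is known or the timeout expires.
    Outcome await(long timeoutMillis) {
        try {
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return Outcome.TIMEOUT;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for replicas", e);
        }
        return successes.get() >= blockFor ? Outcome.SUCCESS : Outcome.REPLICA_FAILURE;
    }
}
```

With a single replica and consistency ONE, one call to `onFailure()` makes `await` return `REPLICA_FAILURE` immediately, rather than blocking for the full read timeout.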



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
