[ 
https://issues.apache.org/jira/browse/CASSANDRA-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333301#comment-17333301
 ] 

Sam Tunnicliffe commented on CASSANDRA-16581:
---------------------------------------------

Currently, catching a {{ProtocolException}} in the pipeline's exception 
handler is supposed to close the channel, forcing the client to reconnect. (For 
v5, this is {{o.a.c.t.ExceptionHandlers.ExceptionHandler}} and for v4 and 
lower, {{o.a.c.t.PreV5Handlers.ExceptionHandler}} in trunk and 
{{o.a.c.t.Message.ExceptionHandler}} in prior versions.) 

{code}
// On protocol exception, close the channel as soon as the message has been sent
if (cause instanceof ProtocolException)
    future.addListener((ChannelFutureListener) f -> ctx.close());

{code}

However, many if not most instances of {{ProtocolException}} are actually 
contained in a {{WrappedException}} at this point, so not many actually trigger 
this condition. This is changed by David's patches, but we spoke offline and 
agreed that this should be reverted for v4- in 3.0, 3.11 and trunk as 
reconnections can be expensive, especially on the server side when auth is 
enabled. 
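
To illustrate why the {{instanceof}} check above rarely fires, here is a minimal, self-contained sketch. It is not the actual Cassandra code: {{UnwrapSketch}}, {{unwrap}} and {{isProtocolError}} are hypothetical names, and the two exception classes are simple stand-ins that borrow only their names from the real types.

```java
// Hypothetical stand-ins for illustration only; the real ProtocolException and
// WrappedException live in Cassandra and have different definitions.
final class UnwrapSketch {
    static class ProtocolException extends RuntimeException {
        ProtocolException(String message) { super(message); }
    }

    static class WrappedException extends RuntimeException {
        WrappedException(Throwable cause) { super(cause); }
    }

    // Peel off any WrappedException layers to reach the underlying cause.
    static Throwable unwrap(Throwable t) {
        while (t instanceof WrappedException && t.getCause() != null)
            t = t.getCause();
        return t;
    }

    // A plain "cause instanceof ProtocolException" is false for a wrapped
    // exception; checking the unwrapped cause is what makes it match.
    static boolean isProtocolError(Throwable t) {
        return unwrap(t) instanceof ProtocolException;
    }
}
```

With these stand-ins, a {{WrappedException}} around a {{ProtocolException}} fails the direct {{instanceof}} test but is recognised by {{isProtocolError}}.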

As David mentioned, this is also slightly trickier in v5 as a frame can 
contain envelopes for multiple streams. In the case of a fatal error (one which 
renders the entire frame unusable), the server is not able to notify the client 
of the stream ids present in the frame. To avoid causing a wave of client side 
timeouts, we decided to fail fast and close the client connection if any 
protocol error is detected.

{quote}
From a client point of view, a dropped frame will result in request timeouts. 
We have no way of providing a better error, since the stream ids of the failed 
requests are in the corrupt payload. I'm wondering if it might not be better 
to drop the connection all the time: at least the client gets immediate 
feedback (we could try to propagate a cause), instead of a bunch of requests 
timing out for no apparent reason.
 
[1|https://issues.apache.org/jira/browse/CASSANDRA-15299?focusedCommentId=17099447&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17099447].
 
{quote}

However, many protocol errors are not fatal; examples include invalid 
consistency levels, illegal statements in a BATCH, and incorrect opcodes. In these 
cases, the server _may_ be able to respond appropriately to the invalid 
envelope and continue processing the remainder of the frame.  

I've pushed a couple of v5 specific commits 
[here|https://github.com/beobal/cassandra/tree/16581-trunk-v5]. The general 
idea is to attempt to differentiate between so-called protocol errors from 
which the server can recover and those from which it can't. With this in mind, 
the message processor will return an error response the first time it 
encounters a protocol exception, but only terminate the connection if it 
immediately encounters a second error on the very next envelope in the frame. 
The reason for failing only on consecutive errors is that any individual error 
may be recoverable. For instance, a client could send a Frame with 100 
envelopes and every other one might have some recoverable corruption. A run of 
consecutive errors in the same frame is a heuristic for identifying 
non-recoverable corruption, and while not perfect, it seems fairly reasonable 
to me. 
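
The consecutive-error heuristic can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in the linked branch: {{ConsecutiveErrorSketch}}, {{Outcome}} and {{processFrame}} are hypothetical names, and each envelope is reduced to a boolean saying whether it decoded without a protocol error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the heuristic: one error yields an error response,
// a second error on the very next envelope closes the connection.
final class ConsecutiveErrorSketch {
    enum Outcome { OK, ERROR_RESPONSE, CLOSE_CONNECTION }

    // envelopeValid.get(i) is true if the i-th envelope in the frame decoded cleanly.
    static List<Outcome> processFrame(List<Boolean> envelopeValid) {
        List<Outcome> outcomes = new ArrayList<>();
        boolean previousFailed = false;
        for (boolean valid : envelopeValid) {
            if (valid) {
                outcomes.add(Outcome.OK);
                previousFailed = false; // a good envelope resets the run of errors
            } else if (previousFailed) {
                outcomes.add(Outcome.CLOSE_CONNECTION);
                break; // stop processing the rest of the frame
            } else {
                outcomes.add(Outcome.ERROR_RESPONSE);
                previousFailed = true;
            }
        }
        return outcomes;
    }
}
```

In this model, a frame where every other envelope is bad produces alternating {{OK}} and {{ERROR_RESPONSE}} outcomes and the connection stays open, whereas two bad envelopes in a row end with {{CLOSE_CONNECTION}} and nothing after them is processed.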

An exception to this rule is if the body length advertised in the envelope 
header is invalid (i.e. < 0). In this case, the message processor is unable to 
even attempt to skip over the message, so it throws and closes the connection 
immediately.
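
A rough sketch of why a negative body length cannot be skipped over is shown below. It assumes the 9-byte envelope header layout of native protocol v3 and later (version, flags, 2-byte stream id, opcode, 4-byte body length); {{BodyLengthSketch}} and {{skipEnvelope}} are hypothetical names and the real message processor is structured differently.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: advance past one envelope using the advertised body length.
final class BodyLengthSketch {
    // Assumed layout: version(1) flags(1) stream(2) opcode(1) length(4)
    static final int HEADER_LENGTH = 9;
    static final int BODY_LENGTH_OFFSET = 5;

    // Returns the position of the next envelope, or throws if the advertised
    // length makes it impossible to know where the next envelope starts.
    static int skipEnvelope(ByteBuffer buf) {
        int start = buf.position();
        int bodyLength = buf.getInt(start + BODY_LENGTH_OFFSET);
        if (bodyLength < 0)
            throw new IllegalStateException("Invalid body length " + bodyLength + "; cannot skip envelope");
        int next = start + HEADER_LENGTH + bodyLength;
        buf.position(next);
        return next;
    }
}
```

With a valid length the processor can discard the bad envelope and carry on with the next one; with a negative length there is no safe offset to resume from, which is why closing the connection is the only option in that case.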


> Failure to execute queries should emit a KPI other than read 
> timeout/unavailable so it can be alerted/tracked
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16581
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16581
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client, Observability/Metrics
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0-rc
>
>
> When we are unable to parse a message, we have no way to detect this 
> from a monitoring point of view, so we can get into situations where we believe 
> the database is fine but the clients are on fire.  This case popped up in the 
> 2.1 to 3.0 upgrade as paging state wasn’t mixed-mode safe.


