[ https://issues.apache.org/jira/browse/CASSANDRA-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333301#comment-17333301 ]
Sam Tunnicliffe commented on CASSANDRA-16581:
---------------------------------------------

Currently, a {{ProtocolException}} caught by the pipeline's exception handler is supposed to close the channel, forcing the client to reconnect. For v5, this handler is {{o.a.c.t.ExceptionHandlers.ExceptionHandler}}; for v4 and lower, it is {{o.a.c.t.PreV5Handlers.ExceptionHandler}} in trunk and {{o.a.c.t.Message.ExceptionHandler}} in prior versions.

{code}
// On protocol exception, close the channel as soon as the message have been sent
if (cause instanceof ProtocolException)
    future.addListener((ChannelFutureListener) f -> ctx.close());
{code}

However, many if not most instances of {{ProtocolException}} are actually contained in a {{WrappedException}} at this point, so not many actually trigger this condition. This is changed by David's patches, but we spoke offline and agreed that the change should be reverted for v4 and lower in 3.0, 3.11 and trunk, as reconnections can be expensive, especially on the server side when auth is enabled.

As David mentioned, this is also slightly more tricky in v5, as a frame can contain envelopes for multiple streams. In the case of a fatal error (one which renders the entire frame unusable), the server is not able to notify the client of the stream ids present in the frame. To avoid causing a wave of client-side timeouts, we decided to fail fast and close the client connection if any protocol error is detected.

{quote}
From a client point of view, a dropped frame will result in request timeouts. We have no way of providing a better error, since the stream ids of the failed requests are in the corrupt payload. I'm wondering if it might not be better to drop the connection all the time: at least the client gets immediate feedback (we could try to propagate a cause), instead of a bunch of requests timing out for no apparent reason.
[1|https://issues.apache.org/jira/browse/CASSANDRA-15299?focusedCommentId=17099447&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17099447]. {quote}

However, many protocol errors are not fatal; examples include invalid consistency levels, illegal statements in a BATCH, and an incorrect opcode. In these cases, the server _may_ be able to respond appropriately to the invalid envelope and continue processing the remainder of the frame.

I've pushed a couple of v5-specific commits [here|https://github.com/beobal/cassandra/tree/16581-trunk-v5]. The general idea is to attempt to differentiate between protocol errors from which the server can recover and those from which it can't. With this in mind, the message processor will return an error response the first time it encounters a protocol exception, but only terminate the connection if it immediately encounters a second error on the very next envelope in the frame.

The reason for failing only on consecutive errors is that any individual error may be recoverable. For instance, a client could send a frame with 100 envelopes and every other one might have some recoverable corruption. A run of consecutive errors in the same frame is a heuristic for identifying non-recoverable corruption, and while not perfect, it seems fairly reasonable to me.

An exception to this rule is if the body length advertised in the envelope header is invalid (i.e. < 0). In this case, the message processor is unable to even attempt to skip over the message, so it throws and closes the connection immediately.
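
To make the consecutive-error heuristic concrete, here is a minimal sketch of the decision logic only. It is not the code from the linked branch: the class {{ConsecutiveErrorSketch}}, the enums {{Outcome}} and {{Action}}, and the method {{processFrame}} are all hypothetical names invented for illustration, and each envelope is assumed to have already been classified as decodable, recoverably broken, or carrying an invalid body length.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not Cassandra's actual classes. */
final class ConsecutiveErrorSketch {

    /** Assumed classification of one envelope after attempting to decode it. */
    enum Outcome { OK, RECOVERABLE_ERROR, INVALID_BODY_LENGTH }

    /** What the sketch decides to do for one envelope. */
    enum Action { PROCESS, SEND_ERROR_RESPONSE, CLOSE_CONNECTION }

    /**
     * Walks the envelopes of a single frame in order and returns the action
     * taken for each. Processing stops as soon as the connection is closed.
     */
    static List<Action> processFrame(List<Outcome> envelopes) {
        List<Action> actions = new ArrayList<>();
        boolean previousFailed = false;

        for (Outcome outcome : envelopes) {
            // A negative body length means the envelope cannot be skipped,
            // so the rest of the frame is unreadable: close immediately.
            if (outcome == Outcome.INVALID_BODY_LENGTH) {
                actions.add(Action.CLOSE_CONNECTION);
                return actions;
            }

            if (outcome == Outcome.RECOVERABLE_ERROR) {
                // A second error on the very next envelope is treated as a
                // sign of non-recoverable corruption.
                if (previousFailed) {
                    actions.add(Action.CLOSE_CONNECTION);
                    return actions;
                }
                // First error in a run: answer with an error and carry on.
                actions.add(Action.SEND_ERROR_RESPONSE);
                previousFailed = true;
            } else {
                // A good envelope resets the run of errors.
                actions.add(Action.PROCESS);
                previousFailed = false;
            }
        }
        return actions;
    }
}
```

Under these assumptions, a frame whose errors alternate with good envelopes never closes the connection, while two errors back to back do. Whether the run of errors should also reset at frame boundaries is left out of the sketch.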
> Failure to execute queries should emit a KPI other than read timeout/unavailable so it can be alerted/tracked
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16581
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16581
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client, Observability/Metrics
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0-rc
>
>
> When we are unable to parse a message we do not have a way to detect this
> from a monitoring point of view so can get into situations where we believe
> the database is fine but the clients are on-fire. This case popped up in the
> 2.1 to 3.0 upgrade as paging state wasn’t mixed-mode safe.

--
This message was sent by Atlassian Jira (v8.3.4#803005)