[jira] [Comment Edited] (CASSANDRA-16581) Failure to execute queries should emit a KPI other than read timeout/unavailable so it can be alerted/tracked
[ https://issues.apache.org/jira/browse/CASSANDRA-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348687#comment-17348687 ] Sam Tunnicliffe edited comment on CASSANDRA-16581 at 5/20/21, 6:48 PM: --- Oh, I totally missed that in {{SimpleClient}}, sorry. In that case, I definitely think we need to revisit how the test is implemented. That assertion in {{FrameEncoder}} is important, not only for the client protocol, but for internode which shares the class. There's quite a lot of machinery in {{CQLConnectionTest}} to give control over the exact bytes that the v5 pipeline classes have to deal with. Unfortunately, retro fitting that to support v4 and earlier is not a trivial task. I've pushed an alternative reworking of the test [here|https://github.com/beobal/cassandra/tree/CASSANDRA-16581-4.0]. Instead of changing the encoders/decoders this adds a test implementation of {{Message.Request}} which we can prime with some garbage bytes, so that when it's decoded, a \{{ProtocolError}} is guaranteed. This required a tiny bit of refactoring in {{Envelope}} to make the v4 encoding ask-dont-tell like v5, but it gives the test code control over the actual bytes that get decoded on the server rather than having to bypass the checks in the outbound pipeline. A nice side effect of this is that the v5 test is now exercising the CQL protocol decoding, rather than hitting the CRC errors in the framing layer. There are quite a few tests in {{CQLConnectionTest}} which do verify that, so we should be good there. This also means that the channel-closing behaviour is the same between the v4 & v5 tests. Now the message pipelines and failure behaviours are aligned across versions, we don't need {{WrappedSimpleClient}} any more and because the test cases are identical, I've made the test {{@Parameterized}} so it will automatically cover all supported versions. was (Author: beobal): Oh, I totally missed that in {{SimpleClient}}, sorry. In that case, I definitely think we need to revisit how the test is implemented. That assertion in {{FrameEncoder}} is important, not only for the client protocol, but for internode which shares the class. There's quite a lot of machinery in {{CQLConnectionTest}} to give control over the exact bytes that the v5 pipeline classes have to deal with. Unfortunately, retro fitting that to support v4 and earlier is not a trivial task. I've pushed an alternative reworking of the test [here|https://github.com/beobal/cassandra/tree/CASSANDRA-16581-4.0]. Instead of changing the encoders/decoders this adds a test implementation of {{Message.Request}} which we can prime with some garbage bytes, so that when it's decoded, a{{ProtocolError}} is guaranteed. This required a tiny bit of refactoring in {{Envelope}} to make the v4 encoding ask-dont-tell like v5, but it gives the test code control over the actual bytes that get decoded on the server rather than having to bypass the checks in the outbound pipeline. A nice side effect of this is that the v5 test is now exercising the CQL protocol decoding, rather than hitting the CRC errors in the framing layer. There are quite a few tests in {{CQLConnectionTest}} which do verify that, so we should be good there. This also means that the channel-closing behaviour is the same between the v4 & v5 tests. Now the message pipelines and failure behaviours are aligned across versions, we don't need {{WrappedSimpleClient}} any more and because the test cases are identical, I've made the test {{@Parameterized}} so it will automatically cover all supported versions. > Failure to execute queries should emit a KPI other than read > timeout/unavailable so it can be alerted/tracked > - > > Key: CASSANDRA-16581 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16581 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Metrics >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-rc > > > When we are unable to parse a message we do not have a way to detect this > from a monitoring point of view so can get into situations where we believe > the database is fine but the clients are on-fire. This case popped up in the > 2.1 to 3.0 upgrade as paging state wasn’t mixed-mode safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-16581) Failure to execute queries should emit a KPI other than read timeout/unavailable so it can be alerted/tracked
[ https://issues.apache.org/jira/browse/CASSANDRA-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348687#comment-17348687 ] Sam Tunnicliffe edited comment on CASSANDRA-16581 at 5/20/21, 6:47 PM: --- Oh, I totally missed that in {{SimpleClient}}, sorry. In that case, I definitely think we need to revisit how the test is implemented. That assertion in {{FrameEncoder}} is important, not only for the client protocol, but for internode which shares the class. There's quite a lot of machinery in {{CQLConnectionTest}} to give control over the exact bytes that the v5 pipeline classes have to deal with. Unfortunately, retro fitting that to support v4 and earlier is not a trivial task. I've pushed an alternative reworking of the test [here|https://github.com/beobal/cassandra/tree/CASSANDRA-16581-4.0]. Instead of changing the encoders/decoders this adds a test implementation of {{Message.Request}} which we can prime with some garbage bytes, so that when it's decoded, a{{ProtocolError}} is guaranteed. This required a tiny bit of refactoring in {{Envelope}} to make the v4 encoding ask-dont-tell like v5, but it gives the test code control over the actual bytes that get decoded on the server rather than having to bypass the checks in the outbound pipeline. A nice side effect of this is that the v5 test is now exercising the CQL protocol decoding, rather than hitting the CRC errors in the framing layer. There are quite a few tests in {{CQLConnectionTest}} which do verify that, so we should be good there. This also means that the channel-closing behaviour is the same between the v4 & v5 tests. Now the message pipelines and failure behaviours are aligned across versions, we don't need {{WrappedSimpleClient}} any more and because the test cases are identical, I've made the test {{@Parameterized}} so it will automatically cover all supported versions. was (Author: beobal): Oh, I totally missed that in {{SimpleClient}}, sorry. In that case, I definitely think we need to revisit how the test is implemented. That assertion in {{FrameEncoder}} is important, not only for the client protocol, but for internode which shares the class. There's quite a lot of machinery in {{CQLConnectionTest}} to give control over the exact bytes that the v5 pipeline classes have to deal with. Unfortunately, retro fitting that to support v4 and earlier is not a trivial task. I've pushed an alternative reworking of the test [here|https://github.com/beobal/cassandra/tree/CASSANDRA-16581-4.0]. Instead of changing the encoders/decoders this adds a test implementation of {{Message.Request}} which we can prime with some garbage bytes, so that when it's decoded, a {{ProtocolError}} is guaranteed. This required a tiny bit of refactoring in {{Envelope}} to make the v4 encoding ask-dont-tell like v5, but it gives the test code control over the actual bytes that get decoded on the server rather than having to bypass the checks in the outbound pipeline. A nice side effect of this is that the v5 test is now exercising the CQL protocol decoding, rather than hitting the CRC errors in the framing layer. There are quite a few tests in {{CQLConnectionTest}} which do verify that, so we should be good there. This also means that the channel-closing behaviour is the same between the v4 & v5 tests. Now the message pipelines and failure behaviours are aligned across versions, we don't need {{WrappedSimpleClient}} any more and because the test cases are identical, I've made the test {{@Parameterized}} so it will automatically cover all supported versions. > Failure to execute queries should emit a KPI other than read > timeout/unavailable so it can be alerted/tracked > - > > Key: CASSANDRA-16581 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16581 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Metrics >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-rc > > > When we are unable to parse a message we do not have a way to detect this > from a monitoring point of view so can get into situations where we believe > the database is fine but the clients are on-fire. This case popped up in the > 2.1 to 3.0 upgrade as paging state wasn’t mixed-mode safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-16581) Failure to execute queries should emit a KPI other than read timeout/unavailable so it can be alerted/tracked
[ https://issues.apache.org/jira/browse/CASSANDRA-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347867#comment-17347867 ] David Capwell edited comment on CASSANDRA-16581 at 5/19/21, 9:12 PM: - [~samt] FrameEncoder change is {code} public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception { if (!(msg instanceof Payload)) -throw new IllegalStateException("Unexpected type: " + msg); +{ +ctx.write(msg, promise); +return; +} {code} This isn't a leftover from debugging, its from the fact that we no longer support it as the v5 logic uses the more lower level APIs (such as ChannelInboundHandlerAdapter); the higher level APIs pass through objects which do not match the type. Here is an example from v4 {code} public static class ProtocolEncoder extends MessageToMessageEncoder {code} MessageToMessageEncoder has the following https://github.com/netty/netty/blob/4.1/codec/src/main/java/io/netty/handler/codec/MessageToMessageEncoder.java#L84-L100 {code} if (acceptOutboundMessage(msg)) { ... } else { ctx.write(msg, promise); } {code} I use this in the simple client to send arbitrary payloads to validate server behavior. ProtocolVersion - made the change was (Author: dcapwell): [~samt] FrameEncoder change is {code} public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception { if (!(msg instanceof Payload)) -throw new IllegalStateException("Unexpected type: " + msg); +{ +ctx.write(msg, promise); +return; +} {code} This isn't a leftover from debugging, its from the fact that we no longer support it as the v5 logic uses the more lower level APIs (such as ChannelInboundHandlerAdapter); the higher level APIs pass through objects which do not match the type. Here is an example from v4 {code} public static class ProtocolEncoder extends MessageToMessageEncoder {code} MessageToMessageEncoder has the following https://github.com/netty/netty/blob/4.1/codec/src/main/java/io/netty/handler/codec/MessageToMessageEncoder.java#L84-L100 {code} if (acceptOutboundMessage(msg)) { ... } else { ctx.write(msg, promise); } {code} I use this in the simple client to send arbitrary payloads to validate server behavior. ProtocolVersion - made the change; back porting now > Failure to execute queries should emit a KPI other than read > timeout/unavailable so it can be alerted/tracked > - > > Key: CASSANDRA-16581 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16581 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Metrics >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-rc > > > When we are unable to parse a message we do not have a way to detect this > from a monitoring point of view so can get into situations where we believe > the database is fine but the clients are on-fire. This case popped up in the > 2.1 to 3.0 upgrade as paging state wasn’t mixed-mode safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org