[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104465#comment-17104465 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 5/11/20, 1:39 PM: --- bq. find out that the _uncompressed_ size is above the limit, split it into 128 KiB chunks, then compress each chunk separately and wrap it into its own (uncontained) outer frame? Yes, exactly. bq. uncontained frames are never recoverable when compression is enabled Large messages are never recoverable when a corrupt payload is detected. Self-contained frames *could* be recoverable as long as only the payload is corrupt, but as we discussed earlier this is more complicated here than in the internode case due to the payload potentially containing multiple stream ids, so we may as well close the connection whenever a corrupt frame is encountered.
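To make the chunking concrete, here is a rough sketch of the splitting step confirmed above. It is illustrative only: the class and constant names are invented, and {{Deflater}} stands in for the LZ4 block compressor the server actually uses. The point is simply that each 128 KiB chunk of *uncompressed* bytes is compressed independently before being wrapped in its own uncontained outer frame.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class LargeMessageChunker {
    // The 128 KiB limit discussed above; hypothetical constant name.
    static final int MAX_UNCOMPRESSED_CHUNK = 128 * 1024;

    /** Split an over-limit message body into chunks of at most 128 KiB of
     *  uncompressed bytes, compressing each chunk independently. Each element
     *  would then be wrapped in its own non-self-contained outer frame. */
    static List<byte[]> splitAndCompress(byte[] message) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < message.length; offset += MAX_UNCOMPRESSED_CHUNK) {
            int len = Math.min(MAX_UNCOMPRESSED_CHUNK, message.length - offset);
            chunks.add(compress(message, offset, len));
        }
        return chunks;
    }

    // Deflate stands in for LZ4 here, purely so the sketch is self-contained.
    static byte[] compress(byte[] src, int offset, int len) {
        Deflater deflater = new Deflater();
        deflater.setInput(src, offset, len);
        deflater.finish();
        // Worst case: incompressible input plus small per-block overhead.
        byte[] buf = new byte[len + (len >> 3) + 64];
        int written = 0;
        while (!deflater.finished())
            written += deflater.deflate(buf, written, buf.length - written);
        deflater.end();
        byte[] out = new byte[written];
        System.arraycopy(buf, 0, out, 0, written);
        return out;
    }
}
```

A 300 KiB message, for example, would yield three chunks (128 KiB, 128 KiB, 44 KiB), each compressed on its own.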
> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol > v5-beta > --- > > Key: CASSANDRA-15299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15299 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Aleksey Yeschenko >Assignee: Sam Tunnicliffe >Priority: Normal > Labels: protocolv5 > Fix For: 4.0-beta > > > CASSANDRA-13304 made an important improvement to our native protocol: it > introduced checksumming/CRC32 to request and response bodies. It’s an > important step forward, but it doesn’t cover the entire stream. In > particular, the message header is not covered by a checksum or a crc, which > poses a correctness issue if, for example, {{streamId}} gets corrupted. > Additionally, we aren’t quite using CRC32 correctly, in two ways: > 1. We are calculating the CRC32 of the *decompressed* value instead of > computing the CRC32 on the bytes written on the wire - losing the properties > of the CRC32. In some cases, due to this sequencing, attempting to decompress > a corrupt stream can cause a segfault by LZ4. > 2. When using CRC32, the CRC32 value is written in the incorrect byte order, > also losing some of the protections. > See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for > explanation for the two points above. > Separately, there are some long-standing issues with the protocol - since > *way* before CASSANDRA-13304. Importantly, both checksumming and compression > operate on individual message bodies rather than frames of multiple complete > messages. In reality, this has several important additional downsides. To > name a couple: > # For compression, we are getting poor compression ratios for smaller > messages - when operating on tiny sequences of bytes. In reality, for most > small requests and responses we are discarding the compressed value as it’d > be smaller than the uncompressed one - incurring both redundant allocations > and compressions. 
> # For checksumming and CRC32 we pay a high overhead price for small messages. > 4 bytes extra is *a lot* for an empty write response, for example. > To address the correctness issue of {{streamId}} not being covered by the > checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we > should switch to a framing protocol with multiple messages in a single frame. > I suggest we reuse the framing protocol recently implemented for internode > messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, > and that we do it before native protocol v5 graduates from beta. See > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java > and > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107811#comment-17107811 ] Olivier Michallat edited comment on CASSANDRA-15299 at 5/15/20, 12:49 AM: -- As I'm starting to work on the driver changes, I have a few more remarks about the binary format:
* length fields are just big enough to accommodate the 128 KiB limit. Wouldn't it be more cautious to keep a bit of margin? I have no issue with the current value, or even the fact that it is hard-coded; but it feels a bit restrictive to remove any possibility of ever increasing it without a protocol change.
* along the same lines, maybe the header should reserve more space for flags. We only have 1 so far (and space for 5 more), but as experience has shown, new needs can arise over time. Legacy messages use either 1 or 4 bytes.
* why store the uncompressed length in the header? In the legacy frame format, it's encoded in the compressed payload (Snappy does that natively, and for LZ4 we prepend it). We have code that takes a buffer and returns a compressed buffer; it would be nice to reuse it without having to change anything. I don't see any advantage in having it in the header: if the payload is corrupted we have nothing to decompress anyway, so the uncompressed length is of no use.
Putting all those changes together, we could have a unique header format that works both with and without compression:
{code:java}
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Payload Length                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     flags     |                CRC24 of Header                |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
{code}
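The proposed unified header above could be encoded and verified roughly as follows. Everything in this sketch beyond the field widths in the diagram is an assumption made for illustration: the byte order, the class and method names, and the CRC-24 parameters (polynomial 0x1974F0B with initializer 0x875060, borrowed from Cassandra's internode framing) are not specified in the comment.

```java
public class ProposedHeader {
    // Assumed CRC-24 parameters, as used by Cassandra's internode framing (Crc.java).
    private static final int CRC24_POLY = 0x1974F0B;
    private static final int CRC24_INIT = 0x875060;

    /** 8-byte header: 4-byte payload length, 1 flags byte, 3-byte CRC24
     *  computed over the first 5 bytes. Big-endian chosen arbitrarily here;
     *  the actual byte order would be fixed by the protocol spec. */
    static byte[] encode(int payloadLength, int flags) {
        byte[] h = new byte[8];
        h[0] = (byte) (payloadLength >>> 24);
        h[1] = (byte) (payloadLength >>> 16);
        h[2] = (byte) (payloadLength >>> 8);
        h[3] = (byte) payloadLength;
        h[4] = (byte) flags;
        int crc = crc24(h, 5);
        h[5] = (byte) (crc >>> 16);
        h[6] = (byte) (crc >>> 8);
        h[7] = (byte) crc;
        return h;
    }

    /** Recompute the CRC24 and compare against the stored value. */
    static boolean verify(byte[] h) {
        int stored = ((h[5] & 0xFF) << 16) | ((h[6] & 0xFF) << 8) | (h[7] & 0xFF);
        return crc24(h, 5) == stored;
    }

    // Bitwise CRC-24, MSB first; slow but dependency-free for the sketch.
    static int crc24(byte[] bytes, int len) {
        int crc = CRC24_INIT;
        for (int i = 0; i < len; i++) {
            crc ^= (bytes[i] & 0xFF) << 16;
            for (int bit = 0; bit < 8; bit++) {
                crc <<= 1;
                if ((crc & 0x1000000) != 0)
                    crc ^= CRC24_POLY;
            }
        }
        return crc & 0xFFFFFF;
    }
}
```

Any single corrupted bit in the length or flags fields then fails verification before the header is trusted.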
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113865#comment-17113865 ] Benedict Elliott Smith edited comment on CASSANDRA-15299 at 5/22/20, 9:35 AM: -- bq. That is, the specific lengths are chosen to ensure that the CRC 24/32 provides adequate protection. FWIW, depending on your definition of "adequate" CRC32 may well not fit the bill for 128KiB, so with CRC32 for the payload 128KiB should very much be an upper limit. A protocol change should move us to CRC64 or CRC128 before we consider increasing the maximum frame size. For reference, from Koopman: https://users.ece.cmu.edu/~koopman/crc/c32/0x82608edb_len.txt This means that the minimum number of bit errors that can corrupt the stream undetected is 3 when the payload is smaller than 91608 bits (~11KiB), and just 2 when it is larger than that. I personally doubt there will be much value in increasing the frame size soon anyway, since the goal is only to amortise the cost of splitting well and increase compression efficiency. We would want to revisit the compression algorithm we use before we increase the size, to find one able to exploit longer histories (while not requiring too many resources per connection). Which is to say, it's definitely for a future protocol; and also to clarify that CRC32 was a poor man's compromise at the time of authoring the internode messaging, because there is no accelerated and readily accessible CRC64 or CRC128 implementation in our supported JDKs. I expect they will be available in the future though, or we can probably include our own using the modern instructions.
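The sequencing fix from the original issue description (point 1) can be illustrated with a small sketch: checksum the bytes actually written to the wire (the compressed payload) and verify that checksum *before* the decompressor ever sees the data, so a corrupt stream can never segfault or confuse the decompressor. {{Deflater}}/{{Inflater}} stand in for LZ4 here, and the trailing-CRC frame layout is invented for the example, not the protocol's actual layout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class WireCrcExample {
    /** Compress, then append the CRC32 of the *compressed* (on-wire) bytes. */
    static byte[] frame(byte[] body) {
        byte[] compressed = deflate(body);
        CRC32 crc = new CRC32();
        crc.update(compressed, 0, compressed.length);
        byte[] out = new byte[compressed.length + 4];
        System.arraycopy(compressed, 0, out, 0, compressed.length);
        int c = (int) crc.getValue();
        // One fixed, documented byte order (little-endian chosen here).
        out[compressed.length]     = (byte) c;
        out[compressed.length + 1] = (byte) (c >>> 8);
        out[compressed.length + 2] = (byte) (c >>> 16);
        out[compressed.length + 3] = (byte) (c >>> 24);
        return out;
    }

    /** Verify the CRC first, so corrupt input never reaches the decompressor. */
    static byte[] unframe(byte[] frame) throws IOException {
        int n = frame.length - 4;
        CRC32 crc = new CRC32();
        crc.update(frame, 0, n);
        int stored = (frame[n] & 0xFF) | ((frame[n + 1] & 0xFF) << 8)
                   | ((frame[n + 2] & 0xFF) << 16) | ((frame[n + 3] & 0xFF) << 24);
        if ((int) crc.getValue() != stored)
            throw new IOException("CRC mismatch, refusing to decompress");
        return inflate(frame, n);
    }

    static byte[] deflate(byte[] src) {
        Deflater d = new Deflater();
        d.setInput(src);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!d.finished())
            bos.write(buf, 0, d.deflate(buf));
        d.end();
        return bos.toByteArray();
    }

    static byte[] inflate(byte[] src, int len) throws IOException {
        try {
            Inflater inf = new Inflater();
            inf.setInput(src, 0, len);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inf.finished())
                bos.write(buf, 0, inf.inflate(buf));
            inf.end();
            return bos.toByteArray();
        } catch (DataFormatException e) {
            throw new IOException(e);
        }
    }
}
```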
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139920#comment-17139920 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 6/18/20, 6:58 PM: --- Pushed an updated branch where the protocol changes are pretty stable. I've been testing this with the [java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the driver and with debug-cql, and everything is working pretty much as expected. Due to the current lack of support in the python driver, I've had to use a modified [dtest branch|https://github.com/beobal/cassandra-dtest/tree/15299] and make a temporary hack to cqlsh, but the tests that are running are pretty much green. Obviously, those are not covering anything from v5 yet, but my primary concern was to make sure there are no regressions for v4 clients. Although I think this is ready for some more eyes on it, there's still a non-trivial amount of work to be done. Items outstanding include: * Comprehensive unit and in-jvm tests - in progress * Metrics * Python driver support (doesn't have to be fully implemented, but a basic level is needed for pytests and cqlsh) * Documentation * Renaming existing classes. There are a number of slightly confusing conflicts in naming now. These should be simple to resolve, mostly just automated renaming, but I've held off doing them for now because they'll make the patch much bigger and probably harder to read. The patch is also not quite as massive as it might appear at first. A large proportion is the revert of CASSANDRA-13304, and the rest is largely additive. I've tried hard not to touch code on the v4 path, even where it could clearly be refactored, to minimize the delta & the risk there. So the patch largely consists of some v5-specific additions, moving a few existing classes/methods around, and changing modifiers on previously private/package-private things.
I'm still actively working on the tests, but I don't think that (nor the renaming) need hold up review any more. |branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]| |dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]| |ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148831#comment-17148831 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 6/30/20, 5:06 PM: --- Thanks for the comments [~ifesdjeen] & [~omichallat], I've pushed a few commits to the [branch|https://github.com/beobal/cassandra/commits/15299-trunk]. {quote} There are several things that I wanted to bring to your attention: {quote} I've handled most of these in a refactor of Flusher. As you suggested, for framed items we now collate the frames and only allocate the payloads when we flush to the netty channel. So now, we allocate the payload based on the actual number of bytes required for the specific channel. {quote}{{ExceptionHandlers$PostV5ExceptionHandler#exceptionCaught}}: when flushing an exception, we don't call release on the payload. {quote} Included in the {{minor cleanups}} commit {quote}There are several places in {{SimpleClient}} where {{largePayload#release}} isn't called. {quote} I've refactored the flushing of large messages in {{SimpleClient}} to match {{Flusher}}, so this is working properly now. {quote}Other things... {quote} {quote}{{Dispatcher#processRequest}}, we don't need to cast error to {{Message.Response}} if we change its type to {{ErrorMessage}}. {quote} {quote}In {{CqlMessageHandler#releaseAfterFlush}}, we can call {{sourceFrame#release()}} instead of {{sourceFrame.body.release()}} for consistency with other calls {quote} Both in {{minor cleanups}} {quote}{{Server#requestPayloadInFlightPerEndpoint}} can be a non-static {{Server}} member. {quote} If you don't mind I'd prefer to leave this as it is for now, as it's pre-existing and changing it would require reworking CASSANDRA-15519 (changing limits at runtime). {quote}Should we hide {{flusher.queued.add()}} behind a method to disallow accessing the queue directly? {quote} I've done this, but I'm not 100% convinced of its utility.
As the two {{Flusher}} subclasses need access to the queue, we have to provide package-private methods {{poll}} and {{isEmpty}} as well as one to {{enqueue}}. So unless we move {{Flusher}} to its own subpackage, the queue is effectively visible to everything else in {{o.a.c.transport}} {quote}We can change the code a bit to make {{FlushItemConverter}} instances explicit. Right now, we basically have two converters both called {{#toFlushItem}} in {{CQLMessageHandler}} and {{LegacyDispatchHandler}}. We could have them as inner classes. It's somewhat useful since if you change the signature of this method, or stop using it, it'll be hard to find that it is actually an implementation of converter. {quote} I've left this as it is just for the moment. I'm working on some tests which supply a lambda to act as the converter, so I'll come back to this when those have solidified a bit more. {quote}Looks like {{MessageConsumer}} could be generic, since we cast it to either request or response. {quote} I've parameterised {{MessageConsumer}} & {{CQLMessageHandler}} according to the subclass of Message they expect and extended this a bit by moving the logic out of {{Message$ProtocolEncoder}} to an abstract {{Message$Decoder}} with concrete subclasses for {{Request}} and {{Response}}. {quote}Looks like {{CQLMessageHandler#processCorruptFrame}} initially had an intention of handling recovery, but now just throws a CRC exception regardless. This does match description, but usage of {{isRecoverable}} seems to be redundant here, unless we change semantics of recovery. {quote} It is somewhat redundant here, except that it logs a slightly different message to indicate whether the CRC mismatch was found in the frame header or body. I'll leave it as it is for now as it's technically possible to recover from a corrupt body, but would be problematic for clients just now. I still have some comments to address, as well as those from [~omichallat] ... 
{quote}Frame$Decoder and other classes that are related to legacy path can be extracted to a separate class, since Frame itself is still useful, but classes that facilitate legacy encoding/decoding/etc can be extracted. {quote} {quote}Frame#encodeHeaderInto seems to be duplicating the logic we have in Frame$Encoder#encodeHeader, should we unify the two? Maybe we can have encoding/decoding methods shared for both legacy and new paths, for example, as static methods? {quote} {quote}As you have mentioned, it would be great to rename Frame to something different, like Envelope, since right now we have FrameDecoder#Frame and Frame$Decoder and variable names that correspond with class names, which makes it all hard to follow. {quote}
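The {{Flusher}} change described in the comment above (collating queued items and sizing the payload only at flush time, from the actual number of bytes required) can be sketched roughly as follows. The types and names here are invented stand-ins, not the actual {{Flusher}} API; the real code frames the bytes and writes them to a Netty channel rather than returning a buffer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

public class CollatingFlusher {
    // Hypothetical flush item: the encoded bytes for one response message.
    static final class FlushItem {
        final byte[] encoded;
        FlushItem(byte[] encoded) { this.encoded = encoded; }
    }

    private final Queue<FlushItem> queued = new ArrayDeque<>();

    // Queue access is hidden behind a method, per the review suggestion.
    void enqueue(FlushItem item) { queued.add(item); }

    /** At flush time, size the payload from the bytes actually queued for
     *  this channel, instead of pre-allocating per message. */
    ByteBuffer flush() {
        int total = 0;
        for (FlushItem item : queued)
            total += item.encoded.length;
        ByteBuffer payload = ByteBuffer.allocate(total);
        FlushItem item;
        while ((item = queued.poll()) != null)
            payload.put(item.encoded);
        payload.flip();
        // In the real code this payload would be framed and written to the channel.
        return payload;
    }
}
```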
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148831#comment-17148831 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 6/30/20, 5:08 PM: --- Thanks for the comments [~ifesdjeen] & [~omichallat], I've pushed a few commits to the [branch|https://github.com/beobal/cassandra/commits/15299-trunk]. {quote} There are several things that I wanted to bring to your attention: {quote} I've handled most of these in a refactor of Flusher. As you suggested, for framed items we now collate the frames and only allocate the payloads when we flush to the netty channel. So now, we allocate the payload based on the actual number of bytes required for the specific channel. {quote}{{ExceptionHandlers$PostV5ExceptionHandler#exceptionCaught}}: when flushing an exception, we don't call release on the payload. {quote} Included in the {{minor cleanups}} commit {quote}There are several places in {{SimpleClient}} where {{largePayload#release}} isn't called. {quote} I've refactored the flushing of large messages in {{SimpleClient}} to match {{Flusher}}, so this is working properly now. {quote}Other things... {quote} {quote}{{Dispatcher#processRequest}}, we don't need to cast error to {{Message.Response}} if we change its type to {{ErrorMessage}}. {quote} {quote}In {{CqlMessageHandler#releaseAfterFlush}}, we can call {{sourceFrame#release()}} instead of {{sourceFrame.body.release()}} for consistency with other calls {quote} Both in {{minor cleanups}} {quote}{{Server#requestPayloadInFlightPerEndpoint}} can be a non-static {{Server}} member. {quote} If you don't mind I'd prefer to leave this as it is for now, as it's pre-existing and changing it would require reworking CASSANDRA-15519 (changing limits at runtime). {quote}Should we hide {{flusher.queued.add()}} behind a method to disallow accessing the queue directly? {quote} I've done this, but I'm not 100% convinced of its utility.
As the two {{Flusher}} subclasses need access to the queue, we have to provide package private methods {{poll}} and {{isEmpty}} as well as one to {{enqueue}}. So unless we move {{Flusher}} to its own subpackage, the queue is effectively visible to everything else in {{o.a.c.Transport}} {quote}We can change the code a bit to make {{FlushItemConverter}} instances explicit. Right now, we basically have two converters both called {{#toFlushItem}} in {{CQLMessageHandler}} and {{LegacyDispatchHandler}}. We could have them as inner classes. It's somewhat useful since if you change the signature of this method, or stop using it, it'll be hard to find that it is actually an implementation of converter. {quote} I've left this as it is just for the moment. I'm working on some tests which supply a lambda to act as the converter, so I'll come back to this when those have solidified a bit more. {quote}Looks like {{MessageConsumer}} could be generic, since we cast it to either request or response. {quote} I've parameterised {{MessageConsumer}} & {{CQLMessageHandler}} according to the subclass of Message they expect and extended this a bit by moving the logic out of {{Message$ProtocolEncoder}} to an abstract {{Message$Decoder}} with concrete subclasses for {{Request}} and {{Response}}. {quote}Looks like {{CQLMessageHandler#processCorruptFrame}}, initially had an intention of handling recovery, but now just throws a CRC exception regardless. This does match description, but usage of {{isRecoverable}} seems to be redundant here, unless we change semantics of recovery. {quote} It is somewhat redundant here, except that it logs a slightly different message to indicate whether the CRC mismatch was found in the frame header or body. I'll leave it as it is for now as it's technically possible to recover from a corrupt body, but would be problematic for clients just now. I still have some comments to address, as well as those from [~omichallat] ... 
{quote}{{Frame$Decoder}} and other classes that are related to legacy path can be extracted to a separate class, since {{Frame}} itself is still useful, but classes that facilitate legacy encoding/decoding/etc can be extracted. {quote} {quote}{{Frame#encodeHeaderInto}} seems to be duplicating the logic we have in {{Frame$Encoder#encodeHeader}}, should we unify the two? Maybe we can have encoding/decoding methods shared for both legacy and new paths, for example, as static methods? {quote} {quote}As you have mentioned, it would be great to rename {{Frame}} to something different, like {{Envelope}}, since right now we have {{FrameDecoder#Frame}} and {{Frame$Decoder}} and variable names that correspond with class names, which makes it all hard to follow. {quote}
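The {{MessageConsumer}} parameterisation discussed above can be illustrated with a minimal sketch. These are hypothetical, simplified stand-ins for the transport classes, not the actual Cassandra code: typing the consumer on the {{Message}} subclass removes the casts, since a server-side handler can only ever be handed a {{Request}} and a client-side one (e.g. {{SimpleClient}}) a {{Response}}.

```java
public class GenericConsumerSketch {
    // Minimal stand-ins whose names mirror the classes in the comment.
    static abstract class Message {}
    static class Request extends Message {}
    static class Response extends Message {}

    /** Consumer parameterised on the expected Message subclass. */
    interface MessageConsumer<M extends Message> {
        void accept(M message);
    }

    /** Handler parameterised the same way, so dispatch needs no casting. */
    static class CQLMessageHandler<M extends Message> {
        private final MessageConsumer<M> consumer;
        CQLMessageHandler(MessageConsumer<M> consumer) { this.consumer = consumer; }
        void dispatch(M message) { consumer.accept(message); }
    }
}
```

A server-side instance would be a {{CQLMessageHandler<Request>}} and a client-side one a {{CQLMessageHandler<Response>}}; handing the wrong message type to either becomes a compile-time error rather than a {{ClassCastException}}.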
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139920#comment-17139920 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 7/10/20, 7:11 PM: --- Pushed an updated branch where the protocol changes are pretty stable. I've been testing this with the [java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the driver and with debug-cql, and everything is working pretty much as expected. Due to the current lack of support in the python driver, I've had to use a modified [dtest branch|https://github.com/beobal/cassandra-dtest/tree/15299] and make a temporary hack to cqlsh, but the tests that are running are pretty much green. Obviously, those are not covering anything from v5 yet, but my primary concern was to make sure there are no regressions for v4 clients. Although I think this is ready for some more eyes on it, there's still a non-trivial amount of work to be done. Items outstanding include: * Comprehensive unit and in-jvm tests - in progress * Metrics * Python driver support (doesn't have to be fully implemented, but a basic level is needed for pytests and cqlsh) * Documentation * Renaming existing classes. There are a number of slightly confusing conflicts in naming now. These should be simple to resolve, mostly just automated renaming, but I've held off doing them for now because they'll make the patch much bigger and probably harder to read. The patch is also not quite as massive as it might appear at first. A large proportion is the revert of CASSANDRA-13304, and the rest is largely additive. I've tried hard not to touch code on the v4 path, even where it could clearly be refactored, to minimize the delta & the risk there. So the patch largely consists of some v5-specific additions, moving a few existing classes/methods around, and changing modifiers on previously private/package-private things.
I'm still actively working on the tests, but I don't think that (nor the renaming) need hold up review any more. |branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]| |dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]| |ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]| was (Author: beobal): Pushed an updated branch where the protocol changes are pretty stable. I've been testing this with the [java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the driver and with debug-cql and everything is working pretty much as expected. Due to current lack of support in the python driver, I've had to use a modified [dtest branch|https://github.com/beobal/cassandra-dtest/tree/15299] and make a temporary hack to cqlsh, but the tests that are running are pretty much green. Obviously, those are not covering anything from v5 now, but my primary concern was to make sure there's no regressions for v4 clients. Although I think this is ready for some more eyes on it, there's still a non-trivial amount of work to be done. Items outstanding include: * Comprehensive unit and in-jvm tests - in progress * Metrics * Python driver support (doesn't have to be fully implemented, but a basic level is needed for pytests and cqlsh) * Documentation * Renaming existing classes. There are a number of slightly confusing conflicts in naming now. These should be simple to resolve, just automated renaming mostly, but I've held off doing them for now because they'll make the patch much bigger and probably harder to read. The patch is also not quite a massive at it might appear at first. A large proportion is the revert of CASSANDRA-13304, and the rest is largely additive. I've tried hard not touch code on the v4 path, even where it could clearly be refactored, to minimize the delta & the risk there. 
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155676#comment-17155676 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 7/10/20, 7:12 PM: --- [~omichallat] I absolutely take on board all of your arguments. Now the changes are getting into somewhat more decent shape, let's take another look and see if a large-scale rename is absolutely necessary. Maybe it's just Stockholm Syndrome, but I'm finding that the clashes between the {{o.a.c.net}} and {{o.a.c.transport}} classes are less and less of a problem. Perhaps we can live with things as they are after all. Unfortunately, I've been a bit short of time to focus on this since the last update, but I've made some changes to the server configuration to improve testability and added a first meaningful(ish) test. Fleshing out the test coverage should now be a bit easier, which I'll try to do next week. As I mentioned, I've been running end-to-end tests with the [java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the java driver and have bundled a build from there for now. What would be really useful would be some experimental level of support in the python driver, for cqlsh and python dtests. |branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]| |dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]| |ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162062#comment-17162062 ] Alex Petrov edited comment on CASSANDRA-15299 at 7/21/20, 2:29 PM: --- Thank you for the changes. I'm mostly done with a second pass of the review. First, I wanted to bring up a couple of important things: * Current memory management is rather hard to follow, and maybe we should make an effort to document and/or codify it somewhere. I've spent about a day following all paths and figuring out the ins and outs of it. One thing that I've found is that in {{FrameSet#finish}}, we're releasing not only the frame that was encoded into the sending buffer, but also the one we flush. This is done in three places with {{writeAndFlush}}. I believe this was leading to the CRC mismatch bug that I was catching earlier (see stacktrace [1] below). What might have been happening is that we were releasing the buffer too early, which would allow it to get recycled. * There were several places in {{SimpleClient}} where we were not accounting for bytes. One is that we were never releasing the {{Frame}} of the response we were getting. I've implemented copying, which is not optimal, but should work for testing purposes. The other is that we weren't giving the bytes back to the limits. They would be acquired via netty processing, so we need to release them eventually. I've added a simple hook to release those. An alternative would be to not use limits at all, with the downside of losing a bit of observability. * [this one was discussed privately and the source of the leak suggested by Sam] Flusher takes care of calling release for the flush item, which releases the source buffer. However, the _response_ frame is released only for small buffers in {{FrameSet}}. Here's a branch with the aforementioned changes.
However, I haven't run CI on it; I'll check if it breaks anything later today: https://github.com/ifesdjeen/cassandra/pull/new/15299-alex {code} [1] org.apache.cassandra.net.Crc$InvalidCrc: Read -854589741, Computed 1432984585 at org.apache.cassandra.transport.CQLMessageHandler.processCorruptFrame(CQLMessageHandler.java:328) at org.apache.cassandra.net.AbstractMessageHandler.process(AbstractMessageHandler.java:217) at org.apache.cassandra.net.FrameDecoder.deliver(FrameDecoder.java:321) at org.apache.cassandra.net.FrameDecoder.channelRead(FrameDecoder.java:285) at org.apache.cassandra.net.FrameDecoder.channelRead(FrameDecoder.java:269) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at {code} Some small nits: * In {{CQLMessageHandler#processOneContainedMessage}}, when we can't acquire capacity, we don't pass the frame further down the line. Should we release the frame in this case, since usually we release the source frame after flush? * {{ReusableBuffer}} is unused. * {{Server}} has a few unused imports and {{eventExecutorGroup}}, which is unused. * I'm not sure if we currently handle releasing corrupted frames. * In {{FrameSet}}, we're currently relying on the fact that we'll be able to go through {{finish}} and release everything successfully. However, this might not always be the case. I couldn't trigger such an error, but it might still be possible. Should we maybe make {{FrameSet}} auto-closeable and make sure we always release buffers in {{finally}}? I've also made a similar change to {{processItem}} which would add the item to {{flushed}} to make sure it's released. That makes the {{flushed}} variable name not quite right though.
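To make the auto-closeable idea above concrete, here's a hedged sketch. The {{Payload}} and {{FrameSet}} names echo the patch, but the bodies are simplified stand-ins (not the real netty-backed implementation): buffers accumulated by the set are released exactly once in {{close()}}, whether or not {{finish()}} completed normally.

```java
import java.util.ArrayList;
import java.util.List;

public class FrameSetSketch {
    // Stand-in for a refcounted netty buffer; illustrative only.
    static final class Payload {
        private int refCount = 1;
        boolean released() { return refCount == 0; }
        void release() {
            if (refCount == 0) throw new IllegalStateException("double release");
            refCount--;
        }
    }

    // A FrameSet-like accumulator made AutoCloseable: close() releases any
    // payload that finish() didn't get to, so a throw can't leak buffers.
    static final class FrameSet implements AutoCloseable {
        private final List<Payload> payloads = new ArrayList<>();
        void add(Payload p) { payloads.add(p); }

        void finish() {
            // encoding/flushing would happen here and may throw part-way through
        }

        @Override
        public void close() {
            for (Payload p : payloads)
                if (!p.released())
                    p.release();
            payloads.clear();
        }
    }

    public static void main(String[] args) {
        Payload a = new Payload(), b = new Payload();
        try (FrameSet frames = new FrameSet()) {
            frames.add(a);
            frames.add(b);
            frames.finish();
        }
        // both payloads are released even if finish() had thrown
        System.out.println(a.released() && b.released()); // prints true
    }
}
```

The try-with-resources shape is what makes the "release in {{finally}}" guarantee hard to forget at call sites.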
And some improvements that were made to SimpleClient: * it now supports timeouts * supports sending multiple messages in a single frame when using v5 * simpler pipelines * reuses code for frame processing. There's one more issue with the driver, which is easy to reproduce (with the attached jar and test); here's a stack trace: {code} DEBUG [cluster1-nio-worker-1] 2020-07-21 11:54:20,474 Connection.java:1396 - Connection[/127.0.0.1:64050-2, inFlight=2, closed=false] connection error java.lang.AssertionError: null at com.datastax.driver.core.BytesToSegmentDecoder.decode(BytesToSegmentDecoder.java:56) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:332) at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abstra
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047432#comment-17047432 ] Jorge Bay edited comment on CASSANDRA-15299 at 2/28/20 10:27 AM: - Let me know if I can help in any way with this ticket (early review / tests). > CASSANDRA-13304 follow-up: improve checksumming and compression in protocol > v5-beta > --- > > Key: CASSANDRA-15299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15299 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client > Reporter: Aleksey Yeschenko > Assignee: Aleksey Yeschenko > Priority: Normal > Labels: protocolv5 > Fix For: 4.0-beta > > > CASSANDRA-13304 made an important improvement to our native protocol: it > introduced checksumming/CRC32 to request and response bodies. It’s an > important step forward, but it doesn’t cover the entire stream. In > particular, the message header is not covered by a checksum or a crc, which > poses a correctness issue if, for example, {{streamId}} gets corrupted. > Additionally, we aren’t quite using CRC32 correctly, in two ways: > 1. We are calculating the CRC32 of the *decompressed* value instead of > computing the CRC32 on the bytes written on the wire - losing the properties > of the CRC32. In some cases, due to this sequencing, attempting to decompress > a corrupt stream can cause a segfault by LZ4. > 2. When using CRC32, the CRC32 value is written in the incorrect byte order, > also losing some of the protections. > See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for > explanation for the two points above. > Separately, there are some long-standing issues with the protocol - since > *way* before CASSANDRA-13304. Importantly, both checksumming and compression > operate on individual message bodies rather than frames of multiple complete > messages. 
In reality, this has several important additional downsides. To > name a couple: > # For compression, we are getting poor compression ratios for smaller > messages - when operating on tiny sequences of bytes. In reality, for most > small requests and responses we are discarding the compressed value as it’d > be smaller than the uncompressed one - incurring both redundant allocations > and compressions. > # For checksumming and CRC32 we pay a high overhead price for small messages. > 4 bytes extra is *a lot* for an empty write response, for example. > To address the correctness issue of {{streamId}} not being covered by the > checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we > should switch to a framing protocol with multiple messages in a single frame. > I suggest we reuse the framing protocol recently implemented for internode > messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, > and that we do it before native protocol v5 graduates from beta. See > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java > and > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.
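The two CRC fixes called out in the description above - checksum the bytes actually written to the wire, and write the checksum in a deliberately chosen byte order - can be sketched as follows. This is a simplified illustration, not the patch's code; which byte order is correct is defined by the framing spec, and little-endian is only shown here as an example of fixing one explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class WireCrcSketch {
    // Compute the CRC over the exact bytes that go on the wire (here a
    // stand-in for the compressed payload), never over the decompressed input.
    static int crcOfWireBytes(byte[] wireBytes) {
        CRC32 crc = new CRC32();
        crc.update(wireBytes, 0, wireBytes.length);
        return (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] onWire = "abc".getBytes(StandardCharsets.US_ASCII);
        int crc = crcOfWireBytes(onWire);
        // Write the trailer in an explicit byte order rather than relying on
        // the platform/buffer default order.
        ByteBuffer trailer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        trailer.putInt(crc);
        System.out.printf("crc=%08x%n", crc); // prints crc=352441c2 for "abc"
    }
}
```

Checksumming the on-wire bytes means corruption is detected before the payload is handed to the LZ4 decompressor, avoiding the segfault scenario mentioned above.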
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198465#comment-17198465 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 9/28/20, 11:33 AM: Sorry it's been a while without any visible movement here, but I've just pushed some more commits to address the latest comments from [~ifesdjeen] and [~omichallat]. I've added some tests for protocol negotiation, correct handling of corrupt messages and resource management. {quote} * In CQLMessageHandler#processOneContainedMessage, when we can't acquire capacity, we're not passing the frame further down the line. Should we release the frame in this case, since usually we're releasing the source frame after flush?{quote} Done, though we only need to do this when {{throwOnOverload == true}}, as otherwise we process the in-flight request before applying backpressure. {quote} * ReusableBuffer is unused.{quote} Ah yes, removed. {quote} * Server has a few unused imports and eventExecutorGroup which is unused.{quote} Cleaned up the imports and removed eventExecutorGroup. {quote} * I'm not sure if we currently handle releasing corrupted frames.{quote} For self-contained frames, there's nothing to do here as no resources have been acquired before the corruption is detected, hence {{CorruptFrame::release}} is a no-op. For frames which are part of a large message, there may be some resource allocated before we discover corruption. This is ok though, as we consume the frame, supplying it to the large message state machine, which handles releasing the buffers of the previous frames (if any). I've added a test for this scenario which includes a check that everything allocated has been freed. {quote}Should we maybe make FrameSet auto-closeable and make sure we always release buffers in finally? I've also made a similar change to processItem which would add the item to flushed to make sure it's released. 
That makes the flushed variable name not quite right though. {quote} I've pulled in some of your changes to {{processItem}} as they remove some duplication around polling the queue. I've removed the condition in the {{finally}} of {{ImmediateFlusher}} though, since if we throw from {{processQueue}} then {{doneWork}} will be false anyway, but there may have been some items processed and waiting to flush. The trade-off is calling {{flushWrittenChannels}} even if there's no work to do, but that seems both cheap and unlikely; what do you think? As far as making {{FrameSet}} auto-closeable goes, I don't think that's feasible, given how they are created and accessed. I've tried to address one of your comments re: the memory management here by adding some comments. They're probably not yet enough, but let me know if they are helpful at all. {quote}We can (and probably should) open a separate ticket that could aim at performance improvements around native protocol. {quote} Agreed, I'd like to do some further perf testing, but the results from your initial tests make a follow-up ticket seem a reasonable option. {quote}I've noticed an issue when the client starts protocol negotiation with an unsupported version. {quote} Fixed, thanks. [~ifesdjeen], I haven't pulled in your burn test or changes to {{SimpleClient}} yet; I'll try to do that next week. I also haven't done any automated renaming yet; I'll hold off on that so as not to add to the cognitive burden until we're pretty much done with review. ||branch||CI|| |[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|
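The flush-loop shape described above - unconditionally flushing in {{finally}} so that items written before an exception still reach the wire - can be sketched like this. Names such as {{processQueue}} and {{flushWrittenChannels}} echo the discussion, but this is a simplified stand-in, not the actual Flusher.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class FlusherSketch {
    static int flushes = 0;
    static final Queue<String> queue = new ArrayDeque<>();

    static void processQueue() {
        String item;
        while ((item = queue.poll()) != null) {
            if (item.equals("poison")) throw new RuntimeException("encode failed");
            // write item to its channel here
        }
    }

    static void flushWrittenChannels() { flushes++; }

    static void run() {
        try {
            processQueue();
        } finally {
            // Unconditional flush: cheap when nothing was written, and it
            // guarantees partially processed work isn't left sitting in buffers
            // when processQueue() throws part-way through.
            flushWrittenChannels();
        }
    }

    public static void main(String[] args) {
        queue.add("msg1");
        queue.add("poison");
        try { run(); } catch (RuntimeException ignored) {}
        System.out.println(flushes); // prints 1: flushed despite the failure
    }
}
```

The trade-off Sam mentions is visible here: a no-op iteration still pays one {{flushWrittenChannels}} call, in exchange for never skipping a flush after a partial failure.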
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211482#comment-17211482 ] Caleb Rackliffe edited comment on CASSANDRA-15299 at 10/10/20, 12:27 AM: - [~samt] Here are the detailed/tactical points from my first pass at review. Some of these items, especially the testing bits, are things I could potentially branch and work on a bit myself, so let me know if you'd like a hand. I still want to make one more top-down pass at things Monday now that the smaller details are clearer. *Correctness* - {{AbstractMessageHandler}} -- Are we incrementing {{receivedCount}} both on the first frame of a large message (in {{processFirstFrameOfLargeMessage()}}) and on the last frame (in {{processSubsequentFrameOfLargeMessage()}})? -- {{forceOverAllocation()}} only allocates enough to hit the limit, but is it possible for the subsequent release to de-allocate the full frame size? - {{Flusher}} uses a {{PayloadAllocator}} from either the CRC or LZ4 encoder. CRC includes the header and trailer size when it hits the {{BufferPool}}, while LZ4 uses the {{Payload}} constructor that doesn't take those into account. Is this correct? - {{PostV5ExceptionHandler}} - in {{exceptionCaught()}}, does {{payload.release()}} need to be in a {{finally}} block? - {{CQLMessageHandler}} - should {{processOneContainedMessage()}} release the frame in a {{finally}} block? - It isn't really a correctness issue per se, but {{FrameSet}} would feel safer if we used composition rather than inheritance. (Was the motivator not creating the additional object?) *Testing* - {{CQLTester}} - It might be a little awkward to follow the semantics of "require" -> "prepare" -> "initialize". Maybe we should just inline {{prepareNetwork()}}. - The large frame accumulation state machine might be a good unit test target. 
Lowest level might be a test-only subclass of {{AbstractMessageHandler.LargeMessage}} that simulates the cases we're interested in. Another would be to go one level up and test the {{CQLMessageHandler.LargeMessage}}, but that pulls all of {{CQLMessageHandler}} in, as it's not a static class. (We could make it static, create a little interface to abstract {{CQLMessageHandler}} away, etc.) - It might be worth pulling up {{WaitQueue}} (which is already static) as a top-level class, then throw a few unit-level tests around it. - {{StartupMessage}} - There's enough logic that it might be nice to unit test execute(). (Mocking {{Connection}} and checking replays, etc.) *Documentation* - Should we describe {{native_transport_receive_queue_capacity_in_bytes}} in {{cassandra.yaml}} and mention it in the Jira docs section? - Is the plan to just merge the {{cql_protocol_V5_framing.asc}} content into {{native_protocol_v5.spec}} when things solidify? - {{cql_protocol_V5_framing.asc}} - worth mentioning somewhere there (or if this ends up in the official V5 spec) that non-self-contained frames still only hold part of one message and can't contain the end of one large message and the start of another one - {{OutboundMessageQueue}} - The race from CASSANDRA-15958 is fixed? (Noticed the comment was removed...) - {{InboundMessageHandler}} - The class-level JavaDoc is now almost identical to {{AbstractMessageHandler}}, so we might want to narrow it down to the places where the former specializes. (ex. only it uses {{ProcessMessage}}.) The same thing goes for the {{LargeMessage}} abstract class and its two sub-classes. - {{CQLMessageHandler}} - could use brief class-level JavaDoc on how it extends {{AbstractMessageHandler}} - {{transport.Frame}} - brief class-level JavaDoc might help clarify its scope/usage (even if it's renamed) - Would class-level JavaDoc for {{transport.Dispatcher}} and {{transport.Flusher}} be helpful? (ex. how do they interact with each other?) 
- {{Payload#finish()}} - It might be useful to have a bit of detail in the method JavaDoc explaining when we're expected to call it in the {{Payload}} lifecycle. - Do we need a deprecation plan for {{native_transport_frame_block_size_in_kb}} ...or not so much, because it was only added for 4.0 anyway? *Zombies* - {{transport.Frame.Header.BODY_LENGTH_SIZE}} is unused - {{PasswordAuthenticatorTest}} - {{import java.net.InetSocketAddress}} is unused - {{OutboundConnection}} - {{import java.util.stream.Stream}} is unused - {{StressSettings}} - {{import com.datastax.driver.core.Metadata}} and {{import com.google.common.collect.ImmutableMap}} are unused - {{FrameCompressor.LZ4Compressor}} - {{compress()}} doesn't need the {{throws}} clause, and {{outputOffset + 0}} could be simplified - {{ProtocolNegotiationTest}} - can remove the dead {{throws}} clauses from a few tests - {{ProtocolErrorTest}} - {{testErrorMessageWithNullString()}} doesn't need throws clause - {{CQLMessageHandler}} - {{UNKNOWN_S
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220921#comment-17220921 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 10/27/20, 9:31 AM: Here are some numbers from running tlp-stress against 4.0-beta2 vs the 15299 branch ({{70923bae4}}) with protocol v4 & v5. The dataset was sized to be mostly in-memory, reading/writing at {{CL.ONE}}. Each workload ran for 10 minutes, with a 1m warmup, on a 3 node cluster. Both latency and throughput are pretty much on a par, with v5 showing a bit of an improvement where the workload is amenable to compression. Memory usage and GC were pretty much the same too. If anything, during the v5 runs the servers spent less time in GC, but this was so close as not to be significant. There are definitely some places we can improve the performance of v5, and I don't think anything here indicates a serious regression in either v4 or v5 performance. Given the additional integrity checks in v5, I think not regressing meets the perf bar here, at least in the first instance.
h4. 100% read workload, reads a single row at a time.
{code:java}
                  Count      Latency (p99)  1min (req/s)
4.0-beta2 V4      135449878  72.43          224222.86
15299 V4          138424112  68.3           229765.45
15299 V5          137618437  70.35          231602.36
4.0-beta2 V4 LZ4  103348953  66.68          173437.38
15299 V4 LZ4      105114560  68.83          176192.36
15299 V5 LZ4      131833462  70.19          222092.99
{code}
h4. Mixed r/w workload (50/50). K/V with blobs up to 65k
{code:java}
                  Count     Latency (p99)  1min (req/s) | Count     Latency (p99)  1min (req/s)
4.0-beta2 V4      34455009  76.59          56557.78     | 34464217  74.85          56546.40
15299 V4          33368171  74.90          54859.94     | 33361940  67.77          54848.57
15299 V5          32991815  76.00          54780.74     | 33001153  76.72          54856.99
4.0-beta2 V4 LZ4  32152220  83.41          53306.03     | 32147206  83.22          53334.00
15299 V4 LZ4      31158895  71.01          51106.60     | 31153453  72.75          51087.01
15299 V5 LZ4      32634296  75.73          54370.71     | 32644765  76.70          54396.15
{code}
h4. Mixed r/w/d workload (60/30/10).
Wide rows with values up to 200b and slicing on both the selects and deletions.
{code:java}
                  Count     Latency (p99)  1min (req/s) | Count    Latency (p99)  1min (req/s) | Count    Latency (p99)  1min (req/s)
4.0-beta2 V4      18725688  197.47         25377.16     | 9357971  193.86         12687.94     | 3117394  178.44        4221.90
15299 V4          17975636  185.88         24160.78     | 8986125  184.97         12087.13     | 2995562  179.99        4023.88
15299 V5          18429252  192.91         25349.35     | 9223277  188.15         12678.46     | 3073312  184.24        4232.23
4.0-beta2 V4 LZ4  18407719  179.94         25160.16     | 9197664  178.25         12575.03     | 3068134  180.90        4195.38
15299 V4 LZ4      17678994  171.39         24073.09     | 8842952  196.06         12064.18     | 2947344  170.53        4026.37
15299 V5 LZ4      18274085  208.57         25127.02     | 9138491  163.23         12558.28     | 3045264  203.54        4188.88
{code}
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222476#comment-17222476 ] Caleb Rackliffe edited comment on CASSANDRA-15299 at 10/28/20, 10:01 PM: Reading through the overallocation commit, it seems we might be able to factor a little helper out of {{Decoder#extractHeader()}} and {{decodeFrame()}} that decodes and validates the header flags, e.g.:
{noformat}
private EnumSet<Header.Flag> decodeFlags(ProtocolVersion version, int flags)
{
    EnumSet<Header.Flag> decodedFlags = Header.Flag.deserialize(flags);

    if (version.isBeta() && !decodedFlags.contains(Header.Flag.USE_BETA))
        throw new ProtocolException(String.format("Beta version of the protocol used (%s), but USE_BETA flag is unset", version), version);

    return decodedFlags;
}
{noformat}
I've been able to get through the most recent commits, and things look pretty good. Just want to go over {{ClientResourceLimitsTest}} one more time...
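{{Header.Flag.deserialize}} itself isn't shown in the snippet above. As a rough, self-contained sketch of how this kind of bitmask-to-{{EnumSet}} decoding typically works (the enum constants and their bit positions below are illustrative, not necessarily Cassandra's actual wire layout):

```java
import java.util.EnumSet;

public class FlagsSketch
{
    // Illustrative flag set; each constant's ordinal maps to one bit
    // of the flags byte in this sketch.
    enum Flag { COMPRESSED, TRACING, CUSTOM_PAYLOAD, WARNING, USE_BETA }

    // Stand-in for a deserialize(int) helper: collect every flag whose
    // bit is set in the wire-format bitmask into an EnumSet.
    static EnumSet<Flag> deserialize(int flags)
    {
        EnumSet<Flag> set = EnumSet.noneOf(Flag.class);
        for (Flag f : Flag.values())
            if ((flags & (1 << f.ordinal())) != 0)
                set.add(f);
        return set;
    }
}
```

With decoding centralized like this, the version/flag consistency check (e.g. beta version implies {{USE_BETA}}) only has to live in one place.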
> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol > v5-beta > --- > > Key: CASSANDRA-15299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15299 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Aleksey Yeschenko >Assignee: Sam Tunnicliffe >Priority: Normal > Labels: protocolv5 > Fix For: 4.0-alpha > > > CASSANDRA-13304 made an important improvement to our native protocol: it > introduced checksumming/CRC32 to request and response bodies. It’s an > important step forward, but it doesn’t cover the entire stream. In > particular, the message header is not covered by a checksum or a crc, which > poses a correctness issue if, for example, {{streamId}} gets corrupted. > Additionally, we aren’t quite using CRC32 correctly, in two ways: > 1. We are calculating the CRC32 of the *decompressed* value instead of > computing the CRC32 on the bytes written on the wire - losing the properties > of the CRC32. In some cases, due to this sequencing, attempting to decompress > a corrupt stream can cause a segfault by LZ4. > 2. When using CRC32, the CRC32 value is written in the incorrect byte order, > also losing some of the protections. > See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for > explanation for the two points above. > Separately, there are some long-standing issues with the protocol - since > *way* before CASSANDRA-13304. Importantly, both checksumming and compression > operate on individual message bodies rather than frames of multiple complete > messages. 
In reality, this has several important additional downsides. To > name a couple: > # For compression, we are getting poor compression ratios for smaller > messages - when operating on tiny sequences of bytes. In reality, for most > small requests and responses we are discarding the compressed value as it’d > be smaller than the uncompressed one - incurring both redundant allocations > and compressions. > # For checksumming and CRC32 we pay a high overhead price for small messages. > 4 bytes extra is *a lot* for an empty write response, for example. > To address the correctness issue of {{streamId}} not being covered by the > checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we > should switch to a framing protocol with multiple messages in a single frame. > I suggest we reuse the framing protocol recently implemented for internode > messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, > and that we do it before native protocol v5 graduates from beta. See > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java > and > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache
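Point 1 of the quoted description — checksum the bytes actually written to the wire, so corruption is caught *before* decompression — can be illustrated with a minimal, self-contained sketch. Everything here is a stand-in: {{Deflater}}/{{CRC32}} from {{java.util.zip}} substitute for the protocol's actual LZ4 codec, and the {{[length | compressed bytes | CRC32]}} layout is invented for illustration, not Cassandra's real frame format:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

public class CrcOnWireSketch
{
    // Frame a payload: compress first, then checksum the *compressed* bytes,
    // appending the CRC32 trailer in an explicit, fixed byte order.
    public static ByteBuffer frame(byte[] payload)
    {
        Deflater deflater = new Deflater();
        deflater.setInput(payload);
        deflater.finish();
        byte[] compressed = new byte[payload.length + 64]; // generous bound for small payloads
        int len = deflater.deflate(compressed);
        deflater.end();

        CRC32 crc = new CRC32();
        crc.update(compressed, 0, len);                    // CRC of the on-wire bytes

        ByteBuffer out = ByteBuffer.allocate(4 + len + 4).order(ByteOrder.LITTLE_ENDIAN);
        out.putInt(len);                                   // compressed length
        out.put(compressed, 0, len);
        out.putInt((int) crc.getValue());                  // CRC trailer, fixed byte order
        out.flip();
        return out;
    }

    // The receiver verifies the CRC before attempting decompression, so a
    // corrupt stream is rejected instead of being handed to the decompressor.
    public static boolean verify(ByteBuffer in)
    {
        in = in.duplicate().order(ByteOrder.LITTLE_ENDIAN); // duplicate() resets to big-endian
        int len = in.getInt();
        byte[] compressed = new byte[len];
        in.get(compressed);
        CRC32 crc = new CRC32();
        crc.update(compressed, 0, len);
        return in.getInt() == (int) crc.getValue();
    }
}
```

The ordering matters: computing the CRC over the decompressed value (the pre-13304 v5 behaviour criticized above) means a corrupt compressed stream must be decompressed before it can be detected.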
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220921#comment-17220921 ] Sam Tunnicliffe edited comment on CASSANDRA-15299 at 10/29/20, 11:11 AM: Here are some numbers from running tlp-stress against 4.0-beta2 vs the 15299 branch {{70923bae4}} with protocol v4 & v5. The dataset was sized to be mostly in-memory, reading/writing at {{CL.ONE}}. Each workload ran for 10 minutes, with a 1m warmup, on a 3 node cluster. Both latency and throughput are pretty much on a par, with v5 showing a bit of an improvement where the workload is amenable to compression. Memory usage and GC were pretty much the same too. If anything, during the v5 runs the servers spent less time in GC, but the difference was too close to be significant. There are definitely some places we can improve the performance of v5, and I don't think anything here indicates a serious regression in either v4 or v5 performance. Given the additional integrity checks in v5, I think not regressing meets the perf bar here, at least in the first instance.

h4. 100% read workload, reads a single row at a time.
{code:java}
Reads
                  Count      Latency (p99)  1min (req/s)
4.0-beta2 V4      135449878  72.43          224222.86
15299 V4          138424112  68.3           229765.45
15299 V5          137618437  70.35          231602.36
4.0-beta2 V4 LZ4  103348953  66.68          173437.38
15299 V4 LZ4      105114560  68.83          176192.36
15299 V5 LZ4      131833462  70.19          222092.99
{code}

h4. Mixed r/w workload (50/50). K/V with blobs up to 65k
{code:java}
                  Writes                                | Reads
                  Count     Latency (p99)  1min (req/s) | Count     Latency (p99)  1min (req/s)
4.0-beta2 V4      34455009  76.59          56557.78     | 34464217  74.85          56546.40
15299 V4          33368171  74.90          54859.94     | 33361940  67.77          54848.57
15299 V5          32991815  76.00          54780.74     | 33001153  76.72          54856.99
4.0-beta2 V4 LZ4  32152220  83.41          53306.03     | 32147206  83.22          53334.00
15299 V4 LZ4      31158895  71.01          51106.60     | 31153453  72.75          51087.01
15299 V5 LZ4      32634296  75.73          54370.71     | 32644765  76.70          54396.15
{code}

h4.
Mixed r/w/d workload (60/30/10). Wide rows with values up to 200b and slicing on both the selects and deletions.
{code:java}
                  Writes                                | Reads                                | Deletes
                  Count     Latency (p99)  1min (req/s) | Count    Latency (p99)  1min (req/s) | Count    Latency (p99)  1min (req/s)
4.0-beta2 V4      18725688  197.47         25377.16     | 9357971  193.86         12687.94     | 3117394  178.44         4221.90
15299 V4          17975636  185.88         24160.78     | 8986125  184.97         12087.13     | 2995562  179.99         4023.88
15299 V5          18429252  192.91         25349.35     | 9223277  188.15         12678.46     | 3073312  184.24         4232.23
4.0-beta2 V4 LZ4  18407719  179.94         25160.16     | 9197664  178.25         12575.03     | 3068134  180.90         4195.38
15299 V4 LZ4      17678994  171.39         24073.09     | 8842952  196.06         12064.18     | 2947344  170.53         4026.37
15299 V5 LZ4      18274085  208.57         25127.02     | 9138491  163.23         12558.28     | 3045264  203.54         4188.88
{code}
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226048#comment-17226048 ] Alex Petrov edited comment on CASSANDRA-15299 at 11/4/20, 1:36 PM: The patch looks good to me; I have only a few comments:
* In {{CQLMessageHeader#processLargeMessage}}, we seem to be creating a {{LargeMessage}} object just to call {{processCqlFrame()}} over a frame that is assembled from a single byte buffer. Maybe we can call {{processCqlFrame}} directly?
* This is probably something we should address outside this ticket, but still: "corrupt frame recovered" seems to be slightly misleading wording, as does {{Frame#recoverable}}. We cannot recover the frame itself; we can only skip/drop it. Maybe we can rename these, along with the metrics in Messaging, before they become public in 4.0.
* In {{CQLMessageHeader#processCqlFrame}}, we only call {{handleErrorAndRelease}}. However, it may theoretically happen that we fail before we finish {{messageDecoder.decode(channel, frame)}}. Maybe we can do something like [1], to make it consistent with what we do in {{ProtocolDecoder#decode}}?
* In {{CQLMessageHeader$LargeMessage#onComplete}}, we wrap the {{processCqlFrame(assembleFrame())}} call in a try/catch which is identical to the try/catch block in {{processCqlFrame}} itself. Just checking whether this is intended.
* In {{Dispatcher#processRequest}}, we can slightly simplify the code by declaring a {{FlushItem flushItem}} variable outside the try/catch block, assigning the corresponding value in the try or catch, and calling {{flush}} only once.
* During server bootstrap initialization, we're using the deprecated low/high watermark child options; we should probably use {{.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK, new WriteBufferWaterMark(8 * 1024, 32 * 1024))}} instead.
* {{SimpleClientBurnTest#random}} is unused.
* This seems to be test-only, but still might be good to fix: there appears to be a leak in the simple client (see [2]).

If this is helpful, I've also put the nits mentioned above (and a couple of cleanups) in a commit [here|https://github.com/apache/cassandra/commit/76ee6e00f42446d94679e9c9001c81ebfa9418ab]. Not sure if this is helpful either, but I've also put together a v5 flow chart, which might be useful if anyone wants a quick overview of what's going on in v5.

[1]
{code}
protected void processCqlFrame(Frame frame)
{
    M message = null;
    try
    {
        message = messageDecoder.decode(channel, frame);
        dispatcher.accept(channel, message, this::toFlushItem);
    }
    catch (Exception e)
    {
        if (message == null)
            frame.release();
        handleErrorAndRelease(e, frame.header);
    }
}
{code}

[2]
{code}
ERROR [nioEventLoopGroup-2-2] 2020-11-02 14:52:12,101 ResourceLeakDetector.java:320 - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
    io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:363)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
    io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
    org.apache.cassandra.transport.Message.encode(Message.java:360)
    org.apache.cassandra.transport.SimpleClient$InitialHandler$3.write(SimpleClient.java:510)
    io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
    io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
    io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    java.lang.Thread.run(Thread.java:748)
{code}
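The {{Dispatcher#processRequest}} suggestion above — assign a flush item in either branch and call {{flush}} exactly once — can be sketched in isolation. The {{FlushItem}} class and handler wiring below are simplified stand-ins invented for illustration, not Cassandra's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class SingleFlushSketch
{
    // Simplified stand-in for the real FlushItem: just wraps the response.
    static final class FlushItem
    {
        final String response;
        FlushItem(String response) { this.response = response; }
    }

    static final List<FlushItem> flushed = new ArrayList<>();

    static void flush(FlushItem item) { flushed.add(item); }

    // One flush call site, regardless of success or failure: the item is
    // assigned in the try or the catch, and flushed exactly once afterwards.
    static void processRequest(String request, Function<String, String> handler)
    {
        FlushItem item;
        try
        {
            item = new FlushItem(handler.apply(request));
        }
        catch (RuntimeException e)
        {
            item = new FlushItem("ERROR: " + e.getMessage());
        }
        flush(item);
    }
}
```

Having a single {{flush}} call site makes it harder to accidentally double-flush or skip flushing on one error path.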
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231462#comment-17231462 ] Michael Semb Wever edited comment on CASSANDRA-15299 at 11/13/20, 1:48 PM: bq. One thing that's a bit concerning is that the cassandra-test branch of the driver, which is what dtests are currently using, is currently 693 commits behind the master branch.

If we're updating to use a new version of the driver, does that mean the {{cassandra-test}} branch is being sync'd up to master in the process?

{quote}Docker images:
- beobal/cassandra-testing-ubuntu1910-java11:2020
- beobal/cassandra-testing-ubuntu1910-java11-w-dependencies:2020{quote}

Is it time to start deploying these images under [{{apache/}}|https://hub.docker.com/u/apache]? If agreed, I can open an infra ticket to set up deployment of the docker images.

bq. I'll open PRs to cassandra-builds and cassandra-dtest before going any further here.

Go for it! :-)
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231545#comment-17231545 ] Adam Holmberg edited comment on CASSANDRA-15299 at 11/13/20, 2:57 PM: {quote}This is the thing I'm unsure about - the python driver is not an asf project (yet), so it's not really up to me/us whether we update that branch ... TBH, I don't recall the full reasoning for using the cassandra-test branch in the first place ...{quote}

The {{cassandra-test}} branch was created explicitly to allow server-side testing of merged client-impacting features, independent of driver releases. It was born of a time when there were multiple client-impacting changes in flight, so a single commit would not suffice. Normally we would coordinate a merge into that driver branch with a PR for the server, so CI could be tested with it. Updating the branch should not be an issue; I or [~aboudreault] can facilitate.