[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140675#comment-17140675
 ] 

Alex Petrov commented on CASSANDRA-15299:
-----------------------------------------

[~samt] thank you for the patch! I think it's a very good addition to v5 
protocol. This is not a final review, just a "preview" or "first pass", since I 
wanted to break it down a little. I agree that the core of the patch is rather 
self-contained, which has helped reviewing it a lot.

There are several things that I wanted to bring to your attention:
 * {{Flusher.java#encodeIntoFramedResponsePayload}}: in [one 
place|https://github.com/apache/cassandra/pull/646/files#diff-f2777c908750448a1472c430eab01d94R130]
 there's a TODO saying we should make sure to limit a buffer to {{MAX_SIZE - 
1}}. But 
[later|https://github.com/apache/cassandra/pull/646/files#diff-f2777c908750448a1472c430eab01d94R142]
 we allocate without setting the limit.
 * {{Flusher.java#flushLargeMessage}}: in one place, when 
[flushing|https://github.com/apache/cassandra/pull/646/files#diff-f2777c908750448a1472c430eab01d94R161-R162]
 a large message, we don't call {{release}} on the payload.
 * {{ExceptionHandlers$PostV5ExceptionHandler#exceptionCaught}}: when 
[flushing|https://github.com/apache/cassandra/pull/646/files#diff-b962f06a031121f71b18d5ce59cdc51eR111]
 an exception, we don't call {{release}} on the payload.
 * There are several places in {{SimpleClient}} where {{largePayload#release}} 
isn't called.
 * We might want to rethink / optimise memory usage a bit. One thing is that 
we're always allocating buffers of {{FrameEncoder.Payload.MAX_SIZE - 1}}, which 
is 131081 bytes. Unfortunately, this size is rather large (even if we return 
unused bytes quickly). We don't have a memory leak caused by unreleased buffers 
partly because of this size: in buffer pool, we allocate chunks over 128KiB 
directly instead of going through pooled chunks. I realize that we're returning 
all "unused" space when finalizing the buffer (if we ignore or change >128KiB 
limit), but it looks like we can just avoid these allocations alltogether. What 
we could do to mitigate this probelm and avoid pre-allocating memory chunks is 
to collect not {{Channel/Payload}} pairs, but {{Channel/Collection<Frame>}}, 
and allocate a chunk of required size already when we're flushing.
 * If I understand it correctly, {{Flusher#flushWrittenChannels}}, 
[here|https://github.com/apache/cassandra/pull/646/files#diff-f2777c908750448a1472c430eab01d94R221],
 a call to {{payloads.remove()}} might cause 
{{ConcurrentModificationException}} from hashmap.

Other things are mostly discussions/suggestions/nits:
 * As you have mentioned, it would be great to rename {{Frame}} to something 
different, like {{Envelope}}, since right now we have {{FrameDecoder#Frame}} 
and {{Frame$Decoder}} and variable names that correspond with class names, 
which makes it all hard to follow.
 * Looks like {{MessageConsumer}} could be generic, since we cast it to either 
request or response.
 * {{Server#requestPayloadInFlightPerEndpoint}} can be a non-static Server 
member.
 * {{Frame$Decoder}} and other classes that are related to legacy path can be 
extracted to a separate class, since Frame itself is still useful, but classes 
that facilitate legacy encoding/decoding/etc can be extracted.
 * {{Frame#encodeHeaderInto}} seems to be duplicating the logic we have in 
{{Frame$Encoder#encodeHeader}}, should we unify the two? Maybe we can have 
encoding/decoding methods shared for both legacy and new paths, for example, as 
static methods?
 * {{Flusher#flushLargeMessage}} three paths (first/middle frames/last frame) 
cases can be combined to either parametrized method call or a single loop
 * {{finish/writeAndFlush/release}} triplet seems to be used together a lot. 
Should we have a helper method for the three?
 * Same goes for {{allocate}} + {{limit}} calls that often go together.
 * {{Dispatcher#processRequest}}, we don't need to cast {{error}} to 
{{Message.Response}} if we change its type to {{ErrorMessage}}.
 * Should we hide {{flusher.queued.add()}} behind a method to disallow 
accessing queue directly?
 * We can change the code a bit to make {{FlushItemConverter}} instances 
explicit. Right now, we basically have two converters both called 
{{#toFlushItem}} in {{CQLMessageHandler}} and {{LegacyDispatchHandler}}. We 
could have them as inner classes. It's somewhat useful since if you change the 
signature of this method, or stop using it, it'll be hard to find that it is 
actually an implementation of converter.
 * In {{CqlMessageHandler#releaseAfterFlush}}, we can call 
{{sourceFrame#release()}} instead of {{sourceFrame.body.release()}} for 
consistency with other calls
 * Looks like {{CQLMessageHandler#processCorruptFrame}}, initially had an 
intention of handling recovery, but now just throws a CRC exception regardless. 
This does match description, but usage of {{isRecoverable}} seems to be 
redundant here, unless we change semantics of recovery.

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Aleksey Yeschenko
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>              Labels: protocolv5
>             Fix For: 4.0-alpha
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to