[ https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923239#comment-15923239 ]

Michael Kjellman commented on CASSANDRA-13304:
----------------------------------------------

[~aholmber] There is the hypothetical world and the real world. Hypothetically, 
everyone would just use TLS (in fact, if we argue that TLS alone should be 
enough because it's going to be our solution to corrupted bits in the 
transport, then we shouldn't even support unencrypted connections anymore). 
But the fact is that the tooling required to manage certs (rotating, issuing, 
renewing) across all the various teams and players isn't a solved or easy 
problem. Also, if an app team is already encrypting the actual content at the 
app level, re-encrypting those already-encrypted bytes might not always make 
sense.

Additionally, re: the comment that TCP checksums "should be enough": real-life 
experience shows this isn't the case. There is also the case where the 
corruption occurs in the NIC offloading itself -- meaning the TCP checksum 
matches on both ends because it was calculated over the already-corrupted 
bytes. In many NICs this memory isn't protected -- unlike DIMMs with ECC 
protection -- and without checksums at the protocol level we can't detect 
this.

Given we are ultimately a database -- and data loss is really the only thing 
we should ultimately ever be concerned about -- and given this is a very real 
problem: sure, we are going to add some overhead here, but we don't really 
have a choice. Protecting our users' data has to be the top priority. Being 1% 
faster (pulling that number out of the air right now) doesn't matter if we 
corrupt 1% of the data doing so.

Implementing checksumming on the frame body (and not re-architecting the 
actual protocol itself) accomplishes the core goal here: ensuring the data 
itself isn't corrupted in flight. The real downside is that something like a 
flipped bit could cause us to allocate a super huge buffer during 
deserialization -- that's unfortunate, but we would at least fail on the 
deserialization path. Additionally, a bit could flip in one of the flags -- so 
if that got dropped too we could have issues, but those would also ultimately 
cause us to fail in deserialization -- so although inefficient, I think we're 
still "okay". Long term, as I originally mentioned, it might make sense to 
revisit some core aspects of the protocol. For example, in practice does 
anyone actually toggle compression on and off for individual messages on the 
fly? I think every real-world implementation negotiates it at startup and then 
compresses everything. Throwing a Compressor/Decompressor into the Netty 
pipeline that blindly checksums (or compresses) every n bytes would cover the 
entire protocol, but because we've got all this header stuff and state 
changing inside the frame itself depending on flags, it's not something super 
easy to change. And if we do that, there will be a pretty big cost for 
everyone downstream to basically re-write for a changed protocol, versus just 
implementing some chunking and checksum logic gated by a protocol version 
check, which is pretty self-contained in comparison.
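
To illustrate the "blind" pipeline idea, here is a minimal sketch -- purely an 
illustration, not code from the attached patch; BlindChecksumEncoder, its 
chunk framing, and the 32kb constant are all hypothetical:
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.MessageToByteEncoder;

import java.util.zip.CRC32;

// A protocol-agnostic pipeline stage: chunk whatever bytes flow through it
// and append a CRC32 per chunk, with no knowledge of frames or flags.
public class BlindChecksumEncoder extends MessageToByteEncoder<ByteBuf>
{
    private static final int CHUNK_SIZE = 32 * 1024;

    @Override
    protected void encode(ChannelHandlerContext ctx, ByteBuf msg, ByteBuf out)
    {
        CRC32 crc = new CRC32();
        while (msg.isReadable())
        {
            int len = Math.min(CHUNK_SIZE, msg.readableBytes());
            byte[] chunk = new byte[len];
            msg.readBytes(chunk);

            crc.reset();
            crc.update(chunk, 0, len);

            out.writeInt(len);                  // chunk length prefix
            out.writeBytes(chunk);              // chunk payload
            out.writeInt((int) crc.getValue()); // CRC32 of the payload
        }
    }
}
{code}
A matching decoder would read the length prefix back, recompute the CRC32 over 
the payload, and fail the connection on a mismatch.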

The Wikipedia article on CRCs is actually pretty good: 
https://en.wikipedia.org/wiki/Cyclic_redundancy_check. Check out the 
Computation section. From the article:
{quote}
A CRC is called an n-bit CRC when its check value is n bits long. For a given 
n, multiple CRCs are possible, each with a different polynomial. Such a 
polynomial has highest degree n, which means it has n + 1 terms. In other 
words, the polynomial has a length of n + 1; its encoding requires n + 1 bits. 
Note that most polynomial specifications either drop the MSB or LSB, since they 
are always 1. The CRC and associated polynomial typically have a name of the 
form CRC-n-XXX as in the table below.

The simplest error-detection system, the parity bit, is in fact a trivial 1-bit 
CRC: it uses the generator polynomial x + 1 (two terms), and has the name CRC-1.
{quote}
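
For concreteness, computing a CRC-32 check value in Java is just a few lines 
with the JDK's built-in java.util.zip.CRC32 (standard API; the chunk contents 
here are obviously made up):
{code:java}
import java.util.zip.CRC32;

public class Crc32Example
{
    public static void main(String[] args)
    {
        byte[] chunk = "some frame body bytes".getBytes();

        // HotSpot intrinsifies java.util.zip.CRC32 on JDK8+, which is why
        // CRC32 is about as fast as we can get (see the ticket below).
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);

        // getValue() returns the 32-bit check value in the low bits of a long.
        System.out.printf("CRC32 = 0x%08X%n", crc.getValue());
    }
}
{code}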

I *think* I tried to address all of the points in your comment... let me know 
if I missed some by accident...

> Add checksumming to the native protocol
> ---------------------------------------
>
>                 Key: CASSANDRA-13304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13304
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Michael Kjellman
>            Assignee: Michael Kjellman
>              Labels: client-impacting
>         Attachments: 13304_v1.diff
>
>
> The native binary transport implementation doesn't include checksums. This 
> makes it highly susceptible to silently inserting corrupted data either due 
> to hardware issues causing bit flips on the sender/client side, C*/receiver 
> side, or network in between.
> Attaching an implementation that makes checksum'ing mandatory (assuming both 
> client and server know about a protocol version that supports checksums) -- 
> and also adds checksumming to clients that request compression.
> The serialized format looks something like this:
> {noformat}
>  *                      1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  |     Compressed Length (e1)    /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |    Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|    Compressed Bytes (e1)    +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                      CRC32 Checksum (e1)                     ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                    Compressed Length (e2)                     |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                   Uncompressed Length (e2)                    |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                CRC32 Checksum of Lengths (e2)                 |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                     Compressed Bytes (e2)                   +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                      CRC32 Checksum (e2)                     ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                    Compressed Length (en)                     |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                   Uncompressed Length (en)                    |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                CRC32 Checksum of Lengths (en)                 |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                      Compressed Bytes (en)                  +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |                      CRC32 Checksum (en)                     ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
> {noformat}
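> A minimal sketch of writing a single chunk in this layout (my own 
> illustration of the diagram above, not code from the attached patch -- the 
> ChunkWriter class and the writeChunk signature are hypothetical):
> {code:java}
> import java.nio.ByteBuffer;
> import java.util.zip.CRC32;
>
> final class ChunkWriter
> {
>     // One chunk (e2..en above): compressed length, uncompressed length,
>     // a CRC32 over the two lengths, the compressed bytes, and finally a
>     // CRC32 over the compressed bytes.
>     static void writeChunk(ByteBuffer out, byte[] compressed, int uncompressedLength)
>     {
>         out.putInt(compressed.length);      // Compressed Length
>         out.putInt(uncompressedLength);     // Uncompressed Length
>
>         byte[] lengths = ByteBuffer.allocate(8)
>                                    .putInt(compressed.length)
>                                    .putInt(uncompressedLength)
>                                    .array();
>         CRC32 crc = new CRC32();
>         crc.update(lengths, 0, lengths.length);
>         out.putInt((int) crc.getValue());   // CRC32 Checksum of Lengths
>
>         out.put(compressed);                // Compressed Bytes
>
>         crc.reset();
>         crc.update(compressed, 0, compressed.length);
>         out.putInt((int) crc.getValue());   // CRC32 Checksum (payload)
>     }
> }
> {code}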
> The first pass here adds checksums only to the actual contents of the frame 
> body itself (and doesn't actually checksum lengths and headers). While it 
> would be great to fully add checksumming across the entire protocol, the 
> proposed implementation will ensure we at least catch corrupted data and 
> likely protect ourselves pretty well anyways.
> I didn't go to the trouble of implementing a checksum'ed Snappy compressor, 
> as Snappy has been deprecated for a while -- it's really slow and crappy 
> compared to LZ4 -- and we should do everything in our power to make sure no 
> one in the community is still using it. I left it in (for obvious 
> backwards-compatibility reasons) for old clients that don't know about the 
> new protocol.
> The current protocol has a 256MB (max) frame body -- where the serialized 
> contents are simply written in to the frame body.
> If the client sends a compression option in the STARTUP message, we install 
> a FrameCompressor inline. Unfortunately, we made the decision to treat the 
> frame body separately from the header bits etc. in a given message. So 
> instead we put a compressor implementation in the options, and if it's not 
> null we push the serialized bytes for the frame body *only* through the 
> given FrameCompressor implementation. The existing implementations simply 
> provide all the bytes for the frame body in one go to the compressor and 
> then serialize it with the length of the compressed bytes up front.
> Unfortunately, this won't work for checksum'ing for obvious reasons as we 
> can't naively just checksum the entire (potentially) 256MB frame body and 
> slap it at the end... so,
> The best place to start with the changes is {{ChecksumedCompressor}}. I 
> implemented the checksumming, and the chunking logic required to support it, 
> in one single place. Implementations of ChecksumedCompressor only implement 
> the actual calls to the given compression algorithm for the provided bytes.
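> To give a feel for the shape of this, a hypothetical sketch (the actual 
> names and signatures in the attached patch may differ):
> {code:java}
> import java.util.zip.Checksum;
>
> // The base class owns the chunking and checksumming; concrete subclasses
> // (e.g. an LZ4-backed one) only wrap the compression library calls.
> public abstract class ChecksumedCompressor
> {
>     protected final Checksum checksum;
>
>     protected ChecksumedCompressor(Checksum checksum)
>     {
>         this.checksum = checksum;
>     }
>
>     protected abstract byte[] compressChunk(byte[] input);
>
>     protected abstract byte[] decompressChunk(byte[] input, int uncompressedLength);
> }
> {code}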
> Although the interface takes a {{Checksum}}, the attached patch uses CRC32 
> everywhere. Given that JDK8+ can do the calculation with Intel hardware 
> instructions, CRC32 is about as fast as we can get right now.
> I went with a 32kb "default" for the chunk size -- meaning we will chunk the 
> entire frame body into 32kb chunks, compress each of those chunks, and 
> checksum each chunk. After discussing with a bunch of people and researching 
> how checksums actually work and how much data they protect: if we use 32kb 
> chunks with CRC32, we can catch up to 32 bits flipped in a row (but more 
> importantly catch the far more likely corruption where a single bit is 
> flipped) with pretty high certainty. 64kb seems to introduce too high a 
> probability of missing corruption.
> The maximum block size LZ4 operates on is a 64kb chunk -- so this, combined 
> with the need to make sure the CRC32 checksums will actually catch things, 
> made chunking at 32kb seem like a reasonable value when weighing both 
> checksums and compression (to ensure we don't kill our compression ratio 
> etc).
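> As a back-of-the-envelope check on the overhead (my arithmetic, not from the 
> patch -- per the layout above, each chunk carries 16 bytes of metadata):
> {code:java}
> public class OverheadMath
> {
>     public static void main(String[] args)
>     {
>         long maxFrame = 256L * 1024 * 1024; // 256MB max frame body
>         int chunkSize = 32 * 1024;          // 32kb chunks
>         long chunks = maxFrame / chunkSize; // 8192 chunks
>         // 4 (compressed len) + 4 (uncompressed len) + 4 (CRC of lengths)
>         // + 4 (CRC of payload) = 16 bytes of metadata per chunk
>         long overhead = chunks * 16;        // 131072 bytes, ~0.05% of the frame
>         System.out.println(chunks + " chunks, " + overhead + " bytes of overhead");
>     }
> }
> {code}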
> I'm not including client changes here -- I asked around and I'm not really 
> sure what the policy there is -- do we update the python driver? java driver? 
> how has the timing of this stuff been handled in the past?


