[ 
https://issues.apache.org/jira/browse/CASSANDRA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867950#comment-15867950
 ] 

Jason Brown commented on CASSANDRA-13126:
-----------------------------------------

I agree with [~tvdw]'s assessment here: once part of the data stream is lost, 
you're pretty much screwed. It's possible there's some route to recovery, but 
that probably includes some degree of luck and fortuitous timing. I think the 
simplest solution is to close the channel/socket, as I suspect error recovery 
code might be tricky and there may be security holes in that (I am not a 
security expert so I may be wrong).

bq. Wouldn't frequently reconnecting clients possibly cause more memory 
pressure in this case and further escalate the issue?

Quite possibly, although {{ConnectionLimitHandler}} might be able to help, but 
even that will have some costs before it executes in a channel.

> native transport protocol corruption when using SSL
> ---------------------------------------------------
>
>                 Key: CASSANDRA-13126
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13126
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Tom van der Woerdt
>            Priority: Critical
>
> This is a series of conditions that can result in client connections becoming 
> unusable.
> 1) Cassandra GC must be well-tuned, to have short GC pauses every minute or so
> 2) *client* SSL must be enabled and transmitting a significant amount of data
> 3) Cassandra must run with the default library versions
> 4) disableexplicitgc must be set (this is the default in the current 
> cassandra-env.sh)
> This ticket relates to CASSANDRA-13114 which is a possible workaround (but 
> not a fix) for the SSL requirement to trigger this bug.
> * Netty allocates nio.ByteBuffers for every outgoing SSL message.
> * ByteBuffers consist of two parts, the jvm object and the off-heap object. 
> The jvm object is small and goes with regular GC cycles, the off-heap object 
> gets freed only when the small jvm object is freed. To avoid exploding the 
> native memory use, the jvm defaults to limiting its allocation to the max 
> heap size. Allocating beyond that limit triggers a System.gc(), a retry, and 
> potentially an exception.
> * System.gc is a no-op under disableexplicitgc
> * This means ByteBuffers are likely to throw an exception when too many 
> objects are being allocated
> * The netty version shipped in Cassandra is broken when using SSL (see 
> CASSANDRA-13114) and causes significantly too many bytebuffers to be 
> allocated.
> This gets more complicated though.
> When /some/ clients use SSL, and others don't, the clients not using SSL can 
> still be affected by this bug, as bytebuffer starvation caused by ssl will 
> leak to other users.
> ByteBuffers are used very early on in the native protocol as well. Before 
> even being able to decode the network protocol, this error can be thrown :
> {noformat}
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct 
> buffer memory
> {noformat}
> Note that this comes back with stream_id 0, so clients end up waiting for the 
> client timeout before the query is considered failed and retried.
> A few frames later on the same connection, this appears:
> {noformat}
> Provided frame does not appear to be Snappy compressed
> {noformat}
> And after that everything errors out with:
> {noformat}
> Invalid or unsupported protocol version (54); the lowest supported version is 
> 3 and the greatest is 4
> {noformat}
> So this bug ultimately affects the binary protocol and the connection becomes 
> useless if not downright dangerous.
> I think there are several things that need to be done here.
> * CASSANDRA-13114 should be fixed (easy, and probably needs to land in 3.0.11 
> anyway)
> * Connections should be closed after a DecoderException
> * DisableExplicitGC should be removed from the default JVM arguments
> Any of these three would limit the impact to clients.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to