[ https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831069#comment-16831069 ]

Benedict commented on CASSANDRA-15013:
--------------------------------------

Hi Sumanth,

Thanks for the updated patch. This is not a full review, just some initial 
feedback.

First, I think we could do with revisiting the naming of a couple of things. 
The config parameters should probably be prefixed with native_transport for 
consistency, and the connection parameter should be shorter (since we encode 
every byte) and perhaps convey the intent - maybe THROW_ON_OVERLOAD?

Secondly, the patch looks to have some data races when updating shared state. 
It might be sensible to reuse what we have already produced for this in 
CASSANDRA-15066 - if you look in [this 
branch|https://github.com/belliottsmith/cassandra/tree/messaging-improvements] 
you will find a class called {{ResourceLimits}} that we use to impose 
per-endpoint and global limits, which is exactly what you’re doing here. 
There’s not much sense in duplicating the work, so perhaps you can copy the 
class and use it in this patch; it shouldn’t change before we commit.
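For reference, the pattern such a class implements is a CAS-loop allocation against a fixed limit, so concurrent allocators can never jointly exceed it. The sketch below is illustrative only - the class name and method names are assumptions, not the actual {{ResourceLimits}} API from the messaging-improvements branch:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical minimal sketch of a concurrent resource limit.
final class ConcurrentLimit
{
    private final long limit;                         // maximum bytes permitted in flight
    private final AtomicLong using = new AtomicLong(); // bytes currently allocated

    ConcurrentLimit(long limit) { this.limit = limit; }

    // CAS loop: succeed only if the allocation stays within the limit,
    // so racing threads cannot jointly overshoot it.
    boolean tryAllocate(long amount)
    {
        while (true)
        {
            long current = using.get();
            long next = current + amount;
            if (next > limit)
                return false;
            if (using.compareAndSet(current, next))
                return true;
        }
    }

    void release(long amount)
    {
        using.addAndGet(-amount);
    }

    long using() { return using.get(); }
}
```

The same object can back both the per-endpoint and the global limit; an acquisition succeeds only if both {{tryAllocate}} calls succeed, releasing the first if the second fails.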

Similarly, the accesses of {{requestPayloadInFlightPerEndpoint}} need to be 
made atomic. In {{initChannel}} this means grabbing the result of 
{{computeIfAbsent}} and calling {{tryRef}} on it - if that fails, we need to 
immediately remove the object we fetched from the map and try again (not just 
loop, else 
we may block on another thread removing it). In {{tidy}} we need to remove the 
(key, value) pair to ensure we do not remove a newer object that has replaced 
us. Similarly for invocations of {{containsKey}}, we need to instead invoke 
{{get}} and check the result is not null.
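Putting those three rules together, the acquire/release cycle looks roughly like the sketch below. All names here ({{EndpointPayloadTracker}}, {{acquire}}, {{release}}) are illustrative stand-ins, not the patch's actual API; the point is the {{tryRef}} retry and the exact-pair {{remove(key, value)}}:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-endpoint tracker with reference counting.
final class EndpointPayloadTracker
{
    // >= 0 means live with that many references; -1 means tidied (dead).
    private final AtomicInteger refCount = new AtomicInteger(0);

    boolean tryRef()
    {
        while (true)
        {
            int count = refCount.get();
            if (count < 0)
                return false;                         // already tidied
            if (refCount.compareAndSet(count, count + 1))
                return true;
        }
    }

    void release(ConcurrentHashMap<String, EndpointPayloadTracker> map, String key)
    {
        // Mark dead only if no concurrent tryRef revived us, then remove the
        // exact (key, value) pair so we never evict a newer replacement.
        if (refCount.decrementAndGet() == 0 && refCount.compareAndSet(0, -1))
            map.remove(key, this);
    }

    static EndpointPayloadTracker acquire(ConcurrentHashMap<String, EndpointPayloadTracker> map, String key)
    {
        while (true)
        {
            EndpointPayloadTracker tracker = map.computeIfAbsent(key, k -> new EndpointPayloadTracker());
            if (tracker.tryRef())
                return tracker;
            // A concurrent tidy won the race: evict the dead tracker ourselves
            // rather than spinning until the other thread removes it.
            map.remove(key, tracker);
        }
    }
}
```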

In general, in concurrent operation we need to access the shared state once up 
front, and only refresh it when necessary, as it can change underneath us at 
any time.
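As a concrete illustration of why the check-then-act style is unsafe, compare the two accessors below (a self-contained toy example, not code from the patch):

```java
import java.util.concurrent.ConcurrentHashMap;

final class SharedStateAccess
{
    // Racy: the entry can be removed between containsKey and get,
    // leaving a null we never checked for.
    static int racyLength(ConcurrentHashMap<String, String> map, String key)
    {
        if (map.containsKey(key))
            return map.get(key).length();   // may NPE under concurrency
        return -1;
    }

    // Correct: read the shared state once, then work with that snapshot.
    static int safeLength(ConcurrentHashMap<String, String> map, String key)
    {
        String value = map.get(key);
        return value != null ? value.length() : -1;
    }
}
```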

Finally, we should remove the task queue length limit from {{requestExecutor}} 
else this patch won’t actually stop us blocking the event loop.
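The reason is that a bounded work queue makes submission from the Netty event loop block (or reject) once the queue fills, so the byte-based limits above must become the sole backpressure mechanism. A hypothetical construction, not Cassandra's actual executor wiring:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class RequestExecutorSketch
{
    // An unbounded LinkedBlockingQueue keeps task submission non-blocking;
    // admission is instead controlled by the in-flight payload limits.
    static ExecutorService unboundedQueueExecutor(int threads)
    {
        return new ThreadPoolExecutor(threads, threads,
                                      60L, TimeUnit.SECONDS,
                                      new LinkedBlockingQueue<>()); // no capacity cap
    }
}
```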

I look forward to giving the final patch a full review in the near future. 

> Message Flusher queue can grow unbounded, potentially running JVM out of 
> memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, 
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap 
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue 
> bounded: in the current state, items are added to the queue without any check 
> on the queue size or on the Netty outbound buffer's isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
