[jira] [Comment Edited] (CASSANDRA-15013) Message Flusher queue can grow unbounded, potentially running JVM out of memory

Sumanth Pasupuleti (JIRA) Sun, 19 May 2019 00:20:44 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843339#comment-16843339
 ]


Sumanth Pasupuleti edited comment on CASSANDRA-15013 at 5/19/19 7:19 AM:
-------------------------------------------------------------------------

Thanks for the additional feedback. I've made following further changes to the 
[patch|https://github.com/sumanth-pasupuleti/cassandra/commit/98126f5d887228f5e88eca66f007873b52a0aacf]:
* Removed {{requestsProcessed}} metric.
* Added {{BackpressureDeployed}} metric to indicate how many times server 
attempted to apply backpressure.
* {{EndpointPayloadTracker}} manages {{requestPayloadInFlightPerEndpoint}}, and 
an external consumer of {{EndpointPayloadTracker}} calls a static {{get}} that 
internally does {{tryRef}}
* {{EndpointAndGlobal.release}} returns {{ABOVE_LIMIT}} or {{BELOW_LIMIT}} 
Outcome.

[Passing 
Tests|https://circleci.com/workflow-run/561b655c-c3d8-4ea6-9997-ed24754ad133]

{{EndpointPayloadTracker}} is the only accessor of 
{{globalRequestPayloadInFlight}}, however, it did not seem to make semantic 
sense of moving {{globalRequestPayloadInFlight}} as a member of 
{{EndpointPayloadTracker}}.

Regarding starting/stopping all channels for an endpoint at once, I agree it 
would make the system better respect the limits than otherwise letting each 
channel cross once (incase of backpressure mode). However, since each channel's 
connection properties maybe different (THROW_ON_OVERLOAD vs BACKPRESSURE), we 
will then have to check the property and make a decision accordingly. I am 
inclining towards punting this for now. Let me know what you think.

Regarding per-client limit, I see it as a good logical extension to this patch, 
and I would like to tackle it as a separate ticket. Also, as an additional 
ticket, I would like to explore extending this concurrency limitting patch to 
go beyond the current parameter of incoming payload which works well for 
writes, but not necessarily for reads, maybe considering the response payload 
and/or no. of concurrent tasks in flight.


was (Author: sumanth.pasupuleti):
Thanks for the additional feedback. I've made following further changes to the 
[patch|https://github.com/sumanth-pasupuleti/cassandra/commit/98126f5d887228f5e88eca66f007873b52a0aacf]:
* Removed {{requestsProcessed}} metric.
* Added BackpressureDeployed metric to indicate how many times server attempted 
to apply backpressure.
* EndpointPayloadTracker manages requestPayloadInFlightPerEndpoint, and an 
external consumer of EndpointPayloadTracker calls a static get that internally 
does tryRef
* EndpointAndGlobal.release returns ABOVE_LIMIT or BELOW_LIMIT

[Passing 
Tests|https://circleci.com/workflow-run/561b655c-c3d8-4ea6-9997-ed24754ad133]

EndpointPayloadTracker is the only accessor of globalRequestPayloadInFlight, 
however, it did not seem to make semantic sense of moving 
globalRequestPayloadInFlight as a member of EndpointPayloadTracker.

Regarding starting/stopping all channels for an endpoint at once, I agree it 
would make the system better respect the limits than otherwise letting each 
channel cross once (incase of backpressure mode). However, since each channel's 
connection properties maybe different (THROW_ON_OVERLOAD vs BACKPRESSURE), we 
will then have to check the property and make a decision accordingly. I am 
inclining towards punting this for now. Let me know what you think.

Regarding per-client limit, I see it as a good logical extension to this patch, 
and I would like to tackle it as a separate ticket. Also, as an additional 
ticket, I would like to explore extending this concurrency limitting patch to 
go beyond the current parameter of incoming payload which works well for 
writes, but not necessarily for reads, maybe considering the response payload 
and/or no. of concurrent tasks in flight.

> Message Flusher queue can grow unbounded, potentially running JVM out of 
> memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, 
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap 
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue 
> bounded, since, in the current state, items get added to the queue without 
> any checks on queue size, nor with any checks on netty outbound buffer to 
> check the isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-15013) Message Flusher queue can grow unbounded, potentially running JVM out of memory

Reply via email to