[ https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822372#comment-16822372 ]

Sumanth Pasupuleti edited comment on CASSANDRA-15013 at 4/20/19 6:05 AM:
-------------------------------------------------------------------------

Updated patch: [https://github.com/apache/cassandra/pull/313]

Passing UTs and DTests: 
https://circleci.com/workflow-run/31dabaa6-eab8-4f00-a711-f1b210bf7578

Thanks [~benedict]. I learned from your suggestion that the {{Ref}} class is 
useful for getting around the race conditions I was initially worried about 
when evicting an endpoint from the map.
The attached patch evicts endpoints along the lines of your proposal, except 
that I used a new class, {{EndpointPayloadTracker}}, in place of the suggested 
class ({{Dispatcher}}). Mapping {{Dispatcher}} against an endpoint would make 
it a 1:1 Dispatcher per endpoint, whereas it is currently one Dispatcher per 
Channel, and I rely on that association to store the channel-level inflight 
payload, which is then used to turn off backpressure on a channel (one of the 
conditions I check before calling {{setAutoRead(true)}} is that the 
channel-level inflight payload has come down to zero).
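
To make the channel-level bookkeeping concrete, here is a minimal sketch of 
the idea, assuming hypothetical names ({{ChannelPayloadTracker}} and its 
methods are illustrative, not code from the patch): a per-channel counter of 
inflight payload bytes that switches off autoRead when a threshold is 
exceeded and switches it back on only once the channel's inflight payload has 
drained to zero.

{code:java}
// A minimal sketch with hypothetical names: track the inflight request
// payload per channel, pause reads when a limit is exceeded, and turn
// autoRead back on only once the channel-level inflight payload reaches zero.
import java.util.concurrent.atomic.AtomicLong;

import io.netty.channel.Channel;

final class ChannelPayloadTracker
{
    private final Channel channel;
    private final long inflightLimitBytes; // endpoint/global threshold, simplified to one value here
    private final AtomicLong channelInflightBytes = new AtomicLong();

    ChannelPayloadTracker(Channel channel, long inflightLimitBytes)
    {
        this.channel = channel;
        this.inflightLimitBytes = inflightLimitBytes;
    }

    // Called when a request frame is read off the channel.
    void onRequestReceived(long payloadBytes)
    {
        if (channelInflightBytes.addAndGet(payloadBytes) > inflightLimitBytes)
            channel.config().setAutoRead(false); // apply backpressure: stop reading from the socket
    }

    // Called once the response for that request has been flushed.
    void onRequestCompleted(long payloadBytes)
    {
        // one of the conditions for setAutoRead(true): channel-level inflight payload is back to zero
        if (channelInflightBytes.addAndGet(-payloadBytes) == 0 && !channel.config().isAutoRead())
            channel.config().setAutoRead(true);
    }
}
{code}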

A few other changes I have made as part of this updated patch (see the sketch 
after this list):
 * Removed the channel-level threshold, out of concern over having too many 
config knobs (channel level, endpoint level, global level). Now, whenever the 
endpoint/global thresholds are exceeded, backpressure is applied to the 
channel, or an {{OverloadedException}} is thrown.
 * In addition to the memory-based limit, added another tracker and limit 
check based on the number of requests in flight. This guards against the 
situation where too many incoming requests, each with a payload small enough 
to get around the memory limit checks, end up blocking the event loop threads.
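
As a rough illustration of the two limits above, here is a sketch under 
assumed names ({{InflightLimits}}, {{throwOnOverload}} and the thresholds are 
hypothetical, not from the patch) showing how a bytes-in-flight limit and a 
requests-in-flight limit can be combined, with the outcome being either 
backpressure on the channel or an {{OverloadedException}}.

{code:java}
// A minimal sketch with hypothetical names: combine a bytes-in-flight limit
// with a requests-in-flight limit, and decide between applying backpressure
// on the channel and throwing an OverloadedException back to the client.
import java.util.concurrent.atomic.AtomicLong;

final class InflightLimits
{
    enum Action { PROCEED, BACKPRESSURE, THROW_OVERLOADED }

    private final long maxInflightBytes;
    private final long maxInflightRequests;
    private final AtomicLong inflightBytes = new AtomicLong();
    private final AtomicLong inflightRequests = new AtomicLong();

    InflightLimits(long maxInflightBytes, long maxInflightRequests)
    {
        this.maxInflightBytes = maxInflightBytes;
        this.maxInflightRequests = maxInflightRequests;
    }

    // throwOnOverload stands in for a per-connection option deciding whether
    // an overloaded server should fail the request fast or pause reads.
    Action tryAcquire(long payloadBytes, boolean throwOnOverload)
    {
        long bytes = inflightBytes.addAndGet(payloadBytes);
        long requests = inflightRequests.incrementAndGet();
        if (bytes <= maxInflightBytes && requests <= maxInflightRequests)
            return Action.PROCEED;
        if (throwOnOverload)
        {
            release(payloadBytes); // rejected request does not stay counted
            return Action.THROW_OVERLOADED;
        }
        return Action.BACKPRESSURE; // request is still processed; channel stops reading new ones
    }

    // Called when an admitted request completes.
    void release(long payloadBytes)
    {
        inflightBytes.addAndGet(-payloadBytes);
        inflightRequests.decrementAndGet();
    }
}
{code}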


was (Author: sumanth.pasupuleti):
Updated patch: [https://github.com/apache/cassandra/pull/313]

Thanks [~benedict]. I learned from your suggestion that the {{Ref}} class is 
useful for getting around the race conditions I was initially worried about 
when evicting an endpoint from the map.
The attached patch evicts endpoints along the lines of your proposal, except 
that I used a new class, {{EndpointPayloadTracker}}, in place of the suggested 
class ({{Dispatcher}}). Mapping {{Dispatcher}} against an endpoint would make 
it a 1:1 Dispatcher per endpoint, whereas it is currently one Dispatcher per 
Channel, and I rely on that association to store the channel-level inflight 
payload, which is then used to turn off backpressure on a channel (one of the 
conditions I check before calling {{setAutoRead(true)}} is that the 
channel-level inflight payload has come down to zero).

A few other changes I have made as part of this updated patch:
 * Removed the channel-level threshold, out of concern over having too many 
config knobs (channel level, endpoint level, global level). Now, whenever the 
endpoint/global thresholds are exceeded, backpressure is applied to the 
channel, or an {{OverloadedException}} is thrown.
 * In addition to the memory-based limit, added another tracker and limit 
check based on the number of requests in flight. This guards against the 
situation where too many incoming requests, each with a payload small enough 
to get around the memory limit checks, end up blocking the event loop threads.

> Message Flusher queue can grow unbounded, potentially running JVM out of 
> memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, 
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap 
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855 to make the Flusher queue 
> bounded, since, in the current state, items get added to the queue without 
> any check on queue size and without consulting the isWritable state of the 
> netty outbound buffer.
> We are seeing this issue hit our production 3.0 clusters quite often.
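
As context for the fix, here is a minimal sketch of the kind of bound the 
ticket describes as missing, under assumed names ({{BoundedFlushQueue}} is 
hypothetical, not the actual Flusher): before enqueueing an item for 
flushing, consult both the queue size and the netty channel's isWritable 
state.

{code:java}
// A minimal sketch with a hypothetical BoundedFlushQueue (not the actual
// Flusher): refuse to enqueue when the queue is already at capacity or the
// netty outbound buffer reports it is not writable.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

import io.netty.channel.Channel;

final class BoundedFlushQueue<T>
{
    private final Queue<T> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger(); // ConcurrentLinkedQueue.size() is O(n)
    private final int maxQueuedItems;

    BoundedFlushQueue(int maxQueuedItems)
    {
        this.maxQueuedItems = maxQueuedItems;
    }

    // Returns false if the item was rejected; the caller can then apply
    // backpressure or fail the request instead of letting the queue grow unbounded.
    boolean offer(Channel channel, T item)
    {
        if (!channel.isWritable() || size.get() >= maxQueuedItems)
            return false;
        queue.add(item);
        size.incrementAndGet();
        return true;
    }

    T poll()
    {
        T item = queue.poll();
        if (item != null)
            size.decrementAndGet();
        return item;
    }
}
{code}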



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
