[ https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822372#comment-16822372 ]
Sumanth Pasupuleti commented on CASSANDRA-15013:
------------------------------------------------

Updated patch: [https://github.com/apache/cassandra/pull/313]

Thanks [~benedict]. I learnt from your suggestion that the {{Ref}} class is useful for getting around the race conditions I was initially worried about when evicting an endpoint from the map.

The attached patch evicts endpoints along the lines of your proposal, except that I used a new class, {{EndpointPayloadTracker}}, in place of the suggested class ({{Dispatcher}}). Mapping a Dispatcher against the endpoint would make it one Dispatcher per endpoint, whereas currently there is one Dispatcher per Channel, and I rely on that association to store the channel-level inflight payload. That value is then used to turn off backpressure on a channel: one of the conditions I check before calling {{setAutoRead(true)}} is that the channel-level inflight payload has come down to zero.

A few other changes I have made as part of this updated patch:
* Removed the channel-level threshold, out of concern about having too many config knobs (channel level, endpoint level, global level). Now, each time the endpoint or global threshold is exceeded, either backpressure is applied to the channel or an {{OverloadedException}} is thrown.
* In addition to the memory-based limit, added another tracker and limit check based on the number of requests in flight. This guards against a situation where there are too many incoming requests with payloads small enough to get around the memory limit checks, but which still end up blocking the event loop threads.
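To illustrate the limit checks described above, here is a minimal sketch. It is not the code from the patch: {{InflightLimits}}, {{ChannelSketch}}, {{onRequest}} and {{onResponseFlushed}} are hypothetical names, the channel is a plain stand-in rather than a Netty {{Channel}}, and it assumes that each channel is only touched from its own event loop thread while the endpoint and global counters are shared.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: names and limits are illustrative, not taken from the patch.
// One instance would be shared per endpoint, and one more globally.
final class InflightLimits {
    final long maxBytes;     // memory-based limit
    final long maxRequests;  // count-based limit, catches floods of tiny requests
    final AtomicLong bytes = new AtomicLong();
    final AtomicLong requests = new AtomicLong();

    InflightLimits(long maxBytes, long maxRequests) {
        this.maxBytes = maxBytes;
        this.maxRequests = maxRequests;
    }

    // Reserve capacity for one request; undo the reservation and
    // return false if either limit would be exceeded.
    boolean tryAcquire(long payloadBytes) {
        long b = bytes.addAndGet(payloadBytes);
        long r = requests.incrementAndGet();
        if (b > maxBytes || r > maxRequests) {
            release(payloadBytes);
            return false;
        }
        return true;
    }

    void release(long payloadBytes) {
        bytes.addAndGet(-payloadBytes);
        requests.decrementAndGet();
    }
}

// Stand-in for a channel: tracks its own inflight payload and an autoRead flag.
final class ChannelSketch {
    private final InflightLimits endpoint;
    private final InflightLimits global;
    long inflightBytes = 0;
    boolean autoRead = true;

    ChannelSketch(InflightLimits endpoint, InflightLimits global) {
        this.endpoint = endpoint;
        this.global = global;
    }

    // Returns true if the request is admitted. Otherwise reading is paused
    // (a real server could throw an overloaded error here instead).
    boolean onRequest(long payloadBytes) {
        if (!endpoint.tryAcquire(payloadBytes)) {
            autoRead = false;
            return false;
        }
        if (!global.tryAcquire(payloadBytes)) {
            endpoint.release(payloadBytes);
            autoRead = false;
            return false;
        }
        inflightBytes += payloadBytes;
        return true;
    }

    // Called once the response for an admitted request has been flushed.
    void onResponseFlushed(long payloadBytes) {
        endpoint.release(payloadBytes);
        global.release(payloadBytes);
        inflightBytes -= payloadBytes;
        if (inflightBytes == 0) {
            autoRead = true; // channel has drained, resume reading
        }
    }
}
```

In this sketch, draining the channel is the only condition for resuming reads. The comment above describes it as one of several conditions, so a real implementation would likely check more than this. Eviction of idle endpoint trackers (the part that relies on {{Ref}}) is deliberately left out.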
> Message Flusher queue can grow unbounded, potentially running JVM out of memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue
> bounded, since, in the current state, items get added to the queue without
> any checks on queue size, nor with any checks on netty outbound buffer to
> check the isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)