[jira] [Comment Edited] (CASSANDRA-15013) Message Flusher queue can grow unbounded, potentially running JVM out of memory

Benedict (JIRA) Wed, 15 May 2019 04:29:10 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840301#comment-16840301
 ]


Benedict edited comment on CASSANDRA-15013 at 5/15/19 11:28 AM:
----------------------------------------------------------------

I've pushed some minor suggestions 
[here|https://github.com/belliottsmith/cassandra/tree/15013-suggestions] around 
naming:

# Tried to make the native_transport config parameters have more consistent 
naming with prior parameters - feel free to modify them further, if you think 
you can improve them still
# {{forceAllocate}} -> {{allocate}}, which is usually the alternative to 
{{tryAllocate}}
# Shortened THROW_ON_OVERLOAD parameter

There are three remaining bugs, and I've paused review until they can be 
addressed:

# {{this::releaseItem}} is unsafe to provide to the {{Flusher}} constructor, 
since these are unique to a channel, and the {{Flusher}} is per-eventLoop.  If 
we choose to hash all connections on an endpoint to a single eventLoop this 
would be easy to accommodate, or otherwise {{FlushItem}} needs to be the 
implementor of {{release()}}
# I don't think we can use {{Ref}} for management of the 
{{EndpointPayloadTracker}}.  The {{Tidy}} implementation requires a reference 
to the object itself, and anyway logically deleting itself after release 
defeats the point of Ref (which is leak detection).  It's impossible for it to 
detect a leak and cleanup, if the strong reference is cleaned up by this 
process (since there will always be a strong reference until it invokes, and it 
requires there to be no strong references, it will never invoke).  Probably we 
should use a simple AtomicInteger to manage reference counts.  I think it would 
be cleanest to encapsulate the map management inside a static method in 
{{EndpointPayloadTracker}} as well.
# I think we currently have a race condition around the release of a channel 
(and its {{EndpointPayloadTracker}}) and the attempt to release capacity from 
the {{EndpointPayloadTracker}} we have requests in flight for.  Channels can be 
invalidated before we complete requests issued by them, so we must be sure to 
release from the tracker we allocated from, so that we do not wrap into 
negative on release.



was (Author: benedict):
I've pushed some minor suggestions 
[here|https://github.com/belliottsmith/cassandra/tree/15013-suggestions] around 
naming:

# Tried to make the native_transport config parameters have more consistent 
naming with prior parameters - feel free to modify them further, if you think 
you can improve them still
# {{forceAllocate}} -> {{allocate}}, which is usually the alternative to 
{{tryAllocate}}
# Shortened THROW_ON_OVERLOAD parameter

There are three remaining bugs, and I've paused review until they can be 
addressed:

# {{this::releaseItem}} is unsafe to provide to the {{Flusher}} constructor, 
since these are unique to an address, and the {{Flusher}} is per-eventLoop.  If 
we choose to hash all connections on an endpoint to a single eventLoop this 
would be easy to accommodate, or otherwise {{FlushItem}} needs to be the 
implementor of {{release()}}
# I don't think we can use {{Ref}} for management of the 
{{EndpointPayloadTracker}}.  The {{Tidy}} implementation requires a reference 
to the object itself, and anyway logically deleting itself after release 
defeats the point of Ref (which is leak detection).  It's impossible for it to 
detect a leak and cleanup, if the strong reference is cleaned up by this 
process (since there will always be a strong reference until it invokes, and it 
requires there to be no strong references, it will never invoke).  Probably we 
should use a simple AtomicInteger to manage reference counts.  I think it would 
be cleanest to encapsulate the map management inside a static method in 
{{EndpointPayloadTracker}} as well.
# I think we currently have a race condition around the release of a channel 
(and its {{EndpointPayloadTracker}}) and the attempt to release capacity from 
the {{EndpointPayloadTracker}} we have requests in flight for.  Channels can be 
invalidated before we complete requests issued by them, so we must be sure to 
release from the tracker we allocated from, so that we do not wrap into 
negative on release.


> Message Flusher queue can grow unbounded, potentially running JVM out of 
> memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, 
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap 
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue 
> bounded, since, in the current state, items get added to the queue without 
> any checks on queue size, nor with any checks on netty outbound buffer to 
> check the isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-15013) Message Flusher queue can grow unbounded, potentially running JVM out of memory

Reply via email to