[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681689#comment-16681689 ]

Nate McCall edited comment on CASSANDRA-14855 at 11/9/18 4:46 PM:
------------------------------------------------------------------

[~sumanth.pasupuleti] Thanks for the detailed analysis and patch. Was this the 
first time your team has seen this or has it potentially manifested in other 
clusters previously?


was (Author: zznate):
[~sumanth.pasupuleti] Thanks for the detailed analysis. Was this the first time 
your team has seen this or has it potentially manifested in other clusters 
previously?

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sumanth Pasupuleti
>            Priority: Major
>             Fix For: 3.0.17
>
>         Attachments: blocked_thread_pool.png, cpu.png, 
> eventloop_scheduledtasks.png, flusher running state.png, heap.png, 
> heap_dump.png, read_latency.png
>
>
> We recently had a production issue where about 10 nodes in a 96-node cluster 
> ran out of heap. 
> From heap dump analysis, I believe there is enough evidence to indicate that the 
> `queued` data member of the Flusher grew too large, resulting in out of memory.
> Below are specifics on what we found from the heap dump (relevant screenshots 
> attached):
> * a non-empty "queued" data member of the Flusher with a retained heap of 0.5GB, 
> and multiple such instances
> * the "running" data member of the Flusher set to "true"
> * scheduledTasks on the event loop having size 0
> We suspect something (maybe an exception) caused the Flusher's running state to 
> remain true while the Flusher was no longer able to schedule itself with the event loop.
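> The scheduling pattern we believe is involved looks roughly like the sketch below 
> (a simplified illustration for discussion only; class and field names approximate 
> the real Flusher in Message.java, but this is not the actual source):
> {code:java}
> import java.util.Queue;
> import java.util.concurrent.ConcurrentLinkedQueue;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> // Simplified sketch of an event-loop-scheduled flusher (illustrative, not Cassandra source).
> class FlusherSketch implements Runnable
> {
>     private final Queue<Object> queued = new ConcurrentLinkedQueue<>();
>     private final AtomicBoolean running = new AtomicBoolean(false);
>     private final ScheduledExecutorService eventLoop;
>
>     FlusherSketch(ScheduledExecutorService eventLoop)
>     {
>         this.eventLoop = eventLoop;
>     }
>
>     void enqueue(Object item)
>     {
>         queued.add(item);
>         // Only the thread that flips "running" from false to true schedules the flusher.
>         // If "running" is stuck at true with no task on the event loop, nothing ever
>         // drains "queued" again and it grows without bound.
>         if (running.compareAndSet(false, true))
>             eventLoop.execute(this);
>     }
>
>     public void run()
>     {
>         Object item;
>         while ((item = queued.poll()) != null)
>             flush(item); // an exception escaping here skips the reset/reschedule below
>
>         // Normal path: clear the running flag, then re-check the queue and reschedule
>         // if more work arrived. If the loop above throws, "running" stays true forever.
>         running.set(false);
>         if (!queued.isEmpty() && running.compareAndSet(false, true))
>             eventLoop.execute(this);
>     }
>
>     private void flush(Object item)
>     {
>         // write and flush the item to its channel
>     }
> }
> {code}
> This is consistent with the heap dump state above: running == true, large non-empty 
> "queued" instances, and no scheduled tasks on the event loop.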
> We could not find any ERROR in the system.log; the only relevant entries were the 
> following INFO logs around the incident time.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
> io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
>  at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: backport trunk's ImmediateFlusher 
> ([CASSANDRA-13651|https://issues.apache.org/jira/browse/CASSANDRA-13651], 
> https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec) 
> to 3.0.x, and possibly to other versions as well, since ImmediateFlusher appears 
> more robust than the existing Flusher: it does not depend on any running 
> state/scheduling.
> # Make the "queued" data member of the Flusher bounded, to avoid any potential 
> for out of memory due to its otherwise unbounded nature (a sketch of this idea 
> follows).
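> A minimal sketch of the bounded-queue idea in proposal 2, building on the sketch 
> above (the capacity and overflow handling are illustrative assumptions, not the 
> proposed patch; handleOverflow is a hypothetical hook):
> {code:java}
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
>
> // Hypothetical: bound the queue so a stuck flusher cannot exhaust the heap.
> private final BlockingQueue<Object> queued = new LinkedBlockingQueue<>(10_000);
>
> void enqueue(Object item)
> {
>     if (!queued.offer(item))
>     {
>         // Queue full: apply backpressure (e.g. fail the response or close the channel)
>         // instead of letting heap usage grow without bound.
>         handleOverflow(item); // hypothetical drop/close/backpressure policy
>         return;
>     }
>     if (running.compareAndSet(false, true))
>         eventLoop.execute(this);
> }
> {code}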


