[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benedict updated CASSANDRA-14855:
---------------------------------
    Fix Version/s:     (was: 3.0.17)
                   3.0.18
           Status: Patch Available  (was: Open)

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.18
>
>         Attachments: blocked_thread_pool.png, cpu.png,
> eventloop_scheduledtasks.png, flusher running state.png, heap.png,
> heap_dump.png, read_latency.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We recently had a production issue where about 10 nodes in a 96 node cluster
> ran out of heap.
> From heap dump analysis, I believe there is enough evidence to indicate that
> the `queued` data member of the Flusher grew too large, resulting in out of
> memory.
> Below are specifics on what we found from the heap dump (relevant screenshots
> attached):
> * Multiple Flusher instances had a non-empty "queued" data member, each with
> a retained heap of 0.5GB.
> * The "running" data member of the Flusher had the value "true".
> * The size of scheduledTasks on the event loop was 0.
> We suspect something (maybe an exception) left the Flusher's running state
> set to true while the Flusher was not able to schedule itself with the event
> loop, so nothing drained the queue.
> We could not find any ERROR in the system.log, only the following INFO logs
> around the incident time.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
> io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
>     at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: Backport trunk's ImmediateFlusher
> ([CASSANDRA-13651|https://issues.apache.org/jira/browse/CASSANDRA-13651],
> https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec)
> to 3.0.x and maybe to other versions as well, since ImmediateFlusher seems
> to be more robust than the existing Flusher as it does not depend on any
> running state/scheduling.
> # Make the "queued" data member of the Flusher bounded, so that it cannot
> grow without limit and cause an out of memory condition.
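
As a rough illustration of the second proposal, the following is a minimal Java sketch of a flusher whose queue is bounded and whose running flag is always cleared. It is not Cassandra's Flusher and not the patch attached to this ticket: the class name BoundedFlusherSketch, its methods, the String payload and the capacity are all hypothetical, and the real Flusher is driven by a Netty event loop rather than called directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; names and structure are assumptions, not Cassandra code. */
class BoundedFlusherSketch {
    // Bounded queue: once capacity is reached, offer() fails instead of growing the heap.
    private final BlockingQueue<String> queued;
    // Mirrors the idea of a "running" state; here it is always reset in a finally block.
    private final AtomicBoolean running = new AtomicBoolean(false);
    // Stand-in for the channel the items would be written to.
    private final List<String> flushed = new ArrayList<>();

    BoundedFlusherSketch(int capacity) {
        this.queued = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false when the queue is full so the caller can apply backpressure. */
    boolean enqueue(String item) {
        return queued.offer(item);
    }

    /** Drains everything currently queued; a no-op if a flush is already in progress. */
    void flushOnce() {
        if (!running.compareAndSet(false, true)) {
            return;
        }
        try {
            String item;
            while ((item = queued.poll()) != null) {
                flushed.add(item);
            }
        } finally {
            // Cleared even if draining throws, so the flag cannot stay stuck at true.
            running.set(false);
        }
    }

    boolean isRunning() {
        return running.get();
    }

    int flushedCount() {
        return flushed.size();
    }
}
```

With a capacity of 2, a third enqueue() before any flush returns false rather than adding to the heap. What the caller should do with a rejected item (block, drop, or close the connection) is a design decision this sketch leaves open.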