[ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513841#comment-15513841 ]
Benjamin Roth commented on CASSANDRA-12689:
-------------------------------------------

As a graph, this may look like this:
https://cl.ly/0N3l0D1v1P1H

You can see the mutations increase linearly. The drop always came after restarting C*.
This is just one example; this scenario happened much more often.

This is the load graph of the same time window:
https://cl.ly/2m1S2K081o3n

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Priority: Critical
>
> Under heavy load (e.g. due to repair during normal operations), a lot of
> NullPointerExceptions occur in MutationStage. Unfortunately, the log is not
> very chatty, and the stack trace is missing:
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1]
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught
> exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> Then, after some time, in most cases ALL threads in the MutationStage pool are
> completely blocked. This leads to pending tasks piling up until the server runs
> OOM and is completely unresponsive due to GC. Threads will NEVER unblock
> until the server is restarted, even if load goes completely down, all hints are
> paused, and no compaction or repair is running. Only a restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy
> load, but tasks should be processed and dequeued after load goes down. This is
> definitely not the case. This looks more like an unhandled exception
> leading to a stuck lock.
> Stack trace from jconsole; all threads in MutationStage show the same trace.
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> Stack trace:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)
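
The hypothesis in the description (an unhandled exception that leaves waiters parked forever) can be illustrated in isolation. The following is only a minimal sketch of that general failure mode, not Cassandra code: the class StuckFutureSketch and its methods are hypothetical, the NullPointerException is simulated, and a bounded get() is used in place of the untimed wait seen in the stack trace so that the example terminates. It shows that a waiter on a CompletableFuture never wakes up if the producing code fails without completing the future, and that completing it exceptionally releases the waiter.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckFutureSketch {

    /** Hypothetical stand-in for work that fails before producing a result. */
    static String riskyWork() {
        throw new NullPointerException("simulated failure");
    }

    /** Runs the work and is expected to complete the future with its result. */
    static void produce(CompletableFuture<String> future, boolean handleFailure) {
        try {
            future.complete(riskyWork());
        } catch (RuntimeException e) {
            if (handleFailure) {
                // Propagate the failure so that waiters are released.
                future.completeExceptionally(e);
            }
            // Otherwise the future is left incomplete, which models a failure
            // that escapes without anyone signalling the waiters.
        }
    }

    /** Waits for the future the way a blocked worker would, but with a timeout. */
    public static String waitForResult(boolean handleFailure) {
        CompletableFuture<String> future = new CompletableFuture<>();
        Thread producer = new Thread(() -> produce(future, handleFailure));
        producer.start();
        try {
            producer.join();
            // An untimed get() would park here forever if the future is incomplete.
            return "completed: " + future.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "stuck";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(waitForResult(false)); // prints "stuck"
        System.out.println(waitForResult(true));  // prints "failed: NullPointerException"
    }
}
```

Whether this is what actually happens inside Mutation.apply in this ticket is not established by the trace alone; the sketch only shows why a waiter in this state cannot recover when load drops.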