[ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537972#comment-15537972 ]
Benjamin Roth commented on CASSANDRA-12689:
-------------------------------------------

Hi Tyler,

Thanks for the review. Of course this solution is error prone, but as I stated earlier, it's IMHO the only one that fixes the problem now with no risk.

I had a conversation with @zznate recently and he asked me to remove those "ugly test switches" like TEST_FORCE_DEFERABLE_MUTATIONS. I personally don't care - I am just a newbie to CS. Either I leave the test switches in, apply your feedback and commit that dtest, or I throw them away, but then this situation is no longer testable with dtest.

The next problem with dtest is that only the positive test works nicely. The negative test ends in write timeouts and shows tons of errors. I implemented it as a proof, but I would not recommend committing it. So TEST_FORCE_DEFERABLE_MUTATIONS can also be thrown away; it's only required for the negative test.

If I commit that dtest, should I create a new test file or append the test to an existing one?

So, @thobbs + @zznate, please tell me how to move on.

I had already thought that PaxosState.commit may also need to be non-deferrable, but I was not sure whether that also happens within the mutation stage. Unfortunately I am missing the bigger picture of the CS code at the moment.

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Assignee: Benjamin Roth
>            Priority: Critical
>             Fix For: 3.0.x, 3.x
>
>
> Under heavy load (e.g. due to repair during normal operations), a lot of
> NullPointerExceptions occur in MutationStage.
> Unfortunately, the log is not very chatty, trace is missing:
> {noformat}
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> {noformat}
> Then, after some time, in most cases ALL threads in MutationStage pools are
> completely blocked. This leads to pending tasks piling up until the server runs
> OOM and is completely unresponsive due to GC. Threads will NEVER unblock
> until server restart, even if load goes down completely, all hints are
> paused, and no compaction or repair is running. Only a restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy
> load, but tasks should be processed and dequeued after load goes down. This is
> definitely not the case. This looks more like an unhandled exception
> leading to a stuck lock.
> Stack trace from jconsole, all threads in MutationStage show the same trace.
> {noformat}
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> {noformat}
> Stack trace:
> {noformat}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
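
For readers following the trace: every MutationStage thread is parked in CompletableFuture.get() underneath Mutation.apply(). One way a fixed-size stage can end up like this is when each worker defers part of its work back onto the same stage and then blocks on the result. The sketch below only illustrates that general pattern with plain java.util.concurrent classes. It is not Cassandra code; the class name StageStarvationSketch, the method drains() and the pool size are hypothetical; and it assumes (the ticket does not confirm this) that the future being waited on can only be completed by another task on the same pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Cassandra code.
final class StageStarvationSketch {

    /**
     * Fills a fixed pool ("stage") with tasks that each defer a follow-up task
     * onto the same pool. Returns true if all work finished within waitMillis.
     */
    static boolean drains(int poolSize, boolean blockOnDeferred, long waitMillis) {
        ExecutorService stage = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allOccupied = new CountDownLatch(poolSize);
        CountDownLatch finished = new CountDownLatch(poolSize);
        try {
            for (int i = 0; i < poolSize; i++) {
                stage.execute(() -> {
                    try {
                        // Wait until every worker of the stage is busy.
                        allOccupied.countDown();
                        allOccupied.await();
                        // Queue the deferred work on the same stage.
                        CompletableFuture<Void> deferred =
                                CompletableFuture.runAsync(() -> { }, stage);
                        if (blockOnDeferred) {
                            // Blocks forever: no free worker can run 'deferred'.
                            deferred.get();
                            finished.countDown();
                        } else {
                            // Frees the worker; completion is signalled by callback.
                            deferred.thenRun(finished::countDown);
                        }
                    } catch (InterruptedException | ExecutionException e) {
                        // Reached via shutdownNow() in the blocking case.
                    }
                });
            }
            return finished.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            stage.shutdownNow(); // interrupt any workers still parked in get()
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking on deferred work drains: " + drains(4, true, 500));
        System.out.println("callback on deferred work drains: " + drains(4, false, 5000));
    }
}
```

In the blocking variant the deferred tasks sit in the queue while every worker waits on them, so nothing is ever dequeued, even with no new load arriving. That matches the symptom described in the ticket, but whether it is the actual cause here is for the patch review to establish.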