[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Swen Fuhrmann reassigned CASSANDRA-15902: ----------------------------------------- Assignee: Swen Fuhrmann > OOM because repair session thread not closed when terminating repair > -------------------------------------------------------------------- > > Key: CASSANDRA-15902 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15902 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair > Reporter: Swen Fuhrmann > Assignee: Swen Fuhrmann > Priority: Normal > > In our cluster, after a while some nodes running slowly out of memory. On > that nodes we observed that Cassandra Reaper cancel repairs with a JMX call > to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because reaching > timeout of 30 min. > In the memory heap dump we see >100 instances of > {{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see > lot of repair threads: > {noformat} > grep "Repair#" threaddump.txt | wc -l > 50 {noformat} > > The repair jobs are waiting for the validation to finish: > {noformat} > "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x0000000012fc5000 > nid=0x542a waiting on condition [0x00007f81ee414000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000007939bcfc8> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) > at > com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509) > at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > at > org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {noformat} > > Thats the line where the threads stuck: > {noformat} > // Wait for validation to complete > Futures.getUnchecked(validations); {noformat} > > The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops > the thread pool executor. It looks like that futures which are in progress > will therefor never be completed and the repair thread waits forever and > won't be finished. > > Environment: > Cassandra version: 3.11.4 > Cassandra Reaper: 1.4.0 > Java Runtime: > {noformat} > openjdk version "1.8.0_212" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org