[jira] [Created] (CASSANDRA-17172) incremental repairs get stuck often

James Brown (Jira) Fri, 26 Nov 2021 20:38:08 -0800

James Brown created CASSANDRA-17172:
---------------------------------------


             Summary: incremental repairs get stuck often
                 Key: CASSANDRA-17172
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17172
             Project: Cassandra
          Issue Type: Bug
          Components: Consistency/Repair
            Reporter: James Brown


We're on 4.0.1 and switched to incremental repairs shortly after upgrading to 
4.0.x. They work fine about 95% of the time, but once in a while a session will 
get stuck and will have to be cancelled (with `nodetool repair_admin cancel -s 
<uuid>`). Typically the session will be in REPAIRING but nothing will actually 
be happening.

Output of nodetool repair_admin:

{{$ nodetool repair_admin
id                                   | state     | last activity | coordinator                          | participants | participants_wp
3a059b10-4ef6-11ec-925f-8f7bcf0ba035 | REPAIRING | 6771 (s)      | 
/[fd00:ea51:d057:200:1:0:0:8e]:25472 | 
fd00:ea51:d057:200:1:0:0:8e,fd00:ea51:d057:200:1:0:0:8f,fd00:ea51:d057:200:1:0:0:92,fd00:ea51:d057:100:1:0:0:571,fd00:ea51:d057:100:1:0:0:570,fd00:ea51:d057:200:1:0:0:93,fd00:ea51:d057:100:1:0:0:573,fd00:ea51:d057:200:1:0:0:90,fd00:ea51:d057:200:1:0:0:91,fd00:ea51:d057:100:1:0:0:572,fd00:ea51:d057:100:1:0:0:575,fd00:ea51:d057:100:1:0:0:574,fd00:ea51:d057:200:1:0:0:94,fd00:ea51:d057:100:1:0:0:577,fd00:ea51:d057:200:1:0:0:95,fd00:ea51:d057:100:1:0:0:576
 | 
[fd00:ea51:d057:200:1:0:0:8e]:25472,[fd00:ea51:d057:200:1:0:0:8f]:25472,[fd00:ea51:d057:200:1:0:0:92]:25472,[fd00:ea51:d057:100:1:0:0:571]:25472,[fd00:ea51:d057:100:1:0:0:570]:25472,[fd00:ea51:d057:200:1:0:0:93]:25472,[fd00:ea51:d057:100:1:0:0:573]:25472,[fd00:ea51:d057:200:1:0:0:90]:25472,[fd00:ea51:d057:200:1:0:0:91]:25472,[fd00:ea51:d057:100:1:0:0:572]:25472,[fd00:ea51:d057:100:1:0:0:575]:25472,[fd00:ea51:d057:100:1:0:0:574]:25472,[fd00:ea51:d057:200:1:0:0:94]:25472,[fd00:ea51:d057:100:1:0:0:577]:25472,[fd00:ea51:d057:200:1:0:0:95]:25472,[fd00:ea51:d057:100:1:0:0:576]:25472}}
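
A minimal sketch of how output like the above could be scanned for sessions that look stuck, i.e. sessions in REPAIRING whose last activity is older than some threshold. It assumes the pipe-separated column order shown here (id, state, last activity) and the `N (s)` last-activity format. The function names and the one-hour default threshold are hypothetical and not part of Cassandra or nodetool.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class RepairSession(NamedTuple):
    session_id: str
    state: str
    idle_seconds: int


# Session ids are UUIDs; the header row and wrapped continuation lines are not.
_UUID = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")
# Assumed "last activity" format, as seen in the report: "6771 (s)".
_IDLE = re.compile(r"^(\d+) \(s\)$")


def parse_sessions(output: str) -> list[RepairSession]:
    """Extract (id, state, idle seconds) from `nodetool repair_admin` text."""
    sessions = []
    for line in output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 3 or not _UUID.match(cols[0]):
            continue  # header, blank, or continuation line
        idle = _IDLE.match(cols[2])
        if idle:
            sessions.append(RepairSession(cols[0], cols[1], int(idle.group(1))))
    return sessions


def stuck_sessions(output: str, max_idle_seconds: int = 3600) -> list[RepairSession]:
    """Sessions in REPAIRING with no activity for longer than the threshold."""
    return [
        s
        for s in parse_sessions(output)
        if s.state == "REPAIRING" and s.idle_seconds > max_idle_seconds
    ]
```

Each returned `session_id` is a candidate for `nodetool repair_admin cancel -s <uuid>`. The sketch deliberately does not run the cancel itself, since a long idle time alone does not prove the session is dead.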

Running `jstack` on the coordinator shows two repair threads, both idle:

{{"Repair#167:1" #602177 daemon prio=5 os_prio=0 cpu=9.60ms elapsed=57359.81s 
tid=0x00007fa6d1741800 nid=0x18e6c waiting on condition  [0x00007fc529f9a000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
- parking to wait for  <0x000000045ba93a18> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
at 
java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

"Repair#170:1" #654814 daemon prio=5 os_prio=0 cpu=9.62ms elapsed=7369.98s 
tid=0x00007fa6aec09000 nid=0x1a96f waiting on condition  [0x00007fc535aae000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
- parking to wait for  <0x00000004c45bf7d8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
at 
java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
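
To check a thread dump for this pattern without reading it by hand, something like the following could work. It is a sketch that assumes raw `jstack` output (each thread header starting with the quoted thread name) and treats a `Repair#` thread as idle when its stack contains `ThreadPoolExecutor.getTask`, meaning the pool worker is waiting for a task rather than running one. The function name is hypothetical.

```python
from __future__ import annotations

import re

# A jstack thread header starts with the thread name in double quotes.
_HEADER = re.compile(r'^"([^"]+)"')


def idle_repair_threads(dump: str) -> dict[str, bool]:
    """Map each Repair# thread name to True if it is parked waiting for work."""
    result: dict[str, bool] = {}
    current = None
    for line in dump.splitlines():
        header = _HEADER.match(line)
        if header:
            name = header.group(1)
            # Only track repair pool threads; ignore every other thread.
            current = name if name.startswith("Repair#") else None
            if current is not None:
                result[current] = False
        elif current is not None and "ThreadPoolExecutor.getTask" in line:
            # The worker is polling its queue, so it has no task in progress.
            result[current] = True
    return result
```

If every `Repair#` thread maps to `True` while the session is still listed as REPAIRING, the coordinator is not doing any repair work for it, which matches the two stacks above.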

nodetool netstats says there is nothing happening:

{{$ nodetool netstats | head -n 2
Mode: NORMAL
Not sending any streams.}}

There's nothing interesting in the logs for this repair; the last relevant 
thing was a bunch of "Created 0 sync tasks based on 6 merkle tree responses for 
3a059b10-4ef6-11ec-925f-8f7bcf0ba035 (took: 0ms)" messages, and then, for the 
last couple of hours, a back-and-forth of things like

{{2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO  [OptionalTasks:1] 
LocalSessions.java:938 - Attempting to learn the outcome of unfinished local 
incremental repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO  [AntiEntropyStage:1] 
LocalSessions.java:987 - Received StatusResponse for repair session 
3a059b10-4ef6-11ec-925f-8f7bcf0ba035 with state REPAIRING, which is not 
actionable. Doing nothing.}}
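
A rough way to quantify that back-and-forth is to count the StatusResponse messages per session and state. The sketch below assumes the message wording shown in the log excerpt above; the function name is hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter

# Matches the message text from the excerpt; \s+ tolerates wrapped lines.
_STATUS = re.compile(
    r"Received StatusResponse for repair session\s+"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\s+"
    r"with state (\w+)"
)


def status_response_counts(log_text: str) -> Counter:
    """Count StatusResponse log messages per (session id, reported state)."""
    return Counter(_STATUS.findall(log_text))
```

A session that keeps accumulating REPAIRING responses over hours, with no other repair messages in between, is likely in the stuck state described here.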

Typically, cancelling the session and rerunning with the exact same command 
line will succeed.

I haven't witnessed this behavior in our testing cluster; it seems to only 
happen on biggish keyspaces.

I know there have historically been issues with this, which is why tools like 
Reaper kill any repair that takes more than a short period of time, but I 
thought those issues were expected to have been fixed in 4.0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
