[ https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840197#comment-17840197 ]
Jon Haddad edited comment on CASSANDRA-19564 at 4/23/24 6:53 PM:
-----------------------------------------------------------------
I've rolled out a small patch (https://github.com/rustyrazorblade/cassandra/tree/jhaddad/4.1.4-extra-logging) that adds a monitoring thread to each flush, and in every case the stack traces relate to the filesystem, which is ZFS:
{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
    at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
    at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:156)
    at java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:236)
    at java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
    at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
    at app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
    at app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
    at app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:306)
    ...

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}
and
{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
    at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0(Native Method)
    at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:332)
    at java.base@11.0.22/sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72)
    at java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232)
    at java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
    at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
    at app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
    at app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
    ...
    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}
I have several dozen of these, and in every case we're at either {{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0}} or {{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0}}. I'm moving this cluster off ZFS and onto XFS; if the issue goes away, I'll close this out. I don't think there's anything we can do about unreliable filesystems other than improving our error reporting around them.
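The linked branch isn't reproduced here, but the general technique it describes — a watchdog that dumps the stack of a flush thread that looks stuck — can be sketched with plain JMX. This is only an illustration of the idea, not the actual patch: the class name, the thread-name prefix matching, and the use of {{ThreadMXBean}} are assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical watchdog helper: dump the stacks of all live threads whose
// name starts with a given prefix (e.g. "MemtablePostFlush"), so a stuck
// post-flush task shows exactly where it is blocked.
public class FlushWatchdog {
    public static String dumpMatching(String namePrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // lockedMonitors/lockedSynchronizers = true gives the
        // "Number of locked synchronizers" lines seen in the report above.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            if (info != null && info.getThreadName().startsWith(namePrefix)) {
                sb.append(info.toString());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Simulate a flush worker parked in a slow native call.
        Thread worker = new Thread(() -> {
            try { Thread.sleep(10_000); } catch (InterruptedException ignored) {}
        }, "MemtablePostFlush:1");
        worker.setDaemon(true);
        worker.start();
        Thread.sleep(200); // give the worker time to park
        System.out.println(dumpMatching("MemtablePostFlush"));
        worker.interrupt();
    }
}
```

In a real deployment this would run on a scheduled thread and only fire when a flush exceeds some deadline; the sketch just shows the stack-capture mechanism.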
> MemtablePostFlush deadlock leads to stuck nodes and crashes
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-19564
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19564
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction, Local/Memtable
>            Reporter: Jon Haddad
>            Priority: Urgent
>             Fix For: 4.1.x
>
>         Attachments: image-2024-04-16-11-55-54-750.png,
> image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png,
> image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png,
> image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png,
> screenshot-1.png
>
>
> I've run into an issue on a 4.1.4 cluster where an entire node has locked up
> due to what I believe is a deadlock in memtable flushing. Here's what I know
> so far. I've stitched together what happened based on conversations, logs,
> and some flame graphs.
> *Log reports memtable flushing*
> The last successful flush happens at 12:19.
> {noformat}
> INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
> INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB (13%) on-heap, 790.606MiB (15%) off-heap
> {noformat}
> *MemtablePostFlush appears to be blocked*
> At this point, MemtablePostFlush completed tasks stop incrementing, active
> stays at 1, and pending starts to rise.
> {noformat}
> MemtablePostFlush 1 1 3446 0 0
> {noformat}
>
> The flame graph reveals that PostFlush.call is stuck.
> I don't have the line number, but I know we're stuck in
> {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual
> below:
> *!image-2024-04-16-13-43-11-064.png!*
> *Memtable flushing is now blocked.*
> All MemtableFlushWriter threads are parked waiting on
> {{OpOrder.Barrier.await}}. A wall clock profile of 30s reveals all time is
> spent here. Presumably we're waiting on the single-threaded MemtablePostFlush.
> !image-2024-04-16-12-29-15-386.png!
> *Memtable allocations start to block*
> Eventually it looks like the NativeAllocator stops successfully allocating
> memory. I assume it's waiting on memory to be freed, but since memtable
> flushes are blocked, we wait indefinitely.
> Looking at a wall clock flame graph, all writer threads have reached the
> allocation failure path of {{MemtableAllocator.allocate()}}. I believe we're
> waiting on {{signal.awaitThrowUncheckedOnInterrupt()}}.
> {noformat}
> MutationStage 48 828425 980253369 0 0
> {noformat}
> !image-2024-04-16-11-55-54-750.png!
>
> *Compaction Stops*
> Since we write to the compaction history table, and that requires memtables,
> compactions are now blocked as well.
>
> !image-2024-04-16-13-53-24-455.png!
>
> The node is now doing basically nothing and must be restarted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
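The failure mode described in the issue above — MemtablePostFlush active stuck at 1 while pending climbs — is the classic single-threaded-executor pile-up: one task that never returns starves every task queued behind it. A minimal, self-contained sketch of that dynamic (plain {{java.util.concurrent}}, no Cassandra code; all names are illustrative):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative model: MemtablePostFlush is effectively a one-thread executor,
// so a single stuck post-flush task blocks every later flush's post-flush step.
public class PostFlushStall {
    public static int pendingWhileStuck() throws Exception {
        ThreadPoolExecutor postFlush =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(1);
        CountDownLatch stuck = new CountDownLatch(1);

        // The "stuck" PostFlush.call: holds the only worker indefinitely.
        postFlush.submit(() -> { stuck.await(); return null; });

        // Later flushes enqueue their post-flush tasks behind it.
        for (int i = 0; i < 5; i++) {
            postFlush.submit(() -> null);
        }

        Thread.sleep(100); // let the first task occupy the worker
        int pending = postFlush.getQueue().size(); // active stays 1, pending rises

        stuck.countDown(); // release the worker so the sketch terminates
        postFlush.shutdown();
        postFlush.awaitTermination(5, TimeUnit.SECONDS);
        return pending;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pending = " + pendingWhileStuck());
    }
}
```

In the real node the queued tasks are flushes holding memtable memory, so the stall cascades: flush writers park on the OpOrder barrier, the NativeAllocator can't reclaim memory, and writes and compactions stop behind it.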