[ 
https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840197#comment-17840197
 ] 

Jon Haddad edited comment on CASSANDRA-19564 at 4/23/24 6:53 PM:
-----------------------------------------------------------------

I've rolled out a small patch 
(https://github.com/rustyrazorblade/cassandra/tree/jhaddad/4.1.4-extra-logging) 
that adds a monitoring thread to each flush, and in every case the stack traces 
it captures point at the filesystem, which is ZFS:

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
        at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0(Native 
Method)
        at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:156)
        at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:236)
        at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
        at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
        at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
        at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:306)
        ...

        Number of locked synchronizers = 1
        - java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

and 

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
        at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0(Native 
Method)
        at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:332)
        at 
java.base@11.0.22/sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72)
        at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232)
        at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
        at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
        at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
        ...

        Number of locked synchronizers = 1
        - java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

I have several dozen of these, and in every case the thread is stuck in either 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0}} or 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0}}.

I'm moving this cluster off ZFS and onto XFS; if the issue goes away, I'll close 
this out.  I don't think there's anything we can do about unreliable filesystems 
other than improving our error reporting around them.
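
For what it's worth, the monitoring approach looks roughly like the sketch below. This is a minimal, illustrative version rather than the code in the branch linked above; the class name, the 10-second interval, and the stderr logging are assumptions made for the example.

{code:java}
// Illustrative sketch only -- not the code in the linked branch.
// The idea: run the flush work through a wrapper that periodically dumps the
// worker thread's stack while the flush is still in progress, which is how the
// unlink0/lstat0 stalls above were captured.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public final class FlushWatchdog
{
    private static final ScheduledExecutorService WATCHDOG =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "FlushWatchdog");
            t.setDaemon(true);
            return t;
        });

    /** Runs the flush task on the calling thread, dumping its stack every 10s while it runs. */
    public static void runMonitored(Runnable flushTask)
    {
        final Thread worker = Thread.currentThread();
        ScheduledFuture<?> monitor = WATCHDOG.scheduleAtFixedRate(() -> {
            StringBuilder sb = new StringBuilder("Flush still running on " + worker.getName() + ":\n");
            for (StackTraceElement frame : worker.getStackTrace())
                sb.append("        at ").append(frame).append('\n');
            System.err.print(sb); // a real patch would go through the logger
        }, 10, 10, TimeUnit.SECONDS);

        try
        {
            flushTask.run();
        }
        finally
        {
            monitor.cancel(false);
        }
    }
}
{code}

Wrapping the flush body in something like {{FlushWatchdog.runMonitored(...)}} would produce periodic stack dumps of the kind shown above.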



> MemtablePostFlush deadlock leads to stuck nodes and crashes
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-19564
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19564
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction, Local/Memtable
>            Reporter: Jon Haddad
>            Priority: Urgent
>             Fix For: 4.1.x
>
>         Attachments: image-2024-04-16-11-55-54-750.png, 
> image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png, 
> image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png, 
> image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png, 
> screenshot-1.png
>
>
> I've run into an issue on a 4.1.4 cluster where an entire node has locked up 
> due to what I believe is a deadlock in memtable flushing. Here's what I know 
> so far.  I've stitched together what happened based on conversations, logs, 
> and some flame graphs.
> *Log reports memtable flushing*
> The last successful flush happens at 12:19. 
> {noformat}
> INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 
> AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', 
> ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: 
> 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
> INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 
> - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB 
> (13%) on-heap, 790.606MiB (15%) off-heap
> {noformat}
> *MemtablePostFlush appears to be blocked*
> At this point, the MemtablePostFlush pool's completed count stops incrementing, 
> active stays at 1, and pending starts to rise.
> {noformat}
> MemtablePostFlush   1    1   3446   0   0
> {noformat}
>  
> The flame graph reveals that PostFlush.call is stuck.  I don't have the line 
> number, but I know we're stuck in 
> {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual 
> below:
> *!image-2024-04-16-13-43-11-064.png!*
> *Memtable flushing is now blocked.*
> All MemtableFlushWriter threads are parked waiting on 
> {{OpOrder.Barrier.await}}. A 30s wall-clock profile reveals all time is spent 
> here.  Presumably we're waiting on the single-threaded post flush.
> !image-2024-04-16-12-29-15-386.png!
> *Memtable allocations start to block*
> Eventually it looks like the NativeAllocator stops successfully allocating 
> memory. I assume it's waiting on memory to be freed, but since memtable 
> flushes are blocked, we wait indefinitely.
> Looking at a wall-clock flame graph, all writer threads have reached the 
> allocation failure path of {{MemtableAllocator.allocate()}}.  I believe we're 
> waiting on {{signal.awaitThrowUncheckedOnInterrupt()}}.
> {noformat}
>  MutationStage    48    828425      980253369      0    0
> {noformat}
> !image-2024-04-16-11-55-54-750.png!
>  
> *Compaction Stops*
> Since we write to the compaction history table, and that requires memtables, 
> compactions are now blocked as well.
>  
> !image-2024-04-16-13-53-24-455.png!
>  
> The node is now doing basically nothing and must be restarted.
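
To make the dependency chain in the description easier to follow, here is a small, self-contained toy model of the blocking pattern. It uses only JDK primitives; the class and variable names are hypothetical and none of Cassandra's actual classes are involved. A latch stands in for the OpOrder barrier and a semaphore for memtable memory accounting, so everything wedges behind the single post-flush thread once it hangs in a filesystem call.

{code:java}
// Toy model of the blocking chain described above, JDK primitives only.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.LockSupport;

public class PostFlushDeadlockModel
{
    public static void main(String[] args)
    {
        ExecutorService postFlush   = Executors.newSingleThreadExecutor(); // like MemtablePostFlush
        ExecutorService flushWriter = Executors.newFixedThreadPool(2);     // like MemtableFlushWriter
        ExecutorService mutations   = Executors.newFixedThreadPool(4);     // like MutationStage

        CountDownLatch postFlushDone = new CountDownLatch(1); // stands in for OpOrder.Barrier
        Semaphore memtableMemory     = new Semaphore(0);      // freed only when a flush completes

        // 1. The single post-flush thread hangs inside the (simulated) unlink/lstat call.
        postFlush.submit(() -> {
            hangInFilesystemCall();      // the real thread is stuck in UnixNativeDispatcher.unlink0
            postFlushDone.countDown();   // never reached
        });

        // 2. Flush writers park on the barrier the post-flush task never releases.
        flushWriter.submit(() -> {
            awaitQuietly(postFlushDone); // OpOrder.Barrier.await in the real stack
            memtableMemory.release(1 << 20);
        });

        // 3. Mutations block waiting for memtable memory that will never be freed.
        for (int i = 0; i < 4; i++)
            mutations.submit(memtableMemory::acquireUninterruptibly); // the signal.await... in the real stack

        System.out.println("All stages submitted; every pool is now wedged behind the post-flush thread.");
    }

    private static void hangInFilesystemCall()
    {
        while (true)
            LockSupport.park();
    }

    private static void awaitQuietly(CountDownLatch latch)
    {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}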


