I am deploying Rook 1.10.13 with Ceph 17.2.6 on our Kubernetes clusters. We use 
the Ceph Shared Filesystem heavily and have never faced an issue with it.

Lately, we have deployed it on Oracle Linux 9 VMs (our previous/existing 
deployments use CentOS/RHEL 7) and we are facing the following issue:

We have 30 worker nodes running a StatefulSet with 30 replicas (one pod per 
worker node). Each pod runs a container with a Java process that waits for jobs 
to be submitted. When a job arrives, it processes the request and writes the 
data into a CephFS shared filesystem. That shared filesystem is a single PVC 
mounted by all the pods in the StatefulSet.
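
For reference, the shared volume is provisioned roughly like this (a minimal 
sketch, not our exact manifest; the PVC name, size and storage class name below 
are placeholders):

```
# Hypothetical sketch of the shared CephFS PVC; names/sizes are placeholders.
# All 30 pods mount this single ReadWriteMany volume via the CephFS CSI driver.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data                    # placeholder name
spec:
  accessModes:
    - ReadWriteMany                    # one PVC shared by every replica
  resources:
    requests:
      storage: 1Ti                     # placeholder size
  storageClassName: ceph-filesystem    # Rook CephFS storage class (assumed name)
EOF
```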

The problem is that, from time to time, some Java processes get stuck 
indefinitely when accessing the filesystem, e.g. for more than 6 hours in the 
thread dump snippet below:
```
"th-0-data-writer-site" #503 [505] prio=5 os_prio=0 cpu=451.11ms 
elapsed=22084.19s tid=0x00007f8c3c04db10 nid=505 runnable  [0x00007f8d8fdfc000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.fs.UnixNativeDispatcher.lstat0(java.base@22-ea/Native Method)
        at 
sun.nio.fs.UnixNativeDispatcher.lstat(java.base@22-ea/UnixNativeDispatcher.java:351)
        at 
sun.nio.fs.UnixFileAttributes.get(java.base@22-ea/UnixFileAttributes.java:72)
        at 
sun.nio.fs.UnixFileSystemProvider.implDelete(java.base@22-ea/UnixFileSystemProvider.java:274)
        at 
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(java.base@22-ea/AbstractFileSystemProvider.java:109)
        at java.nio.file.Files.deleteIfExists(java.base@22-ea/Files.java:1191)
        at 
com.x.streams.dataprovider.FileSystemDataProvider.close(FileSystemDataProvider.java:109)
        at 
com.x.streams.components.XDataWriter.closeWriters(XDataWriter.java:241)
        at 
com.x.streams.components.XDataWriter.onTerminate(XDataWriter.java:255)
        at com.x.streams.core.StreamReader.doOnTerminate(StreamReader.java:136)
        at com.x.streams.core.StreamReader.processData(StreamReader.java:112)
        at 
com.x.streams.core.ExecutionEngine$ProcessingThreadTask.run(ExecutionEngine.java:604)
        at java.lang.Thread.runWith(java.base@22-ea/Thread.java:1583)
        at java.lang.Thread.run(java.base@22-ea/Thread.java:1570)
```
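
On the affected worker node, the only checks I know to run so far are roughly 
these (a sketch; it assumes the CephFS kernel client is used by the CSI driver, 
that debugfs is mounted at /sys/kernel/debug, and that we are root on the node):

```
# In-flight MDS requests of the kernel CephFS client(s) on this node;
# a request that never completes should show up here.
cat /sys/kernel/debug/ceph/*/mdsc

# Capabilities currently held by the kernel client.
cat /sys/kernel/debug/ceph/*/caps

# Kernel-side stack of the stuck thread. <host-tid> is a placeholder: the
# nid=505 in the dump above is the in-container TID, so the host TID differs.
cat /proc/<host-tid>/stack

# Any CephFS-related kernel messages.
dmesg | grep -i ceph
```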

Once the system reaches that point, it does not recover until we kill (the pod 
of) the active MDS replica.
If we look at `ceph health detail`, we see this:

```
[root@rook-ceph-tools-75c947bc9d-ggb7m /]# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability release
    mds.ceph-filesystem-a(mds.0): Client worker45:csi-cephfs-node failing to respond to capability release client_id: 5927564
    mds.ceph-filesystem-a(mds.0): Client worker1:csi-cephfs-node failing to respond to capability release client_id: 7804133
    mds.ceph-filesystem-a(mds.0): Client worker39:csi-cephfs-node failing to respond to capability release client_id: 8391464
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ceph-filesystem-a(mds.0): 31 slow requests are blocked > 30 secs
```
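
From the toolbox pod, I assume the next step is to see which requests are 
blocked and which sessions the listed client IDs belong to. This is a sketch of 
what I plan to collect next (the MDS name is the one from the output above):

```
# Requests currently stuck in the MDS, with the op and inode each one waits on.
ceph tell mds.ceph-filesystem-a dump_ops_in_flight

# Map the client IDs from the warning (5927564, 7804133, 8391464) to their
# sessions, mount points and number of caps held.
ceph tell mds.ceph-filesystem-a session ls

# Overall filesystem / MDS state.
ceph fs status

# Possible last-resort recovery for a single stuck client instead of killing
# the whole MDS pod; note it blocklists the client, so the CephFS mount on that
# node will likely need to be remounted afterwards.
ceph tell mds.ceph-filesystem-a client evict id=5927564
```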

Any hints on how to troubleshoot this? My intuition is that some capability 
releases from the clients never reach the MDS, and that portion of the shared 
filesystem then stays locked for good, but I am completely speculating here. 
I would appreciate any pointers or indications on how to debug it further.
We have other clusters running in production with almost the same configuration 
(apart from the OS) and everything runs fine there, but we cannot find the 
reason why we get this behaviour on the new clusters.
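
Since the only real difference is the OS, the first thing I plan to compare 
between the old (CentOS/RHEL 7) and new (Oracle Linux 9) workers is the kernel 
CephFS client and the mount options, roughly like this (run on one worker of 
each generation):

```
# Kernel version on the worker (the CephFS kernel client lives in the kernel).
uname -r

# Details of the ceph kernel module that provides the CephFS client.
modinfo ceph | head

# CephFS mounts created by the CSI driver and their mount options.
grep ceph /proc/mounts
```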