After a critical node failure on my lab cluster (the node won't come back up and
is still down), ceph reports that the RBD images are still being watched /
mounted. I can't shell into the node to rbd unmap them because the node is down.
I am absolutely certain that nothing is using these images, they have no
snapshots, and the watcher's IP is nowhere near those of the monitors in the
cluster. I blocked the IP using ceph osd blocklist add, but after 30 minutes
they are still being watched. Being watched (they are RWO ceph-csi volumes)
prevents me from re-using them in the cluster. As far as I'm aware, ceph should
remove the watchers once osd_client_watch_timeout (30 seconds, see the output
below) has expired, and the IP has been blocklisted for hours now.

root@node0:~# rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
Watchers:
        watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 4ff5353b865e1
        block_name_prefix: rbd_data.4ff5353b865e1
        format: 2
        features: layering
        op_features: 
        flags: 
        create_timestamp: Fri Mar 31 14:46:51 2023
        access_timestamp: Fri Mar 31 14:46:51 2023
        modify_timestamp: Fri Mar 31 14:46:51 2023
root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
root@node0:~# ceph osd blocklist ls
10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
listed 1 entries
root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
{
    "osd_client_watch_timeout": "30"
}
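
For completeness, I added the blocklist entry by bare IP only, which is why it
shows up as 10.0.0.103:0/0 above. My assumption (untested) is that the exact
client instance reported by the watcher, nonce included, could be blocklisted
as well, roughly:

  # blocklist the exact watcher address from "rbd status" / listwatchers
  ceph osd blocklist add 10.0.0.103:0/992994811

  # confirm the entry, then re-check the watcher after the timeout has passed
  ceph osd blocklist ls
  rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff

As far as I understand it, though, the existing 10.0.0.103:0/0 entry should
already cover every client from that IP.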
Is it possible to kick a watcher out manually, or is there not much I can do
here besides shutting down the entire cluster (or the OSDs) and bringing them
back up? If this is a bug, I'm happy to help figure out its root cause and to
help write a fix.

Cheers, Max.
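
P.S. To make "shutting down OSDs" concrete: my assumption is that it would be
enough to bounce only the acting primary OSD for the image's header object,
since the blocklisted / dead client should not be able to re-establish its
watch afterwards. A rough sketch (the osd id is a placeholder, and the restart
shown is for a package-based deployment):

  # find the PG and acting primary OSD for the header object
  ceph osd map kubernetes rbd_header.4ff5353b865e1

  # restart only that OSD (a cephadm cluster would use
  # "ceph orch daemon restart osd.<id>" instead)
  systemctl restart ceph-osd@<id>

  # once the OSD is back, check whether the watcher is gone
  rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff

But I'd rather avoid even that if there is a supported way to evict the
watcher directly.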