[ceph-users] Dead node (watcher) won't timeout on RBD
Hey all, I recently had a k8s node failure in my homelab. Even though I powered the node off (and it's done for, so it won't come back up), it still shows up as a watcher in `rbd status`:

```
root@node0:~# rbd status kubernetes/csi-vol-3e7af8ae-ceb6-4c94-8435-2f8dc29b313b
Watchers:
        watcher=10.0.0.103:0/1520114202 client.1697844 cookie=140289402510784
        watcher=10.0.0.103:0/39967552 client.1805496 cookie=140549449430704
root@node0:~# ceph osd blocklist ls
10.0.0.103:0/0 2023-04-15T13:15:39.061379+0200
listed 1 entries
```

Even though the node is down and I have blocklisted it multiple times for hours, the watcher won't disappear. As a result, ceph-csi-rbd claims the image is already mounted (mapping the image manually works fine, and it unmaps cleanly as well, but I obviously can't unmap it from a node that no longer exists).

Is there any way to force-kick an RBD client/watcher from Ceph (e.g. by switching the mgr/mon), or to see why this is not timing out? I found some historical mails and issues (related to Rook, which I don't use) mentioning a parameter `osd_client_watch_timeout`, but I can't find how that relates to RBD images.

Cheers, Max.
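For completeness, here is what I would try next: blocklisting the exact per-client addresses (IP:port/nonce) from the `rbd status` output above, rather than the bare IP. This is just a guess on my part; I'm not certain the `10.0.0.103:0/0` wildcard entry actually covers the individual client nonces.

```
# Blocklist the exact client instances reported by `rbd status`
# (address:port/nonce), not just the bare IP. Assumption on my side:
# the 0/0 wildcard entry may not match the per-nonce watchers.
ceph osd blocklist add 10.0.0.103:0/1520114202
ceph osd blocklist add 10.0.0.103:0/39967552

# Verify the entries and re-check the watchers afterwards
ceph osd blocklist ls
rbd status kubernetes/csi-vol-3e7af8ae-ceb6-4c94-8435-2f8dc29b313b
```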
[ceph-users] Dead node (watcher) won't timeout on RBD
After a critical node failure on my lab cluster (the node won't come back up and is still down), the RBD images are still being watched/mounted according to Ceph. I can't shell into the node to unmap them, as the node is down. I am absolutely certain that nothing is using these images, they don't have snapshots either, and the watcher's IP is not even remotely close to those of the monitors in the cluster.

I blocked the IP using `ceph osd blocklist add`, but after 30 minutes the images are still being watched. Them being watched (they are RWO ceph-csi volumes) prevents me from re-using them in the cluster. As far as I'm aware, Ceph should remove the watchers after 30 minutes, and the IP has been blocklisted for hours now.

```
root@node0:~# rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
Watchers:
        watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 4ff5353b865e1
        block_name_prefix: rbd_data.4ff5353b865e1
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Fri Mar 31 14:46:51 2023
        access_timestamp: Fri Mar 31 14:46:51 2023
        modify_timestamp: Fri Mar 31 14:46:51 2023
root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
root@node0:~# ceph osd blocklist ls
10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
listed 1 entries
root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
{
    "osd_client_watch_timeout": "30"
}
```

Is it possible to kick a watcher out manually, or is there not much I can do here besides shutting down the entire cluster (or the OSDs) and bringing them back up? If this is a bug, I'm happy to help figure out its root cause and see if I can help write a fix.

Cheers, Max.
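One more thing I'm considering, in case it helps: my (unverified) understanding is that the watch state lives on the primary OSD for the image's header object, so restarting just that one daemon might force the watch to be re-established, which a blocklisted client shouldn't be able to do. A sketch, assuming the image id `4ff5353b865e1` from the `rbd info` output above; `<primary-osd-id>` is a placeholder:

```
# Ask Ceph which OSD is primary for the image's header object
# (rbd_header.<id>, with <id> taken from `rbd info`):
ceph osd map kubernetes rbd_header.4ff5353b865e1

# Restart only that OSD -- assumption: this drops the in-memory watch,
# and the blocklisted client can't re-register it afterwards:
systemctl restart ceph-osd@<primary-osd-id>

# Then re-check whether the watcher is gone:
rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
```

If someone can confirm whether this is expected to work, or whether the watch survives a single-OSD restart, that would already help.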