[ceph-users] Dead node (watcher) won't timeout on RBD

2023-04-25 Thread max
Hey all,

I recently had a k8s node failure in my homelab, and even though I powered the
node off (and it's done for, so it won't come back up), it still shows up as a
watcher in `rbd status`.

```
root@node0:~# rbd status kubernetes/csi-vol-3e7af8ae-ceb6-4c94-8435-2f8dc29b313b
Watchers:
watcher=10.0.0.103:0/1520114202 client.1697844 cookie=140289402510784
watcher=10.0.0.103:0/39967552 client.1805496 cookie=140549449430704
root@node0:~# ceph osd blocklist ls
10.0.0.103:0/0 2023-04-15T13:15:39.061379+0200
listed 1 entries
```

Even though the node is down and I have blocklisted it multiple times for
hours, the watcher won't disappear. As a result, ceph-csi-rbd claims the image
is already mounted (mapping the image manually works fine, and I can cleanly
unmap it again, but I obviously can't unmap it from a node that no longer
exists).
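
In case it's relevant, the watch is also visible at the RADOS level. A sketch
of how I'm checking it (the `<id>` placeholder is the image id taken from the
`block_name_prefix` that `rbd info` prints):

```
# The image id is the suffix of block_name_prefix (rbd_data.<id>);
# the matching header object is rbd_header.<id>.
rbd info kubernetes/csi-vol-3e7af8ae-ceb6-4c94-8435-2f8dc29b313b | grep block_name_prefix

# List watchers directly on the header object.
rados -p kubernetes listwatchers rbd_header.<id>
```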

Is there any way to force-kick an RBD client / watcher from Ceph (e.g. by
switching the mgr / mon), or to see why this watch is not timing out?
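
One workaround I've seen suggested (untested on my side, so treat it as a
sketch) is to restart the OSD that is primary for the image's header object:
that should drop all watches on it, live clients re-establish theirs, and a
dead client can't:

```
# Find the PG and acting set for the header object; the first OSD
# listed is the primary (rbd_header.<id> as derived above).
ceph osd map kubernetes rbd_header.<id>

# Restart that OSD (osd.3 is just an example; adjust the unit name
# to your deployment) so the watch state is rebuilt.
systemctl restart ceph-osd@3
```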

I found some historical mails & issues (related to Rook, which I don't use)
regarding the `osd_client_watch_timeout` parameter, but I can't figure out how
it relates to the RBD images.
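
For what it's worth, this is how I'm querying that option; a live OSD reports
30 (presumably seconds):

```
# Cluster-wide configured / default value:
ceph config get osd osd_client_watch_timeout

# Value as seen by a running OSD daemon:
ceph daemon osd.0 config get osd_client_watch_timeout
```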

Cheers,
Max.


[ceph-users] Dead node (watcher) won't timeout on RBD

2023-04-15 Thread Max Boone


After a critical node failure on my lab cluster, which won't come back up and
is still down, the RBD objects are still being watched / mounted according to
Ceph. I can't shell into the node to unmap them, as the node is down. I am
absolutely certain that nothing is using these images, and they don't have
snapshots either (and the watcher's IP is not even remotely close to those of
the monitors in the cluster). I blocked the IP using `ceph osd blocklist add`,
but after 30 minutes they are still being watched. The watches (these are RWO
ceph-csi volumes) prevent me from re-using the images in the cluster. As far
as I'm aware, Ceph should remove the watchers after 30 minutes, and they've
been blocklisted for hours now.

```
root@node0:~# rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
Watchers:
watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
    size 10 GiB in 2560 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 4ff5353b865e1
    block_name_prefix: rbd_data.4ff5353b865e1
    format: 2
    features: layering
    op_features:
    flags:
    create_timestamp: Fri Mar 31 14:46:51 2023
    access_timestamp: Fri Mar 31 14:46:51 2023
    modify_timestamp: Fri Mar 31 14:46:51 2023
root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
root@node0:~# ceph osd blocklist ls
10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
listed 1 entries
root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
{
    "osd_client_watch_timeout": "30"
}
```
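
One thing I notice above: the blocklist entry is the wildcard `10.0.0.103:0/0`,
while the watcher carries a specific nonce. In case the wildcard entry isn't
matching this client, a sketch of blocklisting the watcher's exact address
(nonce included; untested on my side):

```
# Blocklist the exact watcher address reported by `rbd status` /
# `rados listwatchers`, rather than the ip:0/0 wildcard entry.
ceph osd blocklist add 10.0.0.103:0/992994811
```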

Is it possible to kick a watcher out manually, or is there not much I can do
here besides shutting down the entire cluster (or the OSDs) and bringing them
back up? If this is a bug, I'm happy to help figure out its root cause and see
if I can help write a fix.

Cheers,
Max.