[ceph-users] Re: Stuck OSD service specification - can't remove
Hi,

did you ever resolve that? I'm stuck with the same "deleting" service in 'ceph orch ls' and found your thread.

Thanks,
Eugen
[ceph-users] Re: Stuck OSD service specification - can't remove
Hi,

I tried to respond directly in the web UI of the mailing list, but my message is queued for moderation. I just wanted to share a solution that worked for me when a service spec is stuck in a pending state; maybe this will help others in the same situation.

While playing around with a test cluster I ended up with a "deleting" osd service spec. The SUSE team has an article [1] for this case, and the following resolved the issue for me. I had three different osd specs in place for the same three nodes:

---snip---
osd                  3  3w           nautilus2;nautilus3
osd.osd-hdd-ssd      3  2m ago   2w  nautilus;nautilus2;nautilus3
osd.osd-hdd-ssd-mix  3  2m ago   -
---snip---

I replaced the "service_name" with the more suitable value ("osd.osd-hdd-ssd") in the unit.meta file of each OSD containing the invalid spec, then restarted each affected OSD. It probably wouldn't have been necessary, but I wanted to see the effect immediately, so I also failed over the mgr (ceph mgr fail). Now I only have one valid osd spec:

---snip---
# before
nautilus3:~ # grep service_name /var/lib/ceph/201a2fbc-ce7b-44a3-9ed7-39427972083b/osd.3/unit.meta
    "service_name": "osd",

# after
nautilus3:~ # grep service_name /var/lib/ceph/201a2fbc-ce7b-44a3-9ed7-39427972083b/osd.3/unit.meta
    "service_name": "osd.osd-hdd-ssd",

nautilus3:~ # ceph orch ls osd
NAME             PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd.osd-hdd-ssd             9    10m ago    2w   nautilus;nautilus2;nautilus3
---snip---

Regards,
Eugen

[1] https://www.suse.com/support/kb/doc/?id=20667
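P.S.: Scripted across all affected OSDs, the same fix looks roughly like the sketch below. The FSID is the one from my output above; the OSD IDs and the spec name to keep are examples from my cluster, so adjust them before running anything.

---snip---
#!/bin/bash
FSID=201a2fbc-ce7b-44a3-9ed7-39427972083b   # cluster FSID (from the output above)
KEEP=osd.osd-hdd-ssd                        # spec name to keep (example)

for id in 3 4 5; do                         # affected OSD IDs (example)
    meta="/var/lib/ceph/$FSID/osd.$id/unit.meta"
    # Point the daemon at the surviving spec...
    sed -i "s/\"service_name\": \"osd\"/\"service_name\": \"$KEEP\"/" "$meta"
    # ...and restart it so the change takes effect.
    systemctl restart "ceph-$FSID@osd.$id.service"
done

# Optional: fail over the mgr so 'ceph orch ls' refreshes immediately.
ceph mgr fail
---snip---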
[ceph-users] Re: Stuck OSD service specification - can't remove
Hello David,

did you resolve it? I have the same problem for rgw. I upgraded from Nautilus to Pacific.

Regards,
Jie
[ceph-users] Re: Stuck OSD service specification - can't remove
Hi David,

I had a similar issue yesterday where I wanted to remove an OSD on an OSD node which had 2 OSDs, so I used the "ceph orch osd rm" command, which completed successfully. But after rebooting that OSD node I saw it was still trying to start the systemd service for that OSD, and one CPU core was 100% busy trying to do a "crun delete", which I suppose was trying to delete an image or container. So what I did was kill this process, and I also had to run the following command:

ceph orch daemon rm osd.3 --force

After that everything was fine again. This is a Ceph 15.2.11 cluster on Ubuntu 20.04 and podman.

Hope that helps.

‐‐‐ Original Message ‐‐‐
On Friday, May 7, 2021 1:24 AM, David Orman wrote:

> Has anybody run into a 'stuck' OSD service specification? I've tried
> to delete it, but it's stuck in 'deleting' state, and has been for
> quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
>
> NAME          PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
> osd.osd_spec         504/525  12m             label:osd
>
> root@ceph01:/# ceph orch rm osd.osd_spec
> Removed service osd.osd_spec
>
> From the active monitor:
>
> debug 2021-05-06T23:14:48.909+0000 7f17d310b700 0
> log_channel(cephadm) log [INF] : Remove service osd.osd_spec
>
> Yet in ls, it's still there, same as above. --export on it:
>
> root@ceph01:/# ceph orch ls osd.osd_spec --export
> service_type: osd
> service_id: osd_spec
> service_name: osd.osd_spec
> placement: {}
> unmanaged: true
> spec:
>   filter_logic: AND
>   objectstore: bluestore
>
> We've tried --force as well, with no luck.
>
> To be clear, the --export even prior to delete looks nothing like the
> actual service specification we're using, even after I re-apply it, so
> something seems 'bugged'. Here's the OSD specification we're applying:
>
> service_type: osd
> service_id: osd_spec
> placement:
>   label: "osd"
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
> db_slots: 12
>
> I would appreciate any insight into how to clear this up (without
> removing the actual OSDs; we just want to apply the updated
> service specification - we used to use host placement rules and are
> switching to label-based).
>
> Thanks,
> David
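P.S.: For reference, the whole recovery came down to two steps (a sketch; the PID comes from your own ps output, and osd.3 was the daemon in my case, so substitute yours):

---snip---
# Find the stuck container-runtime process and kill it.
ps aux | grep 'crun delete'
kill <PID>

# Remove the leftover daemon record from cephadm.
ceph orch daemon rm osd.3 --force
---snip---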
[ceph-users] Re: Stuck OSD service specification - can't remove
Hi,

I'm not attempting to remove the OSDs, but instead the service/placement specification. I want the OSDs/data to persist. --force did not work on the service, as noted in the original email.

Thank you,
David

On Fri, May 7, 2021 at 1:36 AM mabi wrote:
>
> Hi David,
>
> I had a similar issue yesterday where I wanted to remove an OSD on an OSD
> node which had 2 OSDs, so I used the "ceph orch osd rm" command, which
> completed successfully. But after rebooting that OSD node I saw it was
> still trying to start the systemd service for that OSD, and one CPU core
> was 100% busy trying to do a "crun delete", which I suppose was trying to
> delete an image or container. So what I did was kill this process, and I
> also had to run the following command:
>
> ceph orch daemon rm osd.3 --force
>
> After that everything was fine again. This is a Ceph 15.2.11 cluster on
> Ubuntu 20.04 and podman.
>
> Hope that helps.
>
> ‐‐‐ Original Message ‐‐‐
> On Friday, May 7, 2021 1:24 AM, David Orman wrote:
>
> > [...]
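To spell out the distinction, since the three removal commands are easy to mix up (as I understand the cephadm CLI; "3", "osd.3", and "osd.osd_spec" are example names):

---snip---
# Drains and removes an OSD daemon (and eventually its data) - not what I want:
ceph orch osd rm 3

# Force-removes a single daemon record - also not what I want:
ceph orch daemon rm osd.3 --force

# Removes only the service specification, leaving the OSD daemons and
# their data in place - this is what I'm trying to do:
ceph orch rm osd.osd_spec
---snip---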
[ceph-users] Re: Stuck OSD service specification - can't remove
This turns out to be worse than we thought. We attempted another Ceph upgrade (15.2.10 -> 16.2.3) on another cluster and ran into this again. We're seeing strange behavior with the OSD specifications, which also show a count of #OSDs + #hosts; for example, on a 504-OSD cluster (21 nodes of 24 OSDs each, so 504 OSDs + 21 hosts = 525), we see:

osd.osd_spec    504/525    6s    *

It never deletes, and we cannot apply a specification over it (we attempt, and it stays in the deleting state, and a --export does not show any specification). We didn't have this problem on 15.2.10; it appears new in 16.2.x. We are using 16.2.3.

Thanks,
David

On Fri, May 7, 2021 at 9:06 AM David Orman wrote:
>
> Hi,
>
> I'm not attempting to remove the OSDs, but instead the
> service/placement specification. I want the OSDs/data to persist.
> --force did not work on the service, as noted in the original email.
>
> Thank you,
> David
>
> On Fri, May 7, 2021 at 1:36 AM mabi wrote:
> >
> > [...]
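For anyone else who lands here: a sequence worth trying before anything more invasive, using only standard cephadm commands (a sketch, not a confirmed fix; osd_spec.yaml stands in for the spec file shown earlier in the thread):

---snip---
# Inspect what cephadm currently believes the spec is.
ceph orch ls osd --export

# Fail over to a standby mgr; stuck cephadm state sometimes
# clears when the module restarts on a fresh mgr.
ceph mgr fail

# Re-apply the intended specification from a file.
ceph orch apply -i osd_spec.yaml
---snip---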