[ceph-users] Re: Stuck OSD service specification - can't remove

2023-02-21 Thread Eugen Block
Hi, did you ever resolve that? I'm stuck with the same "deleting"  
service in 'ceph orch ls' and found your thread.


Thanks,
Eugen



[ceph-users] Re: Stuck OSD service specification - can't remove

2023-03-15 Thread eblock
I ended up in the same situation while playing around with a test cluster. The
SUSE team has an article [1] for this case; the following steps helped me
resolve the issue. I had three different OSD specs in place for the same three
nodes:

osd                   3           3w   nautilus2;nautilus3
osd.osd-hdd-ssd       3  2m ago   2w   nautilus;nautilus2;nautilus3
osd.osd-hdd-ssd-mix   3  2m ago   -

I replaced the "service_name" with the more suitable value ("osd.osd-hdd-ssd")
in the unit.meta file of each OSD that contained the invalid spec, then
restarted each affected OSD. It probably wouldn't have been necessary, but I
wanted to see the effect immediately, so I also failed over the mgr
(ceph mgr fail). Now I only have one valid OSD spec.

# before
nautilus3:~ # grep service_name 
/var/lib/ceph/201a2fbc-ce7b-44a3-9ed7-39427972083b/osd.3/unit.meta
"service_name": "osd",
# after
nautilus3:~ # grep service_name 
/var/lib/ceph/201a2fbc-ce7b-44a3-9ed7-39427972083b/osd.3/unit.meta 
"service_name": "osd.osd-hdd-ssd",

nautilus3:~ # ceph orch ls osd
NAME             PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd.osd-hdd-ssd         9        10m ago    2w   nautilus;nautilus2;nautilus3

Regards,
Eugen

[1] https://www.suse.com/support/kb/doc/?id=20667


[ceph-users] Re: Stuck OSD service specification - can't remove

2023-03-16 Thread Eugen Block

Hi,

I tried to respond directly in the web UI of the mailing list, but my message
got queued for moderation. I just wanted to share a solution that worked for
me when a service spec is stuck in a pending state; maybe this will help
others in the same situation.


While playing around with a test cluster I ended up with a "deleting" OSD
service spec. The SUSE team has an article [1] for this case; the following
steps helped me resolve the issue. I had three different OSD specs in place
for the same three nodes:


---snip---
osd                   3           3w   nautilus2;nautilus3
osd.osd-hdd-ssd       3  2m ago   2w   nautilus;nautilus2;nautilus3
osd.osd-hdd-ssd-mix   3  2m ago   -
---snip---

I replaced the "service_name" with the more suitable value ("osd.osd-hdd-ssd")
in the unit.meta file of each OSD that contained the invalid spec, then
restarted each affected OSD. It probably wouldn't have been necessary, but I
wanted to see the effect immediately, so I also failed over the mgr
(ceph mgr fail). Now I only have one valid OSD spec.


---snip---
# before
nautilus3:~ # grep service_name
/var/lib/ceph/201a2fbc-ce7b-44a3-9ed7-39427972083b/osd.3/unit.meta
"service_name": "osd",

# after
nautilus3:~ # grep service_name
/var/lib/ceph/201a2fbc-ce7b-44a3-9ed7-39427972083b/osd.3/unit.meta
"service_name": "osd.osd-hdd-ssd",

nautilus3:~ # ceph orch ls osd
NAME             PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd.osd-hdd-ssd         9        10m ago    2w   nautilus;nautilus2;nautilus3
---snip---
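
In case someone wants to script the unit.meta edit across all OSDs of a host,
here is a minimal sketch of what I did (it assumes my cluster fsid and the
target spec name "osd.osd-hdd-ssd"; adjust both, and run it on each affected
OSD host):

---snip---
#!/bin/bash
# point every OSD's unit.meta at the intended spec and restart the daemon
FSID=201a2fbc-ce7b-44a3-9ed7-39427972083b   # cluster fsid (adjust)
SPEC=osd.osd-hdd-ssd                        # service_name the OSDs should carry (adjust)

for meta in /var/lib/ceph/${FSID}/osd.*/unit.meta; do
    osd=$(basename "$(dirname "$meta")")    # e.g. osd.3
    sed -i "s/\"service_name\": \"[^\"]*\"/\"service_name\": \"${SPEC}\"/" "$meta"
    systemctl restart "ceph-${FSID}@${osd}.service"
done

# make the orchestrator pick up the change immediately
ceph mgr fail
---snip---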

Regards,
Eugen

[1] https://www.suse.com/support/kb/doc/?id=20667


[ceph-users] Re: Stuck OSD service specification - can't remove

2024-05-01 Thread Wang Jie
Hello David, did you resolve it? I have the same problem for rgw. I upgraded
from Nautilus to Pacific.


Regards,
Jie


[ceph-users] Re: Stuck OSD service specification - can't remove

2021-05-07 Thread mabi
Hi David,

I had a similar issue yesterday when I wanted to remove an OSD on an OSD node
that had 2 OSDs. For that I used the "ceph orch osd rm" command, which
completed successfully, but after rebooting that OSD node I saw it was still
trying to start the systemd service for that OSD, and one CPU core was 100%
busy with a "crun delete", which I suppose was trying to delete an image or
container. So I killed that process, and I also had to run the following
command:

ceph orch daemon rm osd.3 --force

After that, everything was fine again. This is a Ceph 15.2.11 cluster on Ubuntu
20.04 with podman.
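
For reference, the whole sequence looks roughly like this (just a sketch,
assuming OSD id 3 as in my case; the last command is only needed if the
daemon lingers after a reboot):

# drain and remove the OSD through the orchestrator
ceph orch osd rm 3
ceph orch osd rm status          # repeat until the removal has finished

# if the host still tries to start the old OSD service after a reboot,
# remove the leftover daemon entry as well
ceph orch daemon rm osd.3 --force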

Hope that helps.

‐‐‐ Original Message ‐‐‐
On Friday, May 7, 2021 1:24 AM, David Orman  wrote:

> Has anybody run into a 'stuck' OSD service specification? I've tried
> to delete it, but it's stuck in 'deleting' state, and has been for
> quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
>
> NAME PORTS RUNNING REFRESHED AGE PLACEMENT
> osd.osd_spec 504/525  12m label:osd
> root@ceph01:/# ceph orch rm osd.osd_spec
> Removed service osd.osd_spec
>
> From active monitor:
>
> debug 2021-05-06T23:14:48.909+ 7f17d310b700 0
> log_channel(cephadm) log [INF] : Remove service osd.osd_spec
>
> Yet in ls, it's still there, same as above. --export on it:
>
> root@ceph01:/# ceph orch ls osd.osd_spec --export
> service_type: osd
> service_id: osd_spec
> service_name: osd.osd_spec
> placement: {}
> unmanaged: true
> spec:
>   filter_logic: AND
>   objectstore: bluestore
>
> We've tried --force, as well, with no luck.
>
> To be clear, the --export even prior to delete looks nothing like the
> actual service specification we're using, even after I re-apply it, so
> something seems 'bugged'. Here's the OSD specification we're applying:
>
> service_type: osd
> service_id: osd_spec
> placement:
>   label: "osd"
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
> db_slots: 12
>
> I would appreciate any insight into how to clear this up (without
> removing the actual OSDs, we're just wanting to apply the updated
> service specification - we used to use host placement rules and are
> switching to label-based).
>
> Thanks,
> David
>


[ceph-users] Re: Stuck OSD service specification - can't remove

2021-05-07 Thread David Orman
Hi,

I'm not attempting to remove the OSDs, but instead the
service/placement specification. I want the OSDs/data to persist.
--force did not work on the service, as noted in the original email.
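
Just to make the distinction explicit (illustrative commands only):

# removes only the service/placement specification; the OSD daemons and data stay
ceph orch rm osd.osd_spec

# removes an individual daemon (mabi's workaround) - not what we're after here
ceph orch daemon rm osd.3 --force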

Thank you,
David

On Fri, May 7, 2021 at 1:36 AM mabi  wrote:
>
> Hi David,
>
> I had a similar issue yesterday where I wanted to remove an OSD on an OSD 
> node which had 2 OSDs so for that I used "ceph orch osd rm" command which 
> completed successfully but after rebooting that OSD node I saw it was still 
> trying to start the systemd service for that OSD and one CPU core was 100% 
> busy trying to do a "crun delete" which I suppose here is trying to delete an 
> image or container. So what I did here is to kill this process and I also had 
> to run the following command:
>
> ceph orch daemon rm osd.3 --force
>
> After that everything was fine again. This is a Ceph 15.2.11 cluster on 
> Ubuntu 20.04 and podman.
>
> Hope that helps.
>
> ‐‐‐ Original Message ‐‐‐
> On Friday, May 7, 2021 1:24 AM, David Orman  wrote:
>
> > Has anybody run into a 'stuck' OSD service specification? I've tried
> > to delete it, but it's stuck in 'deleting' state, and has been for
> > quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
> >
> > NAME PORTS RUNNING REFRESHED AGE PLACEMENT
> > osd.osd_spec 504/525  12m label:osd
> > root@ceph01:/# ceph orch rm osd.osd_spec
> > Removed service osd.osd_spec
> >
> > From active monitor:
> >
> > debug 2021-05-06T23:14:48.909+ 7f17d310b700 0
> > log_channel(cephadm) log [INF] : Remove service osd.osd_spec
> >
> > Yet in ls, it's still there, same as above. --export on it:
> >
> > root@ceph01:/# ceph orch ls osd.osd_spec --export
> > service_type: osd
> > service_id: osd_spec
> > service_name: osd.osd_spec
> > placement: {}
> > unmanaged: true
> > spec:
> >   filter_logic: AND
> >   objectstore: bluestore
> >
> > We've tried --force, as well, with no luck.
> >
> > To be clear, the --export even prior to delete looks nothing like the
> > actual service specification we're using, even after I re-apply it, so
> > something seems 'bugged'. Here's the OSD specification we're applying:
> >
> > service_type: osd
> > service_id: osd_spec
> > placement:
> >   label: "osd"
> > data_devices:
> >   rotational: 1
> > db_devices:
> >   rotational: 0
> > db_slots: 12
> >
> > I would appreciate any insight into how to clear this up (without
> > removing the actual OSDs, we're just wanting to apply the updated
> > service specification - we used to use host placement rules and are
> > switching to label-based).
> >
> > Thanks,
> > David
> >


[ceph-users] Re: Stuck OSD service specification - can't remove

2021-05-10 Thread David Orman
This turns out to be worse than we thought. We attempted another Ceph
upgrade (15.2.10->16.2.3) on another cluster and have run into this
again. We're seeing strange behavior with the OSD specifications, which
report a count of #OSDs + #hosts (504 + 21 = 525 here); for example, on
a 504-OSD cluster (21 nodes with 24 OSDs each), we see:

osd.osd_spec    504/525    6s    *

It never deletes, and we cannot apply a specification over it (we
attempt, and it stays in deleting state - and a --export does not show
any specification).

On 15.2.10 we didn't have this problem; it appears to be new in 16.2.x. We
are using 16.2.3.
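
In case it helps with debugging, the spec cephadm is acting on can also be
inspected directly in the mon config-key store (a sketch only; I believe
cephadm keeps specs under mgr/cephadm/spec.<service_name>, so adjust if your
layout differs):

# list the stored service specs
ceph config-key ls | grep mgr/cephadm/spec

# dump what cephadm has stored for the stuck spec
ceph config-key get mgr/cephadm/spec.osd.osd_spec

# last resort (untested here): drop the stored spec, restart the cephadm
# mgr module, then re-apply the intended specification
ceph config-key rm mgr/cephadm/spec.osd.osd_spec
ceph mgr fail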

Thanks,
David


On Fri, May 7, 2021 at 9:06 AM David Orman  wrote:
>
> Hi,
>
> I'm not attempting to remove the OSDs, but instead the
> service/placement specification. I want the OSDs/data to persist.
> --force did not work on the service, as noted in the original email.
>
> Thank you,
> David
>
> On Fri, May 7, 2021 at 1:36 AM mabi  wrote:
> >
> > Hi David,
> >
> > I had a similar issue yesterday where I wanted to remove an OSD on an OSD 
> > node which had 2 OSDs so for that I used "ceph orch osd rm" command which 
> > completed successfully but after rebooting that OSD node I saw it was still 
> > trying to start the systemd service for that OSD and one CPU core was 100% 
> > busy trying to do a "crun delete" which I suppose here is trying to delete 
> > an image or container. So what I did here is to kill this process and I 
> > also had to run the following command:
> >
> > ceph orch daemon rm osd.3 --force
> >
> > After that everything was fine again. This is a Ceph 15.2.11 cluster on 
> > Ubuntu 20.04 and podman.
> >
> > Hope that helps.
> >
> > ‐‐‐ Original Message ‐‐‐
> > On Friday, May 7, 2021 1:24 AM, David Orman  wrote:
> >
> > > Has anybody run into a 'stuck' OSD service specification? I've tried
> > > to delete it, but it's stuck in 'deleting' state, and has been for
> > > quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
> > >
> > > NAME PORTS RUNNING REFRESHED AGE PLACEMENT
> > > osd.osd_spec 504/525  12m label:osd
> > > root@ceph01:/# ceph orch rm osd.osd_spec
> > > Removed service osd.osd_spec
> > >
> > > From active monitor:
> > >
> > > debug 2021-05-06T23:14:48.909+ 7f17d310b700 0
> > > log_channel(cephadm) log [INF] : Remove service osd.osd_spec
> > >
> > > Yet in ls, it's still there, same as above. --export on it:
> > >
> > > root@ceph01:/# ceph orch ls osd.osd_spec --export
> > > service_type: osd
> > > service_id: osd_spec
> > > service_name: osd.osd_spec
> > > placement: {}
> > > unmanaged: true
> > > spec:
> > >   filter_logic: AND
> > >   objectstore: bluestore
> > >
> > > We've tried --force, as well, with no luck.
> > >
> > > To be clear, the --export even prior to delete looks nothing like the
> > > actual service specification we're using, even after I re-apply it, so
> > > something seems 'bugged'. Here's the OSD specification we're applying:
> > >
> > > service_type: osd
> > > service_id: osd_spec
> > > placement:
> > >   label: "osd"
> > > data_devices:
> > >   rotational: 1
> > > db_devices:
> > >   rotational: 0
> > > db_slots: 12
> > >
> > > I would appreciate any insight into how to clear this up (without
> > > removing the actual OSDs, we're just wanting to apply the updated
> > > service specification - we used to use host placement rules and are
> > > switching to label-based).
> > >
> > > Thanks,
> > > David
> > >