I've been using this procedure to remove OSDs...
OSD_ID=        # fill in the number of the OSD being removed
ceph auth del osd.${OSD_ID}
ceph osd down ${OSD_ID}
ceph osd out ${OSD_ID}
ceph osd rm ${OSD_ID}
ceph osd crush remove osd.${OSD_ID}
systemctl disable ceph-osd@${OSD_ID}.service
systemctl stop ceph-osd@${OSD_ID}.service
sed -i "/ceph-${OSD_ID}/d" /etc/fstab
umount /var/lib/ceph/osd/ceph-${OSD_ID}
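And, per David's point below, once the host's last OSD is gone I plan to drop the now-empty host bucket from the CRUSH map as well ("deadhost" is a placeholder for the actual host bucket name):

```shell
# Remove the now-empty host bucket so it no longer carries CRUSH weight;
# "deadhost" is a placeholder for the real bucket name from `ceph osd tree`.
ceph osd crush remove deadhost
```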
Would you say this is the correct order of events?
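For what it's worth, after each removal I've been sanity-checking with the following (a sketch; it obviously needs the live cluster):

```shell
# Verify the OSD is really gone from both the CRUSH map and the osdmap:
ceph osd tree                  # osd.${OSD_ID} absent; host weight drops
ceph osd dump | grep -w "osd.${OSD_ID}"   # should print nothing once removed
ceph -s                        # wait for HEALTH_OK after rebalancing settles
```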
Thanks!
On Wed, Jun 28, 2017 at 9:34 AM, David Turner <[email protected]> wrote:
> A couple of things. You didn't `ceph osd crush remove osd.21` after
> doing the other bits. Also, you will want to remove the bucket (i.e.,
> the host) from the crush map, as it will now be empty. Right now you
> have a host in the crush map with a weight but no osds to put that data
> on. It has a weight because of the 2 OSDs that are still in it, which
> were removed from the cluster but not from the crush map. That is
> confusing to your cluster.
>
> If you had removed the OSDs from the crush map when you ran the other
> commands, then the dead host would have still been in the crush map but
> with a weight of 0 and wouldn't cause any problems.
>
> On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak <[email protected]> wrote:
>
>> Hello,
>>
>> TL;DR: what to do when my cluster reports stuck unclean pgs?
>>
>> Detailed description:
>>
>> One of the nodes in my cluster died. Ceph correctly rebalanced itself
>> and reached the HEALTH_OK state. I looked at the failed server and
>> decided to take it out of the cluster permanently, because the
>> hardware is indeed faulty. It used to host two OSDs, which were marked
>> down and out in "ceph osd dump".
>>
>> So from the HEALTH_OK I ran the following commands:
>>
>> # ceph auth del osd.20
>> # ceph auth del osd.21
>> # ceph osd rm osd.20
>> # ceph osd rm osd.21
>>
>> After that, CEPH started to rebalance itself, but now it reports some PGs
>> as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":
>>
>> # ceph -s
>> cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
>> health HEALTH_WARN
>> 350 pgs stuck unclean
>> recovery 26/1596390 objects degraded (0.002%)
>> recovery 58772/1596390 objects misplaced (3.682%)
>> monmap e16: 3 mons at {...}
>> election epoch 584, quorum 0,1,2 ...
>> osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
>> flags require_jewel_osds
>> pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
>> 6244 GB used, 40569 GB / 46814 GB avail
>> 26/1596390 objects degraded (0.002%)
>> 58772/1596390 objects misplaced (3.682%)
>> 3426 active+clean
>> 349 active+remapped
>> 1 active
>> client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr
>>
>> # ceph health detail
>> HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded
>> (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
>> pg 28.fa is stuck unclean for 14408925.966824, current state
>> active+remapped, last acting [38,52,4]
>> pg 28.e7 is stuck unclean for 14408925.966886, current state
>> active+remapped, last acting [29,42,22]
>> pg 23.dc is stuck unclean for 61698.641750, current state
>> active+remapped, last acting [50,33,23]
>> pg 23.d9 is stuck unclean for 61223.093284, current state
>> active+remapped, last acting [54,31,23]
>> pg 28.df is stuck unclean for 14408925.967120, current state
>> active+remapped, last acting [33,7,15]
>> pg 34.38 is stuck unclean for 60904.322881, current state
>> active+remapped, last acting [18,41,9]
>> pg 34.fe is stuck unclean for 60904.241762, current state
>> active+remapped, last acting [58,1,44]
>> [...]
>> pg 28.8f is stuck unclean for 66102.059671, current state active, last
>> acting [8,40,5]
>> [...]
>> recovery 26/1596390 objects degraded (0.002%)
>> recovery 58772/1596390 objects misplaced (3.682%)
>>
>> Apart from that, the data stored in CEPH pools seems to be reachable
>> and usable as before.
>>
>> The nodes run CentOS 7 and ceph 10.2.5 (RPMS downloaded from CEPH
>> repository).
>>
>> What other debugging info should I provide, and what should I do to
>> unstick the stuck pgs? Thanks!
>>
>> -Yenya
>>
>> --
>> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
>> | http://www.fi.muni.cz/~kas/                          GPG: 4096R/A45477D5 |
>> > That's why this kind of vulnerability is a concern: deploying stuff is  <
>> > often about collecting an obscene number of .jar files and pushing them <
>> > up to the application server. --pboddie at LWN                          <
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>