I've been using this procedure to remove OSDs...
OSD_ID=        # fill in the number of the OSD being removed
ceph auth del osd.${OSD_ID}
ceph osd down ${OSD_ID}
ceph osd out ${OSD_ID}
ceph osd rm ${OSD_ID}
ceph osd crush remove osd.${OSD_ID}
systemctl disable ceph-osd@${OSD_ID}.service
systemctl stop ceph-osd@${OSD_ID}.service
sed -i "/ceph-${OSD_ID}/d" /etc/fstab
umount /var/lib/ceph/osd/ceph-${OSD_ID}
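And, per David's point below, once the host's last OSD is gone I plan to drop the now-empty host bucket from the CRUSH map as well ("deadhost" is a placeholder for the actual host bucket name):

```shell
# Remove the now-empty host bucket so it no longer carries CRUSH weight;
# "deadhost" is a placeholder for the real bucket name from `ceph osd tree`.
ceph osd crush remove deadhost
```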
Would you say this is the correct order of events?
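For what it's worth, after each removal I've been sanity-checking with the following (a sketch; it obviously needs the live cluster):

```shell
# Verify the OSD is really gone from both the CRUSH map and the osdmap:
ceph osd tree                  # osd.${OSD_ID} absent; host weight drops
ceph osd dump | grep -w "osd.${OSD_ID}"   # should print nothing once removed
ceph -s                        # wait for HEALTH_OK after rebalancing settles
```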
Thanks!
On Wed, Jun 28, 2017 at 9:34 AM, David Turner <[email protected]> wrote:
> A couple of things. You didn't `ceph osd crush remove osd.21` after
> doing the other bits. Also, you will want to remove the bucket (i.e.,
> the host) from the crush map, as it will now be empty. Right now you
> have a host in the crush map with a weight but no osds to put that data
> on. It has a weight because of the 2 OSDs that are still in it, which
> were removed from the cluster but not from the crush map. That is
> confusing to your cluster.
>
> If you had removed the OSDs from the crush map when you ran the other
> commands, then the dead host would have still been in the crush map but
> with a weight of 0 and wouldn't cause any problems.
>
> On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak <[email protected]> wrote:
>
>> Hello,
>>
>> TL;DR: what to do when my cluster reports stuck unclean pgs?
>>
>> Detailed description:
>>
>> One of the nodes in my cluster died. Ceph correctly rebalanced itself
>> and reached the HEALTH_OK state. I looked at the failed server and
>> decided to take it out of the cluster permanently, because the
>> hardware is indeed faulty. It used to host two OSDs, which were marked
>> down and out in "ceph osd dump".
>>
>> So from the HEALTH_OK I ran the following commands:
>>
>> # ceph auth del osd.20
>> # ceph auth del osd.21
>> # ceph osd rm osd.20
>> # ceph osd rm osd.21
>>
>> After that, CEPH started to rebalance itself, but now it reports some PGs
>> as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":
>>
>> # ceph -s
>> cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
>> health HEALTH_WARN
>> 350 pgs stuck unclean
>> recovery 26/1596390 objects degraded (0.002%)
>> recovery 58772/1596390 objects misplaced (3.682%)
>> monmap e16: 3 mons at {...}
>> election epoch 584, quorum 0,1,2 ...
>> osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
>> flags require_jewel_osds
>> pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
>> 6244 GB used, 40569 GB / 46814 GB avail
>> 26/1596390 objects degraded (0.002%)
>> 58772/1596390 objects misplaced (3.682%)
>> 3426 active+clean
>> 349 active+remapped
>> 1 active
>> client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr
>>
>> # ceph health detail
>> HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded
>> (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
>> pg 28.fa is stuck unclean for 14408925.966824, current state
>> active+remapped, last acting [38,52,4]
>> pg 28.e7 is stuck unclean for 14408925.966886, current state
>> active+remapped, last acting [29,42,22]
>> pg 23.dc is stuck unclean for 61698.641750, current state
>> active+remapped, last acting [50,33,23]
>> pg 23.d9 is stuck unclean for 61223.093284, current state
>> active+remapped, last acting [54,31,23]
>> pg 28.df is stuck unclean for 14408925.967120, current state
>> active+remapped, last acting [33,7,15]
>> pg 34.38 is stuck unclean for 60904.322881, current state
>> active+remapped, last acting [18,41,9]
>> pg 34.fe is stuck unclean for 60904.241762, current state
>> active+remapped, last acting [58,1,44]
>> [...]
>> pg 28.8f is stuck unclean for 66102.059671, current state active, last
>> acting [8,40,5]
>> [...]
>> recovery 26/1596390 objects degraded (0.002%)
>> recovery 58772/1596390 objects misplaced (3.682%)
>>
>> Apart from that, the data stored in CEPH pools seems to be reachable
>> and usable as before.
>>
>> The nodes run CentOS 7 and ceph 10.2.5 (RPMS downloaded from CEPH
>> repository).
>>
>> What other debugging info should I provide, and what should I do to
>> unstick the stuck pgs? Thanks!
>>
>> -Yenya
>>
>> --
>> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
>> | http://www.fi.muni.cz/~kas/                          GPG: 4096R/A45477D5 |
>> > That's why this kind of vulnerability is a concern: deploying stuff is  <
>> > often about collecting an obscene number of .jar files and pushing them <
>> > up to the application server. --pboddie at LWN                          <
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>