Caspar,

I found Nick Fisk's post yesterday,
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023223.html
set osd_max_pg_per_osd_hard_ratio = 4 in ceph.conf on the OSD nodes, and
restarted the 10TB OSDs. The PGs went back to active and recovery is now
complete.
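
In case it helps anyone else, the change itself is small. Roughly what I ended
up with (I put the option in the [osd] section of ceph.conf; my OSDs are
managed by systemd, so adjust the restart step to your deployment):

[osd]
osd_max_pg_per_osd_hard_ratio = 4

systemctl restart ceph-osd@76    # repeat for each 10TB OSD id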

My setup is similar to his in that there's a large difference in OSD size,
most are 1.8TB, but about 10% of them are 10TB.

The difference is that I had a functional Luminous cluster until I increased
the number of 10TB OSDs from 6 to 8. I'm still not sure why that caused *more*
PGs per OSD with the same pools.
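
My best guess after reading Nick's post (back-of-the-envelope, assuming the
Luminous defaults of mon_max_pg_per_osd = 200 and
osd_max_pg_per_osd_hard_ratio = 2):

    hard PG limit per OSD = 200 * 2 = 400

The ceph osd df output further down shows 484-489 PGs on each of the existing
10TB OSDs, and as I understand it the limit is only enforced when a PG peers,
so everything stayed active until the new OSDs triggered a remap and those PGs
had to re-peer. On top of that, if I read Nick's post right, an OSD counts PGs
for both the old and the new mapping while backfill is in flight, which would
push the numbers even higher.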

Thanks!

Daniel


On Wed, Dec 20, 2017 at 10:23 AM, Caspar Smit <caspars...@supernas.eu>
wrote:

> Hi Daniel,
>
> I've had the same problem when creating a new 12.2.2 cluster: I couldn't get
> some PGs out of the "activating+remapped" state after I switched some OSDs
> from one chassis to another (there was no data on them yet).
>
> I tried restarting the OSDs, to no avail.
>
> I couldn't find anything about PGs stuck in the "activating+remapped" state,
> so in the end I threw away the pool and started over.
>
> Could this be a bug in 12.2.2?
>
> Kind regards,
> Caspar
>
> 2017-12-20 15:48 GMT+01:00 Daniel K <satha...@gmail.com>:
>
>> Just an update.
>>
>> Recovery completed, but the PGs are still inactive.
>>
>> I'm still having a hard time understanding why adding OSDs caused this. I'm
>> on 12.2.2.
>>
>> user@admin:~$ ceph -s
>>   cluster:
>>     id:     a3672c60-3051-440c-bd83-8aff7835ce53
>>     health: HEALTH_WARN
>>             Reduced data availability: 307 pgs inactive
>>             Degraded data redundancy: 307 pgs unclean
>>
>>   services:
>>     mon: 5 daemons, quorum stor585r2u8a,stor585r2u12a,stor585r2u16a,stor585r2u20a,stor585r2u24a
>>     mgr: stor585r2u8a(active)
>>     osd: 88 osds: 87 up, 87 in; 133 remapped pgs
>>
>>   data:
>>     pools:   12 pools, 3016 pgs
>>     objects: 387k objects, 1546 GB
>>     usage:   3313 GB used, 186 TB / 189 TB avail
>>     pgs:     10.179% pgs not active
>>              2709 active+clean
>>              174  activating
>>              133  activating+remapped
>>
>>   io:
>>     client:   8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr
>>
>>
>> On Tue, Dec 19, 2017 at 8:57 PM, Daniel K <satha...@gmail.com> wrote:
>>
>>> I'm trying to understand why adding OSDs would cause PGs to go inactive.
>>>
>>> This cluster has 88 OSDs, and had 6 OSDs with the device class "hdd_10TB_7.2k".
>>>
>>> I added two more OSDs and set their device class to "hdd_10TB_7.2k", and 10%
>>> of the PGs went inactive.
>>>
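>>> For reference, setting the class on the two new OSDs was along these lines
>>> (osd.86 and osd.87 are the new ones, per the df output below; the
>>> rm-device-class step is usually needed because freshly created OSDs come up
>>> with a plain "hdd" class already assigned):
>>>
>>> ceph osd crush rm-device-class osd.86 osd.87
>>> ceph osd crush set-device-class hdd_10TB_7.2k osd.86 osd.87
>>>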
>>> I have an EC pool on these OSDs with the profile:
>>> user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2
>>> crush-device-class=hdd_10TB_7.2k
>>> crush-failure-domain=host
>>> crush-root=default
>>> k=4
>>> m=2
>>> plugin=isa
>>> technique=reed_sol_van
>>>
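>>> (For completeness, a profile like that gets created with something along the
>>> lines of:
>>>
>>> ceph osd erasure-code-profile set ISA_10TB_7.2k_4.2 \
>>>     k=4 m=2 plugin=isa technique=reed_sol_van \
>>>     crush-device-class=hdd_10TB_7.2k crush-failure-domain=host \
>>>     crush-root=default
>>>
>>> and the EC pool then references it at creation time.)
>>>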
>>> Some output from ceph osd df and ceph health detail:
>>> user@admin:~$ ceph osd df |grep 10TB
>>> 76 hdd_10TB_7.2k 9.09509  1.00000 9313G   349G 8963G 3.76 2.20 488
>>> 20 hdd_10TB_7.2k 9.09509  1.00000 9313G   345G 8967G 3.71 2.17 489
>>> 28 hdd_10TB_7.2k 9.09509  1.00000 9313G   344G 8968G 3.70 2.17 484
>>> 36 hdd_10TB_7.2k 9.09509  1.00000 9313G   345G 8967G 3.71 2.17 484
>>> 87 hdd_10TB_7.2k 9.09560  1.00000 9313G  8936M 9305G 0.09 0.05 311
>>> 86 hdd_10TB_7.2k 9.09560  1.00000 9313G  8793M 9305G 0.09 0.05 304
>>>  6 hdd_10TB_7.2k 9.09509  1.00000 9313G   344G 8969G 3.70 2.16 471
>>> 68 hdd_10TB_7.2k 9.09509  1.00000 9313G   344G 8969G 3.70 2.17 480
>>> user@admin:~$ ceph health detail|grep inactive
>>> HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data
>>> availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean
>>> PG_AVAILABILITY Reduced data availability: 307 pgs inactive
>>>     pg 24.60 is stuck inactive for 1947.792377, current state
>>> activating+remapped, last acting [36,20,76,6,68,28]
>>>     pg 24.63 is stuck inactive for 1946.571425, current state
>>> activating+remapped, last acting [28,76,6,20,68,36]
>>>     pg 24.71 is stuck inactive for 1947.625988, current state
>>> activating+remapped, last acting [6,68,20,36,28,76]
>>>     pg 24.73 is stuck inactive for 1947.705250, current state
>>> activating+remapped, last acting [36,6,20,76,68,28]
>>>     pg 24.74 is stuck inactive for 1947.828063, current state
>>> activating+remapped, last acting [68,36,28,20,6,76]
>>>     pg 24.75 is stuck inactive for 1947.475644, current state
>>> activating+remapped, last acting [6,28,76,36,20,68]
>>>     pg 24.76 is stuck inactive for 1947.712046, current state
>>> activating+remapped, last acting [20,76,6,28,68,36]
>>>     pg 24.78 is stuck inactive for 1946.576304, current state
>>> activating+remapped, last acting [76,20,68,36,6,28]
>>>     pg 24.7a is stuck inactive for 1947.820932, current state
>>> activating+remapped, last acting [36,20,28,68,6,76]
>>>     pg 24.7b is stuck inactive for 1947.858305, current state
>>> activating+remapped, last acting [68,6,20,28,76,36]
>>>     pg 24.7c is stuck inactive for 1947.753917, current state
>>> activating+remapped, last acting [76,6,20,36,28,68]
>>>     pg 24.7d is stuck inactive for 1947.803229, current state
>>> activating+remapped, last acting [68,6,28,20,36,76]
>>>     pg 24.7f is stuck inactive for 1947.792506, current state
>>> activating+remapped, last acting [28,20,76,6,68,36]
>>>     pg 24.8a is stuck inactive for 1947.823189, current state
>>> activating+remapped, last acting [28,76,20,6,36,68]
>>>     pg 24.8b is stuck inactive for 1946.579755, current state
>>> activating+remapped, last acting [76,68,20,28,6,36]
>>>     pg 24.8c is stuck inactive for 1947.555872, current state
>>> activating+remapped, last acting [76,36,68,6,28,20]
>>>     pg 24.8d is stuck inactive for 1946.589814, current state
>>> activating+remapped, last acting [36,6,28,76,68,20]
>>>     pg 24.8e is stuck inactive for 1947.802894, current state
>>> activating+remapped, last acting [28,6,68,36,76,20]
>>>     pg 24.8f is stuck inactive for 1947.528603, current state
>>> activating+remapped, last acting [76,28,6,68,20,36]
>>>     pg 25.60 is stuck inactive for 1947.620823, current state
>>> activating, last acting [20,6,87,36,28,68]
>>>     pg 25.61 is stuck inactive for 1947.883517, current state
>>> activating, last acting [28,36,86,76,6,87]
>>>     pg 25.62 is stuck inactive for 1542089.552271, current state
>>> activating, last acting [86,6,76,20,87,68]
>>>     pg 25.70 is stuck inactive for 1542089.729631, current state
>>> activating, last acting [86,87,76,20,68,28]
>>>     pg 25.71 is stuck inactive for 1947.642271, current state
>>> activating, last acting [28,86,68,20,6,36]
>>>     pg 25.75 is stuck inactive for 1947.825872, current state
>>> activating, last acting [68,86,36,20,76,6]
>>>     pg 25.76 is stuck inactive for 1947.737307, current state
>>> activating, last acting [36,87,28,6,68,76]
>>>     pg 25.77 is stuck inactive for 1947.218420, current state
>>> activating, last acting [87,36,86,28,76,6]
>>>     pg 25.79 is stuck inactive for 1947.253871, current state
>>> activating, last acting [6,36,86,28,68,76]
>>>     pg 25.7a is stuck inactive for 1542089.794085, current state
>>> activating, last acting [86,36,68,20,76,87]
>>>     pg 25.7c is stuck inactive for 1947.666774, current state
>>> activating, last acting [20,86,36,6,76,87]
>>>     pg 25.8a is stuck inactive for 1542089.687299, current state
>>> activating, last acting [87,36,68,20,86,28]
>>>     pg 25.8c is stuck inactive for 1947.545965, current state
>>> activating, last acting [76,6,28,87,36,86]
>>>     pg 25.8d is stuck inactive for 1947.213627, current state
>>> activating, last acting [86,36,87,20,28,76]
>>>     pg 25.8e is stuck inactive for 1947.230754, current state
>>> activating, last acting [87,86,68,28,76,20]
>>>     pg 25.8f is stuck inactive for 1542089.800416, current state
>>> activating, last acting [86,76,20,68,36,28]
>>>     pg 34.40 is stuck inactive for 1947.641110, current state
>>> activating, last acting [20,36,87,6,86,28]
>>>     pg 34.41 is stuck inactive for 1947.759524, current state
>>> activating, last acting [28,86,36,68,76,87]
>>>     pg 34.42 is stuck inactive for 1947.656110, current state
>>> activating, last acting [68,36,87,28,6,86]
>>>     pg 34.44 is stuck inactive for 1947.659653, current state
>>> activating, last acting [28,68,6,36,87,20]
>>>     pg 34.45 is stuck inactive for 1542089.795364, current state
>>> activating, last acting [86,28,76,36,6,68]
>>>     pg 34.46 is stuck inactive for 1947.570029, current state
>>> activating, last acting [28,20,87,6,86,36]
>>>     pg 34.47 is stuck inactive for 1947.667102, current state
>>> activating, last acting [20,86,68,76,36,87]
>>>     pg 34.48 is stuck inactive for 1947.632449, current state
>>> activating, last acting [28,76,6,86,87,20]
>>>     pg 34.4b is stuck inactive for 1947.671088, current state
>>> activating, last acting [36,87,68,28,6,20]
>>>     pg 34.4c is stuck inactive for 1947.699305, current state
>>> activating, last acting [20,6,86,28,87,68]
>>>     pg 34.4d is stuck inactive for 1542089.756804, current state
>>> activating, last acting [87,36,20,86,6,68]
>>>     pg 34.58 is stuck inactive for 1947.749120, current state
>>> activating, last acting [28,86,87,76,6,20]
>>>     pg 34.59 is stuck inactive for 1947.584327, current state
>>> activating, last acting [28,20,87,6,86,76]
>>>     pg 34.5a is stuck inactive for 1947.670953, current state
>>> activating, last acting [6,87,36,68,86,76]
>>>     pg 34.5b is stuck inactive for 1947.692114, current state
>>> activating, last acting [68,76,86,6,20,36]
>>>     pg 34.5e is stuck inactive for 1542089.773455, current state
>>> activating, last acting [86,68,28,87,6,36]
>>>
>>>
>>> It looks like recovery is happening, so they should eventually go active, but
>>> I'm trying to figure out what I did wrong and how I can do this in the future
>>> without taking 10% of my PGs offline.
>>>
>>>
>>> Thanks!
>>>
>>>
>>> Dan
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
