Re: [ceph-users] Help Ceph Cluster Down

Arun POONIA Fri, 04 Jan 2019 05:42:22 -0800

Hi Caspar,

Yes and no; the numbers are going up and down. When I run ceph -s
repeatedly, the counts decrease one time and increase again later. There
are a lot of blocked/slow requests, and almost all OSDs have them. Around
12-13% of the PGs are inactive, and I am not sure how to activate them again.
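
As a sanity check on that share, here is a minimal Python sketch that tallies
the PG states from the ceph -s output further down in this message. It
assumes "not active" means the state string lacks the "active" flag (so
"activating" does not count). The function names are my own and not part of
any Ceph tool.

```python
# PG state counts copied from the `ceph -s` output in this message.
PG_STATES = {
    "active+clean": 44937,
    "activating": 3850,
    "active+undersized": 1936,
    "down": 1078,
    "stale+down": 864,
    "peering": 597,
    "activating+degraded": 591,
    "active+undersized+degraded": 316,
    "stale+active+clean": 205,
    "stale+activating": 133,
    "activating+remapped": 67,
    "stale+activating+degraded": 32,
    "stale+activating+remapped": 21,
    "stale+active+undersized": 9,
    "down+remapped": 6,
    "stale+activating+undersized+degraded+remapped": 5,
    "activating+degraded+remapped": 2,
    "stale+activating+degraded+remapped": 1,
    "stale+active+undersized+degraded": 1,
    "remapped+peering": 1,
    "active+clean+remapped": 1,
    "stale+remapped+peering": 1,
    "stale+peering": 1,
    "activating+undersized+degraded+remapped": 1,
}


def is_active(state):
    """Assumption: a PG is active only if 'active' is one of its '+' flags."""
    return "active" in state.split("+")


def inactive_share(states):
    """Return (inactive PG count, total PG count, inactive percentage)."""
    total = sum(states.values())
    inactive = sum(n for s, n in states.items() if not is_active(s))
    return inactive, total, 100.0 * inactive / total


inactive, total, pct = inactive_share(PG_STATES)
print(f"{inactive} of {total} PGs not active ({pct:.3f}%)")
```

With the counts above, the total comes to the 54656 PGs reported under
"pools", and the share agrees with the "pgs not active" line in the same
output.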



[root@fre101 ~]# ceph health detail
2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2)
No such file or directory
HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than average;
472812/12392654 objects misplaced (3.815%); 3610 PGs pending on creation;
Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86 pgs
peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 objects
degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow
requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec;
too many PGs per OSD (2709 > max 200)
OSD_DOWN 1 osds down
    osd.28 (root=default,host=fre119) is down
MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
    pool glance-images objects per pg (10478) is more than 92.7257 times
cluster average (113)
    pool vms objects per pg (4717) is more than 41.7434 times cluster
average (113)
    pool volumes objects per pg (1220) is more than 10.7965 times cluster
average (113)
OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
PENDING_CREATING_PGS 3610 PGs pending on creation
    osds
[osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
have pending PGs.
PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs
down, 86 pgs peering, 850 pgs stale
    pg 10.900 is down, acting [18]
    pg 10.90e is stuck inactive for 60266.030164, current state activating,
last acting [2,38]
    pg 10.913 is stuck stale for 1887.552862, current state stale+down,
last acting [9]
    pg 10.915 is stuck inactive for 60266.215231, current state activating,
last acting [30,38]
    pg 11.903 is stuck inactive for 59294.465961, current state activating,
last acting [11,38]
    pg 11.910 is down, acting [21]
    pg 11.919 is down, acting [25]
    pg 12.902 is stuck inactive for 57118.544590, current state activating,
last acting [36,14]
    pg 13.8f8 is stuck inactive for 60707.167787, current state activating,
last acting [29,37]
    pg 13.901 is stuck stale for 60226.543289, current state
stale+active+clean, last acting [1,31]
    pg 13.905 is stuck inactive for 60266.050940, current state activating,
last acting [2,36]
    pg 13.909 is stuck inactive for 60707.160714, current state activating,
last acting [34,36]
    pg 13.90e is stuck inactive for 60707.410749, current state activating,
last acting [21,36]
    pg 13.911 is down, acting [25]
    pg 13.914 is stale+down, acting [29]
    pg 13.917 is stuck stale for 580.224688, current state stale+down, last
acting [16]
    pg 14.901 is stuck inactive for 60266.037762, current state
activating+degraded, last acting [22,37]
    pg 14.90f is stuck inactive for 60296.996447, current state activating,
last acting [30,36]
    pg 14.910 is stuck inactive for 60266.077310, current state
activating+degraded, last acting [17,37]
    pg 14.915 is stuck inactive for 60266.032445, current state activating,
last acting [34,36]
    pg 15.8fa is stuck stale for 560.223249, current state stale+down, last
acting [8]
    pg 15.90c is stuck inactive for 59294.402388, current state activating,
last acting [29,38]
    pg 15.90d is stuck inactive for 60266.176492, current state activating,
last acting [5,36]
    pg 15.915 is down, acting [0]
    pg 15.917 is stuck inactive for 56279.658951, current state activating,
last acting [13,38]
    pg 15.91c is stuck stale for 374.590704, current state stale+down, last
acting [12]
    pg 16.903 is stuck inactive for 56580.905961, current state activating,
last acting [25,37]
    pg 16.90e is stuck inactive for 60266.271680, current state activating,
last acting [14,37]
    pg 16.919 is stuck inactive for 59901.802184, current state activating,
last acting [20,37]
    pg 16.91e is stuck inactive for 60297.038159, current state activating,
last acting [22,37]
    pg 17.8e5 is stuck inactive for 60266.149061, current state activating,
last acting [25,36]
    pg 17.910 is stuck inactive for 59901.850204, current state activating,
last acting [26,37]
    pg 17.913 is stuck inactive for 60707.208364, current state activating,
last acting [13,36]
    pg 17.91a is stuck inactive for 60266.187509, current state activating,
last acting [4,37]
    pg 17.91f is down, acting [6]
    pg 18.908 is stuck inactive for 60707.216314, current state activating,
last acting [10,36]
    pg 18.911 is stuck stale for 244.570413, current state stale+down, last
acting [34]
    pg 18.919 is stuck inactive for 60265.980816, current state activating,
last acting [28,36]
    pg 18.91a is stuck inactive for 59901.814714, current state activating,
last acting [28,37]
    pg 18.91e is stuck inactive for 60707.179338, current state activating,
last acting [0,36]
    pg 19.90a is stuck inactive for 60203.089988, current state activating,
last acting [35,38]
    pg 20.8e0 is stuck inactive for 60296.839098, current state
activating+degraded, last acting [18,37]
    pg 20.913 is stuck inactive for 60296.977401, current state
activating+degraded, last acting [11,37]
    pg 20.91d is stuck inactive for 60296.891370, current state
activating+degraded, last acting [10,38]
    pg 21.8e1 is stuck inactive for 60707.422330, current state activating,
last acting [21,38]
    pg 21.907 is stuck inactive for 60296.855511, current state activating,
last acting [20,36]
    pg 21.90e is stuck inactive for 60266.055557, current state activating,
last acting [1,38]
    pg 21.917 is stuck inactive for 60296.940074, current state activating,
last acting [15,36]
    pg 22.90b is stuck inactive for 60707.286070, current state activating,
last acting [20,36]
    pg 22.90c is stuck inactive for 59901.788199, current state activating,
last acting [20,37]
    pg 22.90f is stuck inactive for 60297.062020, current state activating,
last acting [38,35]
PG_DEGRADED Degraded data redundancy: 216694/12392654 objects degraded
(1.749%), 866 pgs degraded, 16 pgs undersized
    pg 12.85a is active+undersized+degraded, acting [3]
    pg 14.843 is activating+degraded, acting [7,38]
    pg 14.85f is activating+degraded, acting [25,36]
    pg 14.865 is activating+degraded, acting [33,37]
    pg 14.87a is activating+degraded, acting [28,36]
    pg 14.87e is activating+degraded, acting [17,38]
    pg 14.882 is activating+degraded, acting [4,36]
    pg 14.88a is activating+degraded, acting [2,37]
    pg 14.893 is activating+degraded, acting [24,36]
    pg 14.897 is active+undersized+degraded, acting [34]
    pg 14.89c is activating+degraded, acting [14,38]
    pg 14.89e is activating+degraded, acting [15,38]
    pg 14.8a8 is active+undersized+degraded, acting [33]
    pg 14.8b1 is activating+degraded, acting [30,38]
    pg 14.8d4 is active+undersized+degraded, acting [13]
    pg 14.8d8 is active+undersized+degraded, acting [4]
    pg 14.8e6 is active+undersized+degraded, acting [10]
    pg 14.8e7 is active+undersized+degraded, acting [1]
    pg 14.8ef is activating+degraded, acting [9,36]
    pg 14.8f8 is active+undersized+degraded, acting [30]
    pg 14.901 is activating+degraded, acting [22,37]
    pg 14.910 is activating+degraded, acting [17,37]
    pg 14.913 is active+undersized+degraded, acting [18]
    pg 20.821 is activating+degraded, acting [37,33]
    pg 20.825 is activating+degraded, acting [25,36]
    pg 20.84f is active+undersized+degraded, acting [2]
    pg 20.85a is active+undersized+degraded, acting [11]
    pg 20.85f is activating+degraded, acting [1,38]
    pg 20.865 is activating+degraded, acting [8,38]
    pg 20.869 is activating+degraded, acting [27,37]
    pg 20.87b is active+undersized+degraded, acting [30]
    pg 20.88b is activating+degraded, acting [6,38]
    pg 20.895 is activating+degraded, acting [37,27]
    pg 20.89c is activating+degraded, acting [1,36]
    pg 20.8a3 is activating+degraded, acting [30,36]
    pg 20.8ad is activating+degraded, acting [1,38]
    pg 20.8af is activating+degraded, acting [33,37]
    pg 20.8b7 is activating+degraded, acting [0,38]
    pg 20.8b9 is activating+degraded, acting [20,38]
    pg 20.8d4 is activating+degraded, acting [28,37]
    pg 20.8d5 is activating+degraded, acting [24,37]
    pg 20.8e0 is activating+degraded, acting [18,37]
    pg 20.8e3 is activating+degraded, acting [21,38]
    pg 20.8ea is activating+degraded, acting [17,36]
    pg 20.8ee is active+undersized+degraded, acting [4]
    pg 20.8f2 is activating+degraded, acting [3,36]
    pg 20.8fb is activating+degraded, acting [10,38]
    pg 20.8fc is activating+degraded, acting [20,38]
    pg 20.913 is activating+degraded, acting [11,37]
    pg 20.916 is active+undersized+degraded, acting [21]
    pg 20.91d is activating+degraded, acting [10,38]
REQUEST_SLOW 116082 slow requests are blocked > 32 sec
    10619 ops are blocked > 2097.15 sec
    74227 ops are blocked > 1048.58 sec
    18561 ops are blocked > 524.288 sec
    10862 ops are blocked > 262.144 sec
    1037 ops are blocked > 131.072 sec
    520 ops are blocked > 65.536 sec
    256 ops are blocked > 32.768 sec
    osd.29 has blocked requests > 32.768 sec
    osd.15 has blocked requests > 262.144 sec
    osds 12,13,31 have blocked requests > 524.288 sec
    osds 1,8,16,19,23,25,26,33,37,38 have blocked requests > 1048.58 sec
    osds 3,4,5,6,10,14,17,22,27,30,32,35,36 have blocked requests > 2097.15
sec
REQUEST_STUCK 551 stuck requests are blocked > 4096 sec
    551 ops are blocked > 4194.3 sec
    osds 0,28 have stuck requests > 4194.3 sec
TOO_MANY_PGS too many PGs per OSD (2709 > max 200)
[root@fre101 ~]# ceph -s
2019-01-04 05:39:29.364100 7f0fb32f2700 -1 asok(0x7f0fac0017a0)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph-guests/ceph-client.admin.1066635.139705286924624.asok': (2)
No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            473825/12392654 objects misplaced (3.823%)
            3723 PGs pending on creation
            Reduced data availability: 6677 pgs inactive, 1948 pgs down,
157 pgs peering, 850 pgs stale
            Degraded data redundancy: 306567/12392654 objects degraded
(2.474%), 949 pgs degraded, 16 pgs undersized
            98047 slow requests are blocked > 32 sec
            33 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2690 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 39 up, 39 in; 76 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6051k objects, 10944 GB
    usage:   21934 GB used, 50687 GB / 72622 GB avail
    pgs:     13.267% pgs not active
             306567/12392654 objects degraded (2.474%)
             473825/12392654 objects misplaced (3.823%)
             44937 active+clean
             3850  activating
             1936  active+undersized
             1078  down
             864   stale+down
             597   peering
             591   activating+degraded
             316   active+undersized+degraded
             205   stale+active+clean
             133   stale+activating
             67    activating+remapped
             32    stale+activating+degraded
             21    stale+activating+remapped
             9     stale+active+undersized
             6     down+remapped
             5     stale+activating+undersized+degraded+remapped
             2     activating+degraded+remapped
             1     stale+activating+degraded+remapped
             1     stale+active+undersized+degraded
             1     remapped+peering
             1     active+clean+remapped
             1     stale+remapped+peering
             1     stale+peering
             1     activating+undersized+degraded+remapped

  io:
    client:   0 B/s rd, 23566 B/s wr, 0 op/s rd, 3 op/s wr
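
To put the "too many PGs per OSD" warning in context, here is a small Python
sketch of the limit arithmetic from Caspar's suggestion quoted below (mon max
pg per osd multiplied by osd max pg per osd hard ratio). The helper names are
hypothetical, and the observed value is the one from the health detail output
above. It only illustrates the multiplication, not how Ceph enforces the
limit internally.

```python
def hard_pg_limit(mon_max_pg_per_osd, hard_ratio):
    """PGs per OSD allowed before new PG creation is disabled (per the quoted advice)."""
    return mon_max_pg_per_osd * hard_ratio


# From "too many PGs per OSD (2709 > max 200)" in the health detail output.
OBSERVED_PGS_PER_OSD = 2709

# (mon max pg per osd, osd max pg per osd hard ratio), as quoted in this thread.
SETTINGS = {
    "Luminous defaults": (200, 2),
    "suggested temporary values": (1000, 5),
}


def report(observed, settings):
    """Build one summary line per settings pair."""
    lines = []
    for label, (mon_max, ratio) in settings.items():
        limit = hard_pg_limit(mon_max, ratio)
        verdict = "over" if observed > limit else "within"
        lines.append(f"{label}: limit {limit}, observed {observed} is {verdict} the limit")
    return lines


for line in report(OBSERVED_PGS_PER_OSD, SETTINGS):
    print(line)
```

With the defaults the observed count is far above the hard limit, while the
suggested temporary values would leave headroom.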

Thanks

Arun

On Fri, Jan 4, 2019 at 5:38 AM Caspar Smit <caspars...@supernas.eu> wrote:

> Are the numbers still decreasing?
>
> This one for instance:
>
> "3883 PGs pending on creation"
>
> Caspar
>
>
> On Fri, 4 Jan 2019 at 14:23, Arun POONIA <
> arun.poo...@nuagenetworks.net> wrote:
>
>> Hi Caspar,
>>
>> Yes, the cluster was working fine with the PGs-per-OSD warning up until
>> now. I am not sure how to recover from stale, down, or inactive PGs. If
>> you know how, can you let me know?
>>
>> Current State:
>>
>> [root@fre101 ~]# ceph -s
>> 2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f31480017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2)
>> No such file or directory
>>   cluster:
>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>>     health: HEALTH_ERR
>>             3 pools have many more objects per pg than average
>>             505714/12392650 objects misplaced (4.081%)
>>             3883 PGs pending on creation
>>             Reduced data availability: 6519 pgs inactive, 1870 pgs down,
>> 1 pg peering, 886 pgs stale
>>             Degraded data redundancy: 42987/12392650 objects degraded
>> (0.347%), 634 pgs degraded, 16 pgs undersized
>>             125827 slow requests are blocked > 32 sec
>>             2 stuck requests are blocked > 4096 sec
>>             too many PGs per OSD (2758 > max 200)
>>
>>   services:
>>     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>>     mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
>>     osd: 39 osds: 39 up, 39 in; 76 remapped pgs
>>     rgw: 1 daemon active
>>
>>   data:
>>     pools:   18 pools, 54656 pgs
>>     objects: 6051k objects, 10944 GB
>>     usage:   21933 GB used, 50688 GB / 72622 GB avail
>>     pgs:     11.927% pgs not active
>>              42987/12392650 objects degraded (0.347%)
>>              505714/12392650 objects misplaced (4.081%)
>>              48080 active+clean
>>              3885  activating
>>              1111  down
>>              759   stale+down
>>              614   activating+degraded
>>              74    activating+remapped
>>              46    stale+active+clean
>>              35    stale+activating
>>              21    stale+activating+remapped
>>              9     stale+active+undersized
>>              9     stale+activating+degraded
>>              5     stale+activating+undersized+degraded+remapped
>>              3     activating+degraded+remapped
>>              1     stale+activating+degraded+remapped
>>              1     stale+active+undersized+degraded
>>              1     remapped+peering
>>              1     active+clean+remapped
>>              1     activating+undersized+degraded+remapped
>>
>>   io:
>>     client:   0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr
>>
>> I will fix the number of PGs per OSD once these inactive or stale PGs come
>> online. I am not able to access anything (VMs, images) that uses Ceph.
>>
>> Thanks
>> Arun
>>
>> On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit <caspars...@supernas.eu>
>> wrote:
>>
>>> Hi Arun,
>>>
>>> How did you end up with a 'working' cluster with so many pgs per OSD?
>>>
>>> "too many PGs per OSD (2968 > max 200)"
>>>
>>> To (temporarily) allow this kind of pgs per osd you could try this:
>>>
>>> Change these values in the global section in your ceph.conf:
>>>
>>> mon max pg per osd = 200
>>> osd max pg per osd hard ratio = 2
>>>
>>> This allows 200*2 = 400 PGs per OSD before the creation of new PGs is
>>> disabled.
>>>
>>> Above are the defaults (for Luminous, maybe other versions too)
>>> You can check your current settings with:
>>>
>>> ceph daemon mon.ceph-mon01 config show |grep pg_per_osd
>>>
>>> Since your current PGs-per-OSD ratio is way higher than the default, you
>>> could set them to, for instance:
>>>
>>> mon max pg per osd = 1000
>>> osd max pg per osd hard ratio = 5
>>>
>>> which allows 1000*5 = 5000 PGs per OSD before the creation of new PGs is
>>> disabled.
>>>
>>> You'll need to inject the setting into the mons/osds and restart mgrs to
>>> make them active.
>>>
>>> ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
>>> ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
>>> ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
>>> ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
>>> restart mgrs
>>>
>>> Kind regards,
>>> Caspar
>>>
>>>
>>> On Fri, 4 Jan 2019 at 04:28, Arun POONIA <
>>> arun.poo...@nuagenetworks.net> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Indeed that's what happened. I didn't set the noout flag either, and I
>>>> zapped the disks on the new server every time. In my cluster status,
>>>> fre201 is the only new server.
>>>>
>>>> Current status after enabling the 3 OSDs on host fre201:
>>>>
>>>> [root@fre201 ~]# ceph osd tree
>>>> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
>>>>  -1       70.92137 root default
>>>>  -2        5.45549     host fre101
>>>>   0   hdd  1.81850         osd.0       up  1.00000 1.00000
>>>>   1   hdd  1.81850         osd.1       up  1.00000 1.00000
>>>>   2   hdd  1.81850         osd.2       up  1.00000 1.00000
>>>>  -9        5.45549     host fre103
>>>>   3   hdd  1.81850         osd.3       up  1.00000 1.00000
>>>>   4   hdd  1.81850         osd.4       up  1.00000 1.00000
>>>>   5   hdd  1.81850         osd.5       up  1.00000 1.00000
>>>>  -3        5.45549     host fre105
>>>>   6   hdd  1.81850         osd.6       up  1.00000 1.00000
>>>>   7   hdd  1.81850         osd.7       up  1.00000 1.00000
>>>>   8   hdd  1.81850         osd.8       up  1.00000 1.00000
>>>>  -4        5.45549     host fre107
>>>>   9   hdd  1.81850         osd.9       up  1.00000 1.00000
>>>>  10   hdd  1.81850         osd.10      up  1.00000 1.00000
>>>>  11   hdd  1.81850         osd.11      up  1.00000 1.00000
>>>>  -5        5.45549     host fre109
>>>>  12   hdd  1.81850         osd.12      up  1.00000 1.00000
>>>>  13   hdd  1.81850         osd.13      up  1.00000 1.00000
>>>>  14   hdd  1.81850         osd.14      up  1.00000 1.00000
>>>>  -6        5.45549     host fre111
>>>>  15   hdd  1.81850         osd.15      up  1.00000 1.00000
>>>>  16   hdd  1.81850         osd.16      up  1.00000 1.00000
>>>>  17   hdd  1.81850         osd.17      up  0.79999 1.00000
>>>>  -7        5.45549     host fre113
>>>>  18   hdd  1.81850         osd.18      up  1.00000 1.00000
>>>>  19   hdd  1.81850         osd.19      up  1.00000 1.00000
>>>>  20   hdd  1.81850         osd.20      up  1.00000 1.00000
>>>>  -8        5.45549     host fre115
>>>>  21   hdd  1.81850         osd.21      up  1.00000 1.00000
>>>>  22   hdd  1.81850         osd.22      up  1.00000 1.00000
>>>>  23   hdd  1.81850         osd.23      up  1.00000 1.00000
>>>> -10        5.45549     host fre117
>>>>  24   hdd  1.81850         osd.24      up  1.00000 1.00000
>>>>  25   hdd  1.81850         osd.25      up  1.00000 1.00000
>>>>  26   hdd  1.81850         osd.26      up  1.00000 1.00000
>>>> -11        5.45549     host fre119
>>>>  27   hdd  1.81850         osd.27      up  1.00000 1.00000
>>>>  28   hdd  1.81850         osd.28      up  1.00000 1.00000
>>>>  29   hdd  1.81850         osd.29      up  1.00000 1.00000
>>>> -12        5.45549     host fre121
>>>>  30   hdd  1.81850         osd.30      up  1.00000 1.00000
>>>>  31   hdd  1.81850         osd.31      up  1.00000 1.00000
>>>>  32   hdd  1.81850         osd.32      up  1.00000 1.00000
>>>> -13        5.45549     host fre123
>>>>  33   hdd  1.81850         osd.33      up  1.00000 1.00000
>>>>  34   hdd  1.81850         osd.34      up  1.00000 1.00000
>>>>  35   hdd  1.81850         osd.35      up  1.00000 1.00000
>>>> -27        5.45549     host fre201
>>>>  36   hdd  1.81850         osd.36      up  1.00000 1.00000
>>>>  37   hdd  1.81850         osd.37      up  1.00000 1.00000
>>>>  38   hdd  1.81850         osd.38      up  1.00000 1.00000
>>>> [root@fre201 ~]# ceph -s
>>>>   cluster:
>>>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>>>>     health: HEALTH_ERR
>>>>             3 pools have many more objects per pg than average
>>>>             585791/12391450 objects misplaced (4.727%)
>>>>             2 scrub errors
>>>>             2374 PGs pending on creation
>>>>             Reduced data availability: 6578 pgs inactive, 2025 pgs
>>>> down, 74 pgs peering, 1234 pgs stale
>>>>             Possible data damage: 2 pgs inconsistent
>>>>             Degraded data redundancy: 64969/12391450 objects degraded
>>>> (0.524%), 616 pgs degraded, 20 pgs undersized
>>>>             96242 slow requests are blocked > 32 sec
>>>>             228 stuck requests are blocked > 4096 sec
>>>>             too many PGs per OSD (2768 > max 200)
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>>>>     mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
>>>>     osd: 39 osds: 39 up, 39 in; 96 remapped pgs
>>>>     rgw: 1 daemon active
>>>>
>>>>   data:
>>>>     pools:   18 pools, 54656 pgs
>>>>     objects: 6050k objects, 10942 GB
>>>>     usage:   21900 GB used, 50721 GB / 72622 GB avail
>>>>     pgs:     0.002% pgs unknown
>>>>              12.050% pgs not active
>>>>              64969/12391450 objects degraded (0.524%)
>>>>              585791/12391450 objects misplaced (4.727%)
>>>>              47489 active+clean
>>>>              3670  activating
>>>>              1098  stale+down
>>>>              923   down
>>>>              575   activating+degraded
>>>>              563   stale+active+clean
>>>>              105   stale+activating
>>>>              78    activating+remapped
>>>>              72    peering
>>>>              25    stale+activating+degraded
>>>>              23    stale+activating+remapped
>>>>              9     stale+active+undersized
>>>>              6     stale+activating+undersized+degraded+remapped
>>>>              5     stale+active+undersized+degraded
>>>>              4     down+remapped
>>>>              4     activating+degraded+remapped
>>>>              2     active+clean+inconsistent
>>>>              1     stale+activating+degraded+remapped
>>>>              1     stale+active+clean+remapped
>>>>              1     stale+remapped+peering
>>>>              1     remapped+peering
>>>>              1     unknown
>>>>
>>>>   io:
>>>>     client:   0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr
>>>>
>>>>
>>>>
>>>> Thanks
>>>> Arun
>>>>
>>>>
>>>> On Thu, Jan 3, 2019 at 7:19 PM Chris <bitskr...@bitskrieg.net> wrote:
>>>>
>>>>> If you added OSDs and then deleted them repeatedly without waiting for
>>>>> replication to finish as the cluster attempted to re-balance across them,
>>>>> it's highly likely that you are permanently missing PGs (especially if the
>>>>> disks were zapped each time).
>>>>>
>>>>> If those 3 down OSDs can be revived there is a (small) chance that you
>>>>> can right the ship, but 1400pg/OSD is pretty extreme.  I'm surprised
>>>>> the cluster even let you do that - this sounds like a data loss event.
>>>>>
>>>>> Bring back the 3 OSDs and see what those 2 inconsistent PGs look like
>>>>> with ceph pg query.
>>>>>
>>>>> On January 3, 2019 21:59:38 Arun POONIA <arun.poo...@nuagenetworks.net>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently I tried adding a new OSD node to the Ceph cluster using the
>>>>>> ceph-deploy tool. I was experimenting with the tool and ended up
>>>>>> deleting the OSDs on the new server a couple of times.
>>>>>>
>>>>>> Now that the Ceph OSDs are running on the new server, 10-15% of the
>>>>>> cluster's PGs are inactive, and they are not recovering or
>>>>>> rebalancing. I am not sure what to do. I tried shutting down the OSDs
>>>>>> on the new server.
>>>>>>
>>>>>> Status:
>>>>>> [root@fre105 ~]# ceph -s
>>>>>> 2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0)
>>>>>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>>>>>> bind the UNIX domain socket to
>>>>>> '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2)
>>>>>> No such file or directory
>>>>>>   cluster:
>>>>>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>>>>>>     health: HEALTH_ERR
>>>>>>             3 pools have many more objects per pg than average
>>>>>>             373907/12391198 objects misplaced (3.018%)
>>>>>>             2 scrub errors
>>>>>>             9677 PGs pending on creation
>>>>>>             Reduced data availability: 7145 pgs inactive, 6228 pgs
>>>>>> down, 1 pg peering, 2717 pgs stale
>>>>>>             Possible data damage: 2 pgs inconsistent
>>>>>>             Degraded data redundancy: 178350/12391198 objects
>>>>>> degraded (1.439%), 346 pgs degraded, 1297 pgs undersized
>>>>>>             52486 slow requests are blocked > 32 sec
>>>>>>             9287 stuck requests are blocked > 4096 sec
>>>>>>             too many PGs per OSD (2968 > max 200)
>>>>>>
>>>>>>   services:
>>>>>>     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>>>>>>     mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
>>>>>>     osd: 39 osds: 36 up, 36 in; 51 remapped pgs
>>>>>>     rgw: 1 daemon active
>>>>>>
>>>>>>   data:
>>>>>>     pools:   18 pools, 54656 pgs
>>>>>>     objects: 6050k objects, 10941 GB
>>>>>>     usage:   21727 GB used, 45308 GB / 67035 GB avail
>>>>>>     pgs:     13.073% pgs not active
>>>>>>              178350/12391198 objects degraded (1.439%)
>>>>>>              373907/12391198 objects misplaced (3.018%)
>>>>>>              46177 active+clean
>>>>>>              5054  down
>>>>>>              1173  stale+down
>>>>>>              1084  stale+active+undersized
>>>>>>              547   activating
>>>>>>              201   stale+active+undersized+degraded
>>>>>>              158   stale+activating
>>>>>>              96    activating+degraded
>>>>>>              46    stale+active+clean
>>>>>>              42    activating+remapped
>>>>>>              34    stale+activating+degraded
>>>>>>              23    stale+activating+remapped
>>>>>>              6     stale+activating+undersized+degraded+remapped
>>>>>>              6     activating+undersized+degraded+remapped
>>>>>>              2     activating+degraded+remapped
>>>>>>              2     active+clean+inconsistent
>>>>>>              1     stale+activating+degraded+remapped
>>>>>>              1     stale+active+clean+remapped
>>>>>>              1     stale+remapped
>>>>>>              1     down+remapped
>>>>>>              1     remapped+peering
>>>>>>
>>>>>>   io:
>>>>>>     client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr
>>>>>>
>>>>>> Thanks
>>>>>> --
>>>>>> Arun Poonia
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Arun Poonia
>>>>
>>>>
>>>
>>
>>
>> --
>> Arun Poonia
>>
>


-- 
Arun Poonia

