Arun,

This is what I already suggested in my first reply.

Kind regards,
Caspar

On Sat, Jan 5, 2019 at 06:52, Arun POONIA <
arun.poo...@nuagenetworks.net> wrote:

> Hi Kevin,
>
> You are right. Increasing the allowed number of PGs per OSD (the
> mon_max_pg_per_osd limit) resolved the issue. I will probably add this config
> to the /etc/ceph/ceph.conf file on the Ceph mons and OSDs so it applies at host boot.
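>
> For reference, a rough sketch of what that override could look like in the
> [global] section of ceph.conf (option names are the ones Caspar mentioned
> earlier in this thread; the values below are only an illustration, not a
> recommendation):
>
> [global]
> mon max pg per osd = 1000
> osd max pg per osd hard ratio = 5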
>
> Thanks
> Arun
>
> On Fri, Jan 4, 2019 at 3:46 PM Kevin Olbrich <k...@sv01.de> wrote:
>
>> Hi Arun,
>>
>> actually, deleting was not a good idea; that's why I wrote that the OSDs
>> should be "out".
>> You have down PGs because the data is on OSDs that are
>> unavailable but still known by the cluster.
>> This can be checked by using "ceph pg 0.5 query" (change PG name).
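>>
>> For example (11.910 is just one of the down PGs from your output, pick any):
>>
>> ceph pg 11.910 query
>>
>> In the output, look at the up/acting sets and the "recovery_state" section;
>> OSD IDs listed there that no longer exist in the cluster are a bad sign.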
>>
>> Because your PG count is heavily oversized, the PG overdose limits get
>> hit on every recovery in your cluster.
>> I had the same problem on a medium-sized cluster when I added too many new
>> disks at once.
>> You already got this info from Caspar earlier in this thread.
>>
>>
>> https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/
>>
>> https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
>>
>> The second link shows one of the config params you need to inject to
>> all your OSDs like this:
>> ceph tell osd.* injectargs --mon_max_pg_per_osd 10000
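>>
>> Afterwards you can check that the value actually arrived, for example on
>> one OSD (run on the host where that OSD lives; osd.0 is just an example):
>> ceph daemon osd.0 config show | grep mon_max_pg_per_osd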
>>
>> This might help you get these PGs into some sort of "active" state
>> (+recovery/+degraded/+inconsistent/etc.).
>>
>> The down PGs will most likely never come back. I would bet you will
>> find invalid OSD IDs in the acting set, meaning that
>> non-existent OSDs hold your data.
>> I had a similar problem on a test cluster with erasure-coded pools
>> where too many disks failed at the same time; in that case you will see
>> negative values as OSD IDs.
>>
>> Maybe this helps a little bit.
>>
>> Kevin
>>
>> On Sat, Jan 5, 2019 at 00:20, Arun POONIA
>> <arun.poo...@nuagenetworks.net> wrote:
>> >
>> > Hi Kevin,
>> >
>> > I tried deleting the newly added server from the Ceph cluster and it looks like
>> Ceph is not recovering. I agree the data may be unfound, but the status doesn't
>> report any unfound data. It reports PGs as inactive/down and I can't bring them up.
>> >
>> >
>> > [root@fre101 ~]# ceph health detail
>> > 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2)
>> No such file or directory
>> > HEALTH_ERR 3 pools have many more objects per pg than average;
>> 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation;
>> Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering,
>> 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded
>> (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are
>> blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs
>> per OSD (3003 > max 200)
>> > MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>> >     pool glance-images objects per pg (10478) is more than 92.7257
>> times cluster average (113)
>> >     pool vms objects per pg (4722) is more than 41.7876 times cluster
>> average (113)
>> >     pool volumes objects per pg (1220) is more than 10.7965 times
>> cluster average (113)
>> > OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%)
>> > PENDING_CREATING_PGS 6517 PGs pending on creation
>> >     osds
>> [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9]
>> have pending PGs.
>> > PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs
>> down, 2 pgs peering, 2703 pgs stale
>> >     pg 10.90e is stuck inactive for 94928.999109, current state
>> activating, last acting [2,6]
>> >     pg 10.913 is stuck inactive for 95094.175400, current state
>> activating, last acting [9,5]
>> >     pg 10.915 is stuck inactive for 94929.184177, current state
>> activating, last acting [30,26]
>> >     pg 11.907 is stuck stale for 9612.906582, current state
>> stale+active+clean, last acting [38,24]
>> >     pg 11.910 is stuck stale for 11822.359237, current state
>> stale+down, last acting [21]
>> >     pg 11.915 is stuck stale for 9612.906604, current state
>> stale+active+clean, last acting [38,31]
>> >     pg 11.919 is stuck inactive for 95636.716568, current state
>> activating, last acting [25,12]
>> >     pg 12.902 is stuck stale for 10810.497213, current state
>> stale+activating, last acting [36,14]
>> >     pg 13.901 is stuck stale for 94889.512234, current state
>> stale+active+clean, last acting [1,31]
>> >     pg 13.904 is stuck stale for 10745.279158, current state
>> stale+active+clean, last acting [37,8]
>> >     pg 13.908 is stuck stale for 10745.279176, current state
>> stale+active+clean, last acting [37,19]
>> >     pg 13.909 is stuck inactive for 95370.129659, current state
>> activating, last acting [34,19]
>> >     pg 13.90e is stuck inactive for 95370.379694, current state
>> activating, last acting [21,20]
>> >     pg 13.911 is stuck inactive for 98449.317873, current state
>> activating, last acting [25,22]
>> >     pg 13.914 is stuck stale for 11827.503651, current state
>> stale+down, last acting [29]
>> >     pg 13.917 is stuck inactive for 94564.811121, current state
>> activating, last acting [16,12]
>> >     pg 14.901 is stuck inactive for 94929.006707, current state
>> activating+degraded, last acting [22,8]
>> >     pg 14.910 is stuck inactive for 94929.046256, current state
>> activating+degraded, last acting [17,2]
>> >     pg 14.912 is stuck inactive for 10831.758524, current state
>> activating, last acting [18,2]
>> >     pg 14.915 is stuck inactive for 94929.001390, current state
>> activating, last acting [34,23]
>> >     pg 15.90c is stuck inactive for 93957.371333, current state
>> activating, last acting [29,10]
>> >     pg 15.90d is stuck inactive for 94929.145438, current state
>> activating, last acting [5,31]
>> >     pg 15.913 is stuck stale for 10745.279197, current state
>> stale+active+clean, last acting [37,12]
>> >     pg 15.915 is stuck stale for 12343.606595, current state
>> stale+down, last acting [0]
>> >     pg 15.91c is stuck stale for 10650.058945, current state
>> stale+down, last acting [12]
>> >     pg 16.90e is stuck inactive for 94929.240626, current state
>> activating, last acting [14,2]
>> >     pg 16.919 is stuck inactive for 94564.771129, current state
>> activating, last acting [20,4]
>> >     pg 16.91e is stuck inactive for 94960.007104, current state
>> activating, last acting [22,12]
>> >     pg 17.908 is stuck inactive for 12250.346380, current state
>> activating, last acting [27,18]
>> >     pg 17.90b is stuck inactive for 11714.951268, current state
>> activating, last acting [12,25]
>> >     pg 17.910 is stuck inactive for 94564.819149, current state
>> activating, last acting [26,16]
>> >     pg 17.913 is stuck inactive for 95370.177309, current state
>> activating, last acting [13,31]
>> >     pg 17.91f is stuck inactive for 95147.032346, current state
>> activating, last acting [6,18]
>> >     pg 18.908 is stuck inactive for 95370.185260, current state
>> activating, last acting [10,2]
>> >     pg 18.911 is stuck inactive for 95379.637224, current state
>> activating, last acting [34,9]
>> >     pg 18.91e is stuck inactive for 95370.148283, current state
>> activating, last acting [0,34]
>> >     pg 19.90e is stuck inactive for 10229.611524, current state
>> activating, last acting [18,0]
>> >     pg 19.90f is stuck stale for 9612.906611, current state
>> stale+active+clean, last acting [38,18]
>> >     pg 19.912 is stuck stale for 10745.279169, current state
>> stale+active+clean, last acting [37,29]
>> >     pg 19.915 is stuck stale for 10810.497226, current state
>> stale+active+clean, last acting [36,13]
>> >     pg 20.90f is stuck stale for 10810.497234, current state
>> stale+active+clean, last acting [36,26]
>> >     pg 20.913 is stuck inactive for 94959.946347, current state
>> activating+degraded, last acting [11,0]
>> >     pg 20.91d is stuck inactive for 94959.860315, current state
>> activating+degraded, last acting [10,16]
>> >     pg 21.907 is stuck inactive for 94959.824457, current state
>> activating, last acting [20,0]
>> >     pg 21.90e is stuck inactive for 94929.024503, current state
>> activating, last acting [1,27]
>> >     pg 21.917 is stuck inactive for 94959.909019, current state
>> activating, last acting [15,2]
>> >     pg 21.918 is stuck inactive for 10655.096673, current state
>> activating, last acting [35,9]
>> >     pg 22.90b is stuck inactive for 95370.255015, current state
>> activating, last acting [20,26]
>> >     pg 22.90c is stuck inactive for 94564.757145, current state
>> activating, last acting [20,14]
>> >     pg 22.90f is stuck stale for 9612.906599, current state
>> stale+activating, last acting [38,35]
>> >     pg 22.912 is stuck inactive for 11370.195675, current state
>> activating, last acting [30,15]
>> > PG_DEGRADED Degraded data redundancy: 86858/12393978 objects degraded
>> (0.701%), 717 pgs degraded, 21 pgs undersized
>> >     pg 14.804 is activating+degraded, acting [6,30]
>> >     pg 14.834 is activating+degraded, acting [15,8]
>> >     pg 14.843 is activating+degraded, acting [7,25]
>> >     pg 14.85f is activating+degraded, acting [25,11]
>> >     pg 14.865 is activating+degraded, acting [33,25]
>> >     pg 14.87a is activating+degraded, acting [28,6]
>> >     pg 14.882 is activating+degraded, acting [4,21]
>> >     pg 14.893 is activating+degraded, acting [24,17]
>> >     pg 14.89c is activating+degraded, acting [14,21]
>> >     pg 14.89e is activating+degraded, acting [15,28]
>> >     pg 14.8ad is activating+degraded, acting [30,3]
>> >     pg 14.8b1 is activating+degraded, acting [30,2]
>> >     pg 14.8b4 is activating+degraded, acting [11,18]
>> >     pg 14.8b7 is activating+degraded, acting [7,16]
>> >     pg 14.8e2 is activating+degraded, acting [20,30]
>> >     pg 14.8ec is activating+degraded, acting [25,21]
>> >     pg 14.8ef is activating+degraded, acting [9,31]
>> >     pg 14.8f9 is activating+degraded, acting [27,21]
>> >     pg 14.901 is activating+degraded, acting [22,8]
>> >     pg 14.910 is activating+degraded, acting [17,2]
>> >     pg 20.808 is activating+degraded, acting [20,12]
>> >     pg 20.825 is activating+degraded, acting [25,35]
>> >     pg 20.827 is activating+degraded, acting [23,16]
>> >     pg 20.829 is activating+degraded, acting [20,31]
>> >     pg 20.837 is activating+degraded, acting [31,6]
>> >     pg 20.83c is activating+degraded, acting [26,17]
>> >     pg 20.85e is activating+degraded, acting [4,27]
>> >     pg 20.85f is activating+degraded, acting [1,25]
>> >     pg 20.865 is activating+degraded, acting [8,33]
>> >     pg 20.88b is activating+degraded, acting [6,32]
>> >     pg 20.895 is stale+activating+degraded, acting [37,27]
>> >     pg 20.89c is activating+degraded, acting [1,24]
>> >     pg 20.8a3 is activating+degraded, acting [30,1]
>> >     pg 20.8ad is activating+degraded, acting [1,20]
>> >     pg 20.8af is activating+degraded, acting [33,31]
>> >     pg 20.8b4 is activating+degraded, acting [9,1]
>> >     pg 20.8b7 is activating+degraded, acting [0,33]
>> >     pg 20.8b9 is activating+degraded, acting [20,24]
>> >     pg 20.8c5 is activating+degraded, acting [27,14]
>> >     pg 20.8d1 is activating+degraded, acting [10,7]
>> >     pg 20.8d4 is activating+degraded, acting [28,21]
>> >     pg 20.8d5 is activating+degraded, acting [24,15]
>> >     pg 20.8e0 is activating+degraded, acting [18,0]
>> >     pg 20.8e2 is activating+degraded, acting [25,7]
>> >     pg 20.8ea is activating+degraded, acting [17,21]
>> >     pg 20.8f1 is activating+degraded, acting [15,11]
>> >     pg 20.8fb is activating+degraded, acting [10,24]
>> >     pg 20.8fc is activating+degraded, acting [20,15]
>> >     pg 20.8ff is activating+degraded, acting [18,25]
>> >     pg 20.913 is activating+degraded, acting [11,0]
>> >     pg 20.91d is activating+degraded, acting [10,16]
>> > REQUEST_SLOW 99059 slow requests are blocked > 32 sec
>> >     24235 ops are blocked > 2097.15 sec
>> >     17029 ops are blocked > 1048.58 sec
>> >     54122 ops are blocked > 524.288 sec
>> >     2311 ops are blocked > 262.144 sec
>> >     767 ops are blocked > 131.072 sec
>> >     396 ops are blocked > 65.536 sec
>> >     199 ops are blocked > 32.768 sec
>> >     osd.32 has blocked requests > 262.144 sec
>> >     osds 5,8,12,26,28 have blocked requests > 524.288 sec
>> >     osds 1,3,9,10 have blocked requests > 1048.58 sec
>> >     osds 2,14,18,19,20,23,24,25,27,29,30,31,33,34,35 have blocked
>> requests > 2097.15 sec
>> > REQUEST_STUCK 4834 stuck requests are blocked > 4096 sec
>> >     4834 ops are blocked > 4194.3 sec
>> >     osds 0,4,11,13,17,21,22 have stuck requests > 4194.3 sec
>> > TOO_MANY_PGS too many PGs per OSD (3003 > max 200)
>> > [root@fre101 ~]#
>> >
>> > [root@fre101 ~]# ceph -s
>> > 2019-01-04 15:18:53.398950 7fc372c94700 -1 asok(0x7fc36c0017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.130425.140477307296080.asok': (2)
>> No such file or directory
>> >   cluster:
>> >     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>> >     health: HEALTH_ERR
>> >             3 pools have many more objects per pg than average
>> >             523656/12393978 objects misplaced (4.225%)
>> >             6523 PGs pending on creation
>> >             Reduced data availability: 6584 pgs inactive, 1267 pgs
>> down, 2 pgs peering, 2696 pgs stale
>> >             Degraded data redundancy: 86858/12393978 objects degraded
>> (0.701%), 717 pgs degraded, 21 pgs undersized
>> >             107622 slow requests are blocked > 32 sec
>> >             4957 stuck requests are blocked > 4096 sec
>> >             too many PGs per OSD (3003 > max 200)
>> >
>> >   services:
>> >     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>> >     mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
>> >     osd: 39 osds: 39 up, 36 in; 85 remapped pgs
>> >     rgw: 1 daemon active
>> >
>> >   data:
>> >     pools:   18 pools, 54656 pgs
>> >     objects: 6051k objects, 10947 GB
>> >     usage:   21971 GB used, 50650 GB / 72622 GB avail
>> >     pgs:     0.002% pgs unknown
>> >              12.046% pgs not active
>> >              86858/12393978 objects degraded (0.701%)
>> >              523656/12393978 objects misplaced (4.225%)
>> >              46743 active+clean
>> >              4342  activating
>> >              1317  stale+active+clean
>> >              1151  stale+down
>> >              667   activating+degraded
>> >              159   stale+activating
>> >              116   down
>> >              77    activating+remapped
>> >              34    stale+activating+degraded
>> >              21    stale+activating+remapped
>> >              9     stale+active+undersized
>> >              6     stale+activating+undersized+degraded+remapped
>> >              5     activating+undersized+degraded+remapped
>> >              3     activating+degraded+remapped
>> >              2     stale+remapped+peering
>> >              1     stale+activating+degraded+remapped
>> >              1     stale+active+undersized+degraded
>> >              1     stale+active+clean+remapped
>> >              1     unknown
>> >
>> >   io:
>> >     client:   0 B/s rd, 33213 B/s wr, 5 op/s rd, 5 op/s wr
>> >     recovery: 437 kB/s, 0 objects/s
>> >
>> >
>> > Are there any other suggestions besides force-deleting the PGs (6000+ of them)?
>> >
>> > Thanks
>> > Arun
>> >
>> > On Fri, Jan 4, 2019 at 11:55 AM Kevin Olbrich <k...@sv01.de> wrote:
>> >>
>> >> I don't think this will help you. Unfound means the cluster is unable
>> >> to find the data anywhere (it's lost).
>> >> It would be sufficient to shut down the new host - the OSDs will then
>> be out.
>> >>
>> >> You can also force-heal the cluster, something along the lines of
>> "do the best you can":
>> >>
>> >> ceph pg 2.5 mark_unfound_lost revert|delete
>> >>
>> >> Src:
>> http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/
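>> >>
>> >> Before reverting or deleting, the troubleshooting page above also covers
>> >> listing the objects a PG considers missing, for example (2.5 is again just
>> >> a placeholder PG id):
>> >>
>> >> ceph pg 2.5 list_missing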
>> >>
>> >> Kevin
>> >>
>> >> On Fri, Jan 4, 2019 at 20:47, Arun POONIA
>> >> <arun.poo...@nuagenetworks.net> wrote:
>> >> >
>> >> > Hi Kevin,
>> >> >
>> >> > Can I remove the newly added server from the cluster and see if it heals
>> the cluster?
>> >> >
>> >> > When I check the hard disk IOPS on the new server, they are very low
>> compared to the existing cluster servers.
>> >> >
>> >> > Indeed this is a critical cluster, but I don't have the expertise to make
>> it flawless.
>> >> >
>> >> > Thanks
>> >> > Arun
>> >> >
>> >> > On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich <k...@sv01.de> wrote:
>> >> >>
>> >> >> If you really created and destroyed OSDs before the cluster healed
>> >> >> itself, this data will be permanently lost (not found / inactive).
>> >> >> Also, your PG count is so heavily oversized that the calculation for peering
>> >> >> will most likely break, because this was never tested.
>> >> >>
>> >> >> If this is a critical cluster, I would start a new one and bring
>> back
>> >> >> the backups (using a better PG count).
>> >> >>
>> >> >> Kevin
>> >> >>
>> >> >> On Fri, Jan 4, 2019 at 20:25, Arun POONIA
>> >> >> <arun.poo...@nuagenetworks.net> wrote:
>> >> >> >
>> >> >> > Can anyone comment on this issue, please? I can't seem to bring my
>> cluster back to a healthy state.
>> >> >> >
>> >> >> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA <
>> arun.poo...@nuagenetworks.net> wrote:
>> >> >> >>
>> >> >> >> Hi Caspar,
>> >> >> >>
>> >> >> >> The number of IOPS is also quite low. It used to be around 1K+ on
>> one of the pools (vms); now it is close to 10-30.
>> >> >> >>
>> >> >> >> Thanks
>> >> >> >> Arun
>> >> >> >>
>> >> >> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA <
>> arun.poo...@nuagenetworks.net> wrote:
>> >> >> >>>
>> >> >> >>> Hi Caspar,
>> >> >> >>>
>> >> >> >>> Yes and no, the numbers are going up and down. If I run the ceph -s
>> command I can see them decrease one moment and increase again later. I see
>> there are many blocked/slow requests; almost all the OSDs have slow
>> requests. Around 12% of the PGs are inactive and I am not sure how to activate them again.
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> [root@fre101 ~]# ceph health detail
>> >> >> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2)
>> No such file or directory
>> >> >> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg
>> than average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending
>> on creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down,
>> 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654
>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow
>> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec;
>> too many PGs per OSD (2709 > max 200)
>> >> >> >>> OSD_DOWN 1 osds down
>> >> >> >>>     osd.28 (root=default,host=fre119) is down
>> >> >> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than
>> average
>> >> >> >>>     pool glance-images objects per pg (10478) is more than
>> 92.7257 times cluster average (113)
>> >> >> >>>     pool vms objects per pg (4717) is more than 41.7434 times
>> cluster average (113)
>> >> >> >>>     pool volumes objects per pg (1220) is more than 10.7965
>> times cluster average (113)
>> >> >> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
>> >> >> >>> PENDING_CREATING_PGS 3610 PGs pending on creation
>> >> >> >>>     osds
>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
>> have pending PGs.
>> >> >> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive,
>> 1882 pgs down, 86 pgs peering, 850 pgs stale
>> >> >> >>>     pg 10.900 is down, acting [18]
>> >> >> >>>     pg 10.90e is stuck inactive for 60266.030164, current state
>> activating, last acting [2,38]
>> >> >> >>>     pg 10.913 is stuck stale for 1887.552862, current state
>> stale+down, last acting [9]
>> >> >> >>>     pg 10.915 is stuck inactive for 60266.215231, current state
>> activating, last acting [30,38]
>> >> >> >>>     pg 11.903 is stuck inactive for 59294.465961, current state
>> activating, last acting [11,38]
>> >> >> >>>     pg 11.910 is down, acting [21]
>> >> >> >>>     pg 11.919 is down, acting [25]
>> >> >> >>>     pg 12.902 is stuck inactive for 57118.544590, current state
>> activating, last acting [36,14]
>> >> >> >>>     pg 13.8f8 is stuck inactive for 60707.167787, current state
>> activating, last acting [29,37]
>> >> >> >>>     pg 13.901 is stuck stale for 60226.543289, current state
>> stale+active+clean, last acting [1,31]
>> >> >> >>>     pg 13.905 is stuck inactive for 60266.050940, current state
>> activating, last acting [2,36]
>> >> >> >>>     pg 13.909 is stuck inactive for 60707.160714, current state
>> activating, last acting [34,36]
>> >> >> >>>     pg 13.90e is stuck inactive for 60707.410749, current state
>> activating, last acting [21,36]
>> >> >> >>>     pg 13.911 is down, acting [25]
>> >> >> >>>     pg 13.914 is stale+down, acting [29]
>> >> >> >>>     pg 13.917 is stuck stale for 580.224688, current state
>> stale+down, last acting [16]
>> >> >> >>>     pg 14.901 is stuck inactive for 60266.037762, current state
>> activating+degraded, last acting [22,37]
>> >> >> >>>     pg 14.90f is stuck inactive for 60296.996447, current state
>> activating, last acting [30,36]
>> >> >> >>>     pg 14.910 is stuck inactive for 60266.077310, current state
>> activating+degraded, last acting [17,37]
>> >> >> >>>     pg 14.915 is stuck inactive for 60266.032445, current state
>> activating, last acting [34,36]
>> >> >> >>>     pg 15.8fa is stuck stale for 560.223249, current state
>> stale+down, last acting [8]
>> >> >> >>>     pg 15.90c is stuck inactive for 59294.402388, current state
>> activating, last acting [29,38]
>> >> >> >>>     pg 15.90d is stuck inactive for 60266.176492, current state
>> activating, last acting [5,36]
>> >> >> >>>     pg 15.915 is down, acting [0]
>> >> >> >>>     pg 15.917 is stuck inactive for 56279.658951, current state
>> activating, last acting [13,38]
>> >> >> >>>     pg 15.91c is stuck stale for 374.590704, current state
>> stale+down, last acting [12]
>> >> >> >>>     pg 16.903 is stuck inactive for 56580.905961, current state
>> activating, last acting [25,37]
>> >> >> >>>     pg 16.90e is stuck inactive for 60266.271680, current state
>> activating, last acting [14,37]
>> >> >> >>>     pg 16.919 is stuck inactive for 59901.802184, current state
>> activating, last acting [20,37]
>> >> >> >>>     pg 16.91e is stuck inactive for 60297.038159, current state
>> activating, last acting [22,37]
>> >> >> >>>     pg 17.8e5 is stuck inactive for 60266.149061, current state
>> activating, last acting [25,36]
>> >> >> >>>     pg 17.910 is stuck inactive for 59901.850204, current state
>> activating, last acting [26,37]
>> >> >> >>>     pg 17.913 is stuck inactive for 60707.208364, current state
>> activating, last acting [13,36]
>> >> >> >>>     pg 17.91a is stuck inactive for 60266.187509, current state
>> activating, last acting [4,37]
>> >> >> >>>     pg 17.91f is down, acting [6]
>> >> >> >>>     pg 18.908 is stuck inactive for 60707.216314, current state
>> activating, last acting [10,36]
>> >> >> >>>     pg 18.911 is stuck stale for 244.570413, current state
>> stale+down, last acting [34]
>> >> >> >>>     pg 18.919 is stuck inactive for 60265.980816, current state
>> activating, last acting [28,36]
>> >> >> >>>     pg 18.91a is stuck inactive for 59901.814714, current state
>> activating, last acting [28,37]
>> >> >> >>>     pg 18.91e is stuck inactive for 60707.179338, current state
>> activating, last acting [0,36]
>> >> >> >>>     pg 19.90a is stuck inactive for 60203.089988, current state
>> activating, last acting [35,38]
>> >> >> >>>     pg 20.8e0 is stuck inactive for 60296.839098, current state
>> activating+degraded, last acting [18,37]
>> >> >> >>>     pg 20.913 is stuck inactive for 60296.977401, current state
>> activating+degraded, last acting [11,37]
>> >> >> >>>     pg 20.91d is stuck inactive for 60296.891370, current state
>> activating+degraded, last acting [10,38]
>> >> >> >>>     pg 21.8e1 is stuck inactive for 60707.422330, current state
>> activating, last acting [21,38]
>> >> >> >>>     pg 21.907 is stuck inactive for 60296.855511, current state
>> activating, last acting [20,36]
>> >> >> >>>     pg 21.90e is stuck inactive for 60266.055557, current state
>> activating, last acting [1,38]
>> >> >> >>>     pg 21.917 is stuck inactive for 60296.940074, current state
>> activating, last acting [15,36]
>> >> >> >>>     pg 22.90b is stuck inactive for 60707.286070, current state
>> activating, last acting [20,36]
>> >> >> >>>     pg 22.90c is stuck inactive for 59901.788199, current state
>> activating, last acting [20,37]
>> >> >> >>>     pg 22.90f is stuck inactive for 60297.062020, current state
>> activating, last acting [38,35]
>> >> >> >>> PG_DEGRADED Degraded data redundancy: 216694/12392654 objects
>> degraded (1.749%), 866 pgs degraded, 16 pgs undersized
>> >> >> >>>     pg 12.85a is active+undersized+degraded, acting [3]
>> >> >> >>>     pg 14.843 is activating+degraded, acting [7,38]
>> >> >> >>>     pg 14.85f is activating+degraded, acting [25,36]
>> >> >> >>>     pg 14.865 is activating+degraded, acting [33,37]
>> >> >> >>>     pg 14.87a is activating+degraded, acting [28,36]
>> >> >> >>>     pg 14.87e is activating+degraded, acting [17,38]
>> >> >> >>>     pg 14.882 is activating+degraded, acting [4,36]
>> >> >> >>>     pg 14.88a is activating+degraded, acting [2,37]
>> >> >> >>>     pg 14.893 is activating+degraded, acting [24,36]
>> >> >> >>>     pg 14.897 is active+undersized+degraded, acting [34]
>> >> >> >>>     pg 14.89c is activating+degraded, acting [14,38]
>> >> >> >>>     pg 14.89e is activating+degraded, acting [15,38]
>> >> >> >>>     pg 14.8a8 is active+undersized+degraded, acting [33]
>> >> >> >>>     pg 14.8b1 is activating+degraded, acting [30,38]
>> >> >> >>>     pg 14.8d4 is active+undersized+degraded, acting [13]
>> >> >> >>>     pg 14.8d8 is active+undersized+degraded, acting [4]
>> >> >> >>>     pg 14.8e6 is active+undersized+degraded, acting [10]
>> >> >> >>>     pg 14.8e7 is active+undersized+degraded, acting [1]
>> >> >> >>>     pg 14.8ef is activating+degraded, acting [9,36]
>> >> >> >>>     pg 14.8f8 is active+undersized+degraded, acting [30]
>> >> >> >>>     pg 14.901 is activating+degraded, acting [22,37]
>> >> >> >>>     pg 14.910 is activating+degraded, acting [17,37]
>> >> >> >>>     pg 14.913 is active+undersized+degraded, acting [18]
>> >> >> >>>     pg 20.821 is activating+degraded, acting [37,33]
>> >> >> >>>     pg 20.825 is activating+degraded, acting [25,36]
>> >> >> >>>     pg 20.84f is active+undersized+degraded, acting [2]
>> >> >> >>>     pg 20.85a is active+undersized+degraded, acting [11]
>> >> >> >>>     pg 20.85f is activating+degraded, acting [1,38]
>> >> >> >>>     pg 20.865 is activating+degraded, acting [8,38]
>> >> >> >>>     pg 20.869 is activating+degraded, acting [27,37]
>> >> >> >>>     pg 20.87b is active+undersized+degraded, acting [30]
>> >> >> >>>     pg 20.88b is activating+degraded, acting [6,38]
>> >> >> >>>     pg 20.895 is activating+degraded, acting [37,27]
>> >> >> >>>     pg 20.89c is activating+degraded, acting [1,36]
>> >> >> >>>     pg 20.8a3 is activating+degraded, acting [30,36]
>> >> >> >>>     pg 20.8ad is activating+degraded, acting [1,38]
>> >> >> >>>     pg 20.8af is activating+degraded, acting [33,37]
>> >> >> >>>     pg 20.8b7 is activating+degraded, acting [0,38]
>> >> >> >>>     pg 20.8b9 is activating+degraded, acting [20,38]
>> >> >> >>>     pg 20.8d4 is activating+degraded, acting [28,37]
>> >> >> >>>     pg 20.8d5 is activating+degraded, acting [24,37]
>> >> >> >>>     pg 20.8e0 is activating+degraded, acting [18,37]
>> >> >> >>>     pg 20.8e3 is activating+degraded, acting [21,38]
>> >> >> >>>     pg 20.8ea is activating+degraded, acting [17,36]
>> >> >> >>>     pg 20.8ee is active+undersized+degraded, acting [4]
>> >> >> >>>     pg 20.8f2 is activating+degraded, acting [3,36]
>> >> >> >>>     pg 20.8fb is activating+degraded, acting [10,38]
>> >> >> >>>     pg 20.8fc is activating+degraded, acting [20,38]
>> >> >> >>>     pg 20.913 is activating+degraded, acting [11,37]
>> >> >> >>>     pg 20.916 is active+undersized+degraded, acting [21]
>> >> >> >>>     pg 20.91d is activating+degraded, acting [10,38]
>> >> >> >>> REQUEST_SLOW 116082 slow requests are blocked > 32 sec
>> >> >> >>>     10619 ops are blocked > 2097.15 sec
>> >> >> >>>     74227 ops are blocked > 1048.58 sec
>> >> >> >>>     18561 ops are blocked > 524.288 sec
>> >> >> >>>     10862 ops are blocked > 262.144 sec
>> >> >> >>>     1037 ops are blocked > 131.072 sec
>> >> >> >>>     520 ops are blocked > 65.536 sec
>> >> >> >>>     256 ops are blocked > 32.768 sec
>> >> >> >>>     osd.29 has blocked requests > 32.768 sec
>> >> >> >>>     osd.15 has blocked requests > 262.144 sec
>> >> >> >>>     osds 12,13,31 have blocked requests > 524.288 sec
>> >> >> >>>     osds 1,8,16,19,23,25,26,33,37,38 have blocked requests >
>> 1048.58 sec
>> >> >> >>>     osds 3,4,5,6,10,14,17,22,27,30,32,35,36 have blocked
>> requests > 2097.15 sec
>> >> >> >>> REQUEST_STUCK 551 stuck requests are blocked > 4096 sec
>> >> >> >>>     551 ops are blocked > 4194.3 sec
>> >> >> >>>     osds 0,28 have stuck requests > 4194.3 sec
>> >> >> >>> TOO_MANY_PGS too many PGs per OSD (2709 > max 200)
>> >> >> >>> [root@fre101 ~]#
>> >> >> >>> [root@fre101 ~]#
>> >> >> >>> [root@fre101 ~]#
>> >> >> >>> [root@fre101 ~]#
>> >> >> >>> [root@fre101 ~]#
>> >> >> >>> [root@fre101 ~]#
>> >> >> >>> [root@fre101 ~]# ceph -s
>> >> >> >>> 2019-01-04 05:39:29.364100 7f0fb32f2700 -1 asok(0x7f0fac0017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.1066635.139705286924624.asok': (2)
>> No such file or directory
>> >> >> >>>   cluster:
>> >> >> >>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>> >> >> >>>     health: HEALTH_ERR
>> >> >> >>>             3 pools have many more objects per pg than average
>> >> >> >>>             473825/12392654 objects misplaced (3.823%)
>> >> >> >>>             3723 PGs pending on creation
>> >> >> >>>             Reduced data availability: 6677 pgs inactive, 1948
>> pgs down, 157 pgs peering, 850 pgs stale
>> >> >> >>>             Degraded data redundancy: 306567/12392654 objects
>> degraded (2.474%), 949 pgs degraded, 16 pgs undersized
>> >> >> >>>             98047 slow requests are blocked > 32 sec
>> >> >> >>>             33 stuck requests are blocked > 4096 sec
>> >> >> >>>             too many PGs per OSD (2690 > max 200)
>> >> >> >>>
>> >> >> >>>   services:
>> >> >> >>>     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>> >> >> >>>     mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
>> >> >> >>>     osd: 39 osds: 39 up, 39 in; 76 remapped pgs
>> >> >> >>>     rgw: 1 daemon active
>> >> >> >>>
>> >> >> >>>   data:
>> >> >> >>>     pools:   18 pools, 54656 pgs
>> >> >> >>>     objects: 6051k objects, 10944 GB
>> >> >> >>>     usage:   21934 GB used, 50687 GB / 72622 GB avail
>> >> >> >>>     pgs:     13.267% pgs not active
>> >> >> >>>              306567/12392654 objects degraded (2.474%)
>> >> >> >>>              473825/12392654 objects misplaced (3.823%)
>> >> >> >>>              44937 active+clean
>> >> >> >>>              3850  activating
>> >> >> >>>              1936  active+undersized
>> >> >> >>>              1078  down
>> >> >> >>>              864   stale+down
>> >> >> >>>              597   peering
>> >> >> >>>              591   activating+degraded
>> >> >> >>>              316   active+undersized+degraded
>> >> >> >>>              205   stale+active+clean
>> >> >> >>>              133   stale+activating
>> >> >> >>>              67    activating+remapped
>> >> >> >>>              32    stale+activating+degraded
>> >> >> >>>              21    stale+activating+remapped
>> >> >> >>>              9     stale+active+undersized
>> >> >> >>>              6     down+remapped
>> >> >> >>>              5     stale+activating+undersized+degraded+remapped
>> >> >> >>>              2     activating+degraded+remapped
>> >> >> >>>              1     stale+activating+degraded+remapped
>> >> >> >>>              1     stale+active+undersized+degraded
>> >> >> >>>              1     remapped+peering
>> >> >> >>>              1     active+clean+remapped
>> >> >> >>>              1     stale+remapped+peering
>> >> >> >>>              1     stale+peering
>> >> >> >>>              1     activating+undersized+degraded+remapped
>> >> >> >>>
>> >> >> >>>   io:
>> >> >> >>>     client:   0 B/s rd, 23566 B/s wr, 0 op/s rd, 3 op/s wr
>> >> >> >>>
>> >> >> >>> Thanks
>> >> >> >>>
>> >> >> >>> Arun
>> >> >> >>>
>> >> >> >>> On Fri, Jan 4, 2019 at 5:38 AM Caspar Smit <
>> caspars...@supernas.eu> wrote:
>> >> >> >>>>
>> >> >> >>>> Are the numbers still decreasing?
>> >> >> >>>>
>> >> >> >>>> This one for instance:
>> >> >> >>>>
>> >> >> >>>> "3883 PGs pending on creation"
>> >> >> >>>>
>> >> >> >>>> Caspar
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> On Fri, Jan 4, 2019 at 14:23, Arun POONIA <
>> arun.poo...@nuagenetworks.net> wrote:
>> >> >> >>>>>
>> >> >> >>>>> Hi Caspar,
>> >> >> >>>>>
>> >> >> >>>>> Yes, the cluster was working fine with the PGs-per-OSD
>> warning up until now. I am not sure how to recover from stale/down/inactive
>> PGs. If you happen to know how to do this, can you let me know?
>> >> >> >>>>>
>> >> >> >>>>> Current State:
>> >> >> >>>>>
>> >> >> >>>>> [root@fre101 ~]# ceph -s
>> >> >> >>>>> 2019-01-04 05:22:05.942349 7f314f613700 -1
>> asok(0x7f31480017a0) AdminSocketConfigObs::init: failed:
>> AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2)
>> No such file or directory
>> >> >> >>>>>   cluster:
>> >> >> >>>>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>> >> >> >>>>>     health: HEALTH_ERR
>> >> >> >>>>>             3 pools have many more objects per pg than average
>> >> >> >>>>>             505714/12392650 objects misplaced (4.081%)
>> >> >> >>>>>             3883 PGs pending on creation
>> >> >> >>>>>             Reduced data availability: 6519 pgs inactive,
>> 1870 pgs down, 1 pg peering, 886 pgs stale
>> >> >> >>>>>             Degraded data redundancy: 42987/12392650 objects
>> degraded (0.347%), 634 pgs degraded, 16 pgs undersized
>> >> >> >>>>>             125827 slow requests are blocked > 32 sec
>> >> >> >>>>>             2 stuck requests are blocked > 4096 sec
>> >> >> >>>>>             too many PGs per OSD (2758 > max 200)
>> >> >> >>>>>
>> >> >> >>>>>   services:
>> >> >> >>>>>     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>> >> >> >>>>>     mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
>> >> >> >>>>>     osd: 39 osds: 39 up, 39 in; 76 remapped pgs
>> >> >> >>>>>     rgw: 1 daemon active
>> >> >> >>>>>
>> >> >> >>>>>   data:
>> >> >> >>>>>     pools:   18 pools, 54656 pgs
>> >> >> >>>>>     objects: 6051k objects, 10944 GB
>> >> >> >>>>>     usage:   21933 GB used, 50688 GB / 72622 GB avail
>> >> >> >>>>>     pgs:     11.927% pgs not active
>> >> >> >>>>>              42987/12392650 objects degraded (0.347%)
>> >> >> >>>>>              505714/12392650 objects misplaced (4.081%)
>> >> >> >>>>>              48080 active+clean
>> >> >> >>>>>              3885  activating
>> >> >> >>>>>              1111  down
>> >> >> >>>>>              759   stale+down
>> >> >> >>>>>              614   activating+degraded
>> >> >> >>>>>              74    activating+remapped
>> >> >> >>>>>              46    stale+active+clean
>> >> >> >>>>>              35    stale+activating
>> >> >> >>>>>              21    stale+activating+remapped
>> >> >> >>>>>              9     stale+active+undersized
>> >> >> >>>>>              9     stale+activating+degraded
>> >> >> >>>>>              5
>>  stale+activating+undersized+degraded+remapped
>> >> >> >>>>>              3     activating+degraded+remapped
>> >> >> >>>>>              1     stale+activating+degraded+remapped
>> >> >> >>>>>              1     stale+active+undersized+degraded
>> >> >> >>>>>              1     remapped+peering
>> >> >> >>>>>              1     active+clean+remapped
>> >> >> >>>>>              1     activating+undersized+degraded+remapped
>> >> >> >>>>>
>> >> >> >>>>>   io:
>> >> >> >>>>>     client:   0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr
>> >> >> >>>>>
>> >> >> >>>>> I will update the number of PGs per OSD once these inactive or
>> stale PGs come online. I am not able to access the VMs (vms, images) which are
>> using Ceph.
>> >> >> >>>>>
>> >> >> >>>>> Thanks
>> >> >> >>>>> Arun
>> >> >> >>>>>
>> >> >> >>>>> On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit <
>> caspars...@supernas.eu> wrote:
>> >> >> >>>>>>
>> >> >> >>>>>> Hi Arun,
>> >> >> >>>>>>
>> >> >> >>>>>> How did you end up with a 'working' cluster with so many pgs
>> per OSD?
>> >> >> >>>>>>
>> >> >> >>>>>> "too many PGs per OSD (2968 > max 200)"
>> >> >> >>>>>>
>> >> >> >>>>>> To (temporarily) allow this many PGs per OSD you could
>> try this:
>> >> >> >>>>>>
>> >> >> >>>>>> Change these values in the global section in your ceph.conf:
>> >> >> >>>>>>
>> >> >> >>>>>> mon max pg per osd = 200
>> >> >> >>>>>> osd max pg per osd hard ratio = 2
>> >> >> >>>>>>
>> >> >> >>>>>> It allows 200*2 = 400 PGs per OSD before disabling the
>> creation of new pgs.
>> >> >> >>>>>>
>> >> >> >>>>>> Above are the defaults (for Luminous, maybe other versions
>> too)
>> >> >> >>>>>> You can check your current settings with:
>> >> >> >>>>>>
>> >> >> >>>>>> ceph daemon mon.ceph-mon01 config show |grep pg_per_osd
>> >> >> >>>>>>
>> >> >> >>>>>> Since your current PGs-per-OSD ratio is way higher than the
>> default, you could set them to, for instance:
>> >> >> >>>>>>
>> >> >> >>>>>> mon max pg per osd = 1000
>> >> >> >>>>>> osd max pg per osd hard ratio = 5
>> >> >> >>>>>>
>> >> >> >>>>>> Which allows for 5000 PGs per OSD before disabling the creation
>> of new PGs.
>> >> >> >>>>>>
>> >> >> >>>>>> You'll need to inject the setting into the mons/osds and
>> restart mgrs to make them active.
>> >> >> >>>>>>
>> >> >> >>>>>> ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
>> >> >> >>>>>> ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
>> >> >> >>>>>> ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
>> >> >> >>>>>> ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
>> >> >> >>>>>> restart mgrs
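>> >> >> >>>>>>
>> >> >> >>>>>> For the mgr restart, something like this on each mgr host should
>> >> >> >>>>>> work (the exact systemd unit name depends on your deployment):
>> >> >> >>>>>> systemctl restart ceph-mgr@ceph-mon03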
>> >> >> >>>>>>
>> >> >> >>>>>> Kind regards,
>> >> >> >>>>>> Caspar
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>> On Fri, Jan 4, 2019 at 04:28, Arun POONIA <
>> arun.poo...@nuagenetworks.net> wrote:
>> >> >> >>>>>>>
>> >> >> >>>>>>> Hi Chris,
>> >> >> >>>>>>>
>> >> >> >>>>>>> Indeed that's what happened. I didn't set the noout flag either,
>> and I zapped the disks on the new server every time. In my cluster status, fre201
>> is the only new server.
>> >> >> >>>>>>>
>> >> >> >>>>>>> Current Status after enabling 3 OSDs on fre201 host.
>> >> >> >>>>>>>
>> >> >> >>>>>>> [root@fre201 ~]# ceph osd tree
>> >> >> >>>>>>> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
>> >> >> >>>>>>>  -1       70.92137 root default
>> >> >> >>>>>>>  -2        5.45549     host fre101
>> >> >> >>>>>>>   0   hdd  1.81850         osd.0       up  1.00000 1.00000
>> >> >> >>>>>>>   1   hdd  1.81850         osd.1       up  1.00000 1.00000
>> >> >> >>>>>>>   2   hdd  1.81850         osd.2       up  1.00000 1.00000
>> >> >> >>>>>>>  -9        5.45549     host fre103
>> >> >> >>>>>>>   3   hdd  1.81850         osd.3       up  1.00000 1.00000
>> >> >> >>>>>>>   4   hdd  1.81850         osd.4       up  1.00000 1.00000
>> >> >> >>>>>>>   5   hdd  1.81850         osd.5       up  1.00000 1.00000
>> >> >> >>>>>>>  -3        5.45549     host fre105
>> >> >> >>>>>>>   6   hdd  1.81850         osd.6       up  1.00000 1.00000
>> >> >> >>>>>>>   7   hdd  1.81850         osd.7       up  1.00000 1.00000
>> >> >> >>>>>>>   8   hdd  1.81850         osd.8       up  1.00000 1.00000
>> >> >> >>>>>>>  -4        5.45549     host fre107
>> >> >> >>>>>>>   9   hdd  1.81850         osd.9       up  1.00000 1.00000
>> >> >> >>>>>>>  10   hdd  1.81850         osd.10      up  1.00000 1.00000
>> >> >> >>>>>>>  11   hdd  1.81850         osd.11      up  1.00000 1.00000
>> >> >> >>>>>>>  -5        5.45549     host fre109
>> >> >> >>>>>>>  12   hdd  1.81850         osd.12      up  1.00000 1.00000
>> >> >> >>>>>>>  13   hdd  1.81850         osd.13      up  1.00000 1.00000
>> >> >> >>>>>>>  14   hdd  1.81850         osd.14      up  1.00000 1.00000
>> >> >> >>>>>>>  -6        5.45549     host fre111
>> >> >> >>>>>>>  15   hdd  1.81850         osd.15      up  1.00000 1.00000
>> >> >> >>>>>>>  16   hdd  1.81850         osd.16      up  1.00000 1.00000
>> >> >> >>>>>>>  17   hdd  1.81850         osd.17      up  0.79999 1.00000
>> >> >> >>>>>>>  -7        5.45549     host fre113
>> >> >> >>>>>>>  18   hdd  1.81850         osd.18      up  1.00000 1.00000
>> >> >> >>>>>>>  19   hdd  1.81850         osd.19      up  1.00000 1.00000
>> >> >> >>>>>>>  20   hdd  1.81850         osd.20      up  1.00000 1.00000
>> >> >> >>>>>>>  -8        5.45549     host fre115
>> >> >> >>>>>>>  21   hdd  1.81850         osd.21      up  1.00000 1.00000
>> >> >> >>>>>>>  22   hdd  1.81850         osd.22      up  1.00000 1.00000
>> >> >> >>>>>>>  23   hdd  1.81850         osd.23      up  1.00000 1.00000
>> >> >> >>>>>>> -10        5.45549     host fre117
>> >> >> >>>>>>>  24   hdd  1.81850         osd.24      up  1.00000 1.00000
>> >> >> >>>>>>>  25   hdd  1.81850         osd.25      up  1.00000 1.00000
>> >> >> >>>>>>>  26   hdd  1.81850         osd.26      up  1.00000 1.00000
>> >> >> >>>>>>> -11        5.45549     host fre119
>> >> >> >>>>>>>  27   hdd  1.81850         osd.27      up  1.00000 1.00000
>> >> >> >>>>>>>  28   hdd  1.81850         osd.28      up  1.00000 1.00000
>> >> >> >>>>>>>  29   hdd  1.81850         osd.29      up  1.00000 1.00000
>> >> >> >>>>>>> -12        5.45549     host fre121
>> >> >> >>>>>>>  30   hdd  1.81850         osd.30      up  1.00000 1.00000
>> >> >> >>>>>>>  31   hdd  1.81850         osd.31      up  1.00000 1.00000
>> >> >> >>>>>>>  32   hdd  1.81850         osd.32      up  1.00000 1.00000
>> >> >> >>>>>>> -13        5.45549     host fre123
>> >> >> >>>>>>>  33   hdd  1.81850         osd.33      up  1.00000 1.00000
>> >> >> >>>>>>>  34   hdd  1.81850         osd.34      up  1.00000 1.00000
>> >> >> >>>>>>>  35   hdd  1.81850         osd.35      up  1.00000 1.00000
>> >> >> >>>>>>> -27        5.45549     host fre201
>> >> >> >>>>>>>  36   hdd  1.81850         osd.36      up  1.00000 1.00000
>> >> >> >>>>>>>  37   hdd  1.81850         osd.37      up  1.00000 1.00000
>> >> >> >>>>>>>  38   hdd  1.81850         osd.38      up  1.00000 1.00000
>> >> >> >>>>>>> [root@fre201 ~]#
>> >> >> >>>>>>> [root@fre201 ~]#
>> >> >> >>>>>>> [root@fre201 ~]#
>> >> >> >>>>>>> [root@fre201 ~]#
>> >> >> >>>>>>> [root@fre201 ~]#
>> >> >> >>>>>>> [root@fre201 ~]# ceph -s
>> >> >> >>>>>>>   cluster:
>> >> >> >>>>>>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>> >> >> >>>>>>>     health: HEALTH_ERR
>> >> >> >>>>>>>             3 pools have many more objects per pg than
>> average
>> >> >> >>>>>>>             585791/12391450 objects misplaced (4.727%)
>> >> >> >>>>>>>             2 scrub errors
>> >> >> >>>>>>>             2374 PGs pending on creation
>> >> >> >>>>>>>             Reduced data availability: 6578 pgs inactive,
>> 2025 pgs down, 74 pgs peering, 1234 pgs stale
>> >> >> >>>>>>>             Possible data damage: 2 pgs inconsistent
>> >> >> >>>>>>>             Degraded data redundancy: 64969/12391450
>> objects degraded (0.524%), 616 pgs degraded, 20 pgs undersized
>> >> >> >>>>>>>             96242 slow requests are blocked > 32 sec
>> >> >> >>>>>>>             228 stuck requests are blocked > 4096 sec
>> >> >> >>>>>>>             too many PGs per OSD (2768 > max 200)
>> >> >> >>>>>>>
>> >> >> >>>>>>>   services:
>> >> >> >>>>>>>     mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
>> >> >> >>>>>>>     mgr: ceph-mon03(active), standbys: ceph-mon01,
>> ceph-mon02
>> >> >> >>>>>>>     osd: 39 osds: 39 up, 39 in; 96 remapped pgs
>> >> >> >>>>>>>     rgw: 1 daemon active
>> >> >> >>>>>>>
>> >> >> >>>>>>>   data:
>> >> >> >>>>>>>     pools:   18 pools, 54656 pgs
>> >> >> >>>>>>>     objects: 6050k objects, 10942 GB
>> >> >> >>>>>>>     usage:   21900 GB used, 50721 GB / 72622 GB avail
>> >> >> >>>>>>>     pgs:     0.002% pgs unknown
>> >> >> >>>>>>>              12.050% pgs not active
>> >> >> >>>>>>>              64969/12391450 objects degraded (0.524%)
>> >> >> >>>>>>>              585791/12391450 objects misplaced (4.727%)
>> >> >> >>>>>>>              47489 active+clean
>> >> >> >>>>>>>              3670  activating
>> >> >> >>>>>>>              1098  stale+down
>> >> >> >>>>>>>              923   down
>> >> >> >>>>>>>              575   activating+degraded
>> >> >> >>>>>>>              563   stale+active+clean
>> >> >> >>>>>>>              105   stale+activating
>> >> >> >>>>>>>              78    activating+remapped
>> >> >> >>>>>>>              72    peering
>> >> >> >>>>>>>              25    stale+activating+degraded
>> >> >> >>>>>>>              23    stale+activating+remapped
>> >> >> >>>>>>>              9     stale+active+undersized
>> >> >> >>>>>>>              6
>>  stale+activating+undersized+degraded+remapped
>> >> >> >>>>>>>              5     stale+active+undersized+degraded
>> >> >> >>>>>>>              4     down+remapped
>> >> >> >>>>>>>              4     activating+degraded+remapped
>> >> >> >>>>>>>              2     active+clean+inconsistent
>> >> >> >>>>>>>              1     stale+activating+degraded+remapped
>> >> >> >>>>>>>              1     stale+active+clean+remapped
>> >> >> >>>>>>>              1     stale+remapped+peering
>> >> >> >>>>>>>              1     remapped+peering
>> >> >> >>>>>>>              1     unknown
>> >> >> >>>>>>>
>> >> >> >>>>>>>   io:
>> >> >> >>>>>>>     client:   0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>> Thanks
>> >> >> >>>>>>> Arun
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>> On Thu, Jan 3, 2019 at 7:19 PM Chris <
>> bitskr...@bitskrieg.net> wrote:
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> If you added OSDs and then deleted them repeatedly without
>> waiting for replication to finish as the cluster attempted to re-balance
>> across them, it's highly likely that you are permanently missing PGs
>> (especially if the disks were zapped each time).
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> If those 3 down OSDs can be revived there is a (small)
>> chance that you can right the ship, but 1400 PGs/OSD is pretty extreme. I'm
>> surprised the cluster even let you do that - this sounds like a data loss
>> event.
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> Bring back the 3 OSDs and see what those 2 inconsistent PGs
>> look like with ceph pg query.
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> On January 3, 2019 21:59:38 Arun POONIA <
>> arun.poo...@nuagenetworks.net> wrote:
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> Hi,
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> Recently I tried adding a new node (OSD) to the Ceph cluster
>> using the ceph-deploy tool. I was experimenting with the tool and ended up
>> deleting the OSDs on the new server a couple of times.
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> Now that the Ceph OSDs are running on the new server, the cluster's PGs
>> seem to be inactive (10-15%) and they are not recovering or rebalancing.
>> I'm not sure what to do. I tried shutting down the OSDs on the new server.
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> Status:
>> >> >> >>>>>>>>> [root@fre105 ~]# ceph -s
>> >> >> >>>>>>>>> 2019-01-03 18:56:42.867081 7fa0bf573700 -1
>> asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed:
>> AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2)
>> No such file or directory
>> >> >> >>>>>>>>>   cluster:
>> >> >> >>>>>>>>>     id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
>> >> >> >>>>>>>>>     health: HEALTH_ERR
>> >> >> >>>>>>>>>             3 pools have many more objects per pg than
>> average
>> >> >> >>>>>>>>>             373907/12391198 objects misplaced (3.018%)
>> >> >> >>>>>>>>>             2 scrub errors
>> >> >> >>>>>>>>>             9677 PGs pending on creation
>> >> >> >>>>>>>>>             Reduced data availability: 7145 pgs inactive,
>> 6228 pgs down, 1 pg peering, 2717 pgs stale
>> >> >> >>>>>>>>>             Possible data damage: 2 pgs inconsistent
>> >> >> >>>>>>>>>             Degraded data redundancy: 178350/12391198
>> objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized
>> >> >> >>>>>>>>>             52486 slow requests are blocked > 32 sec
>> >> >> >>>>>>>>>             9287 stuck requests are blocked > 4096 sec
>> >> >> >>>>>>>>>             too many PGs per OSD (2968 > max 200)
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>   services:
>> >> >> >>>>>>>>>     mon: 3 daemons, quorum
>> ceph-mon01,ceph-mon02,ceph-mon03
>> >> >> >>>>>>>>>     mgr: ceph-mon03(active), standbys: ceph-mon01,
>> ceph-mon02
>> >> >> >>>>>>>>>     osd: 39 osds: 36 up, 36 in; 51 remapped pgs
>> >> >> >>>>>>>>>     rgw: 1 daemon active
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>   data:
>> >> >> >>>>>>>>>     pools:   18 pools, 54656 pgs
>> >> >> >>>>>>>>>     objects: 6050k objects, 10941 GB
>> >> >> >>>>>>>>>     usage:   21727 GB used, 45308 GB / 67035 GB avail
>> >> >> >>>>>>>>>     pgs:     13.073% pgs not active
>> >> >> >>>>>>>>>              178350/12391198 objects degraded (1.439%)
>> >> >> >>>>>>>>>              373907/12391198 objects misplaced (3.018%)
>> >> >> >>>>>>>>>              46177 active+clean
>> >> >> >>>>>>>>>              5054  down
>> >> >> >>>>>>>>>              1173  stale+down
>> >> >> >>>>>>>>>              1084  stale+active+undersized
>> >> >> >>>>>>>>>              547   activating
>> >> >> >>>>>>>>>              201   stale+active+undersized+degraded
>> >> >> >>>>>>>>>              158   stale+activating
>> >> >> >>>>>>>>>              96    activating+degraded
>> >> >> >>>>>>>>>              46    stale+active+clean
>> >> >> >>>>>>>>>              42    activating+remapped
>> >> >> >>>>>>>>>              34    stale+activating+degraded
>> >> >> >>>>>>>>>              23    stale+activating+remapped
>> >> >> >>>>>>>>>              6
>>  stale+activating+undersized+degraded+remapped
>> >> >> >>>>>>>>>              6     activating+undersized+degraded+remapped
>> >> >> >>>>>>>>>              2     activating+degraded+remapped
>> >> >> >>>>>>>>>              2     active+clean+inconsistent
>> >> >> >>>>>>>>>              1     stale+activating+degraded+remapped
>> >> >> >>>>>>>>>              1     stale+active+clean+remapped
>> >> >> >>>>>>>>>              1     stale+remapped
>> >> >> >>>>>>>>>              1     down+remapped
>> >> >> >>>>>>>>>              1     remapped+peering
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>   io:
>> >> >> >>>>>>>>>     client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s
>> wr
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> Thanks
>> >> >> >>>>>>>>> --
>> >> >> >>>>>>>>> Arun Poonia
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>> --
>> >> >> >>>>>>> Arun Poonia
>> >> >> >>>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> --
>> >> >> >>>>> Arun Poonia
>> >> >> >>>>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Arun Poonia
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Arun Poonia
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Arun Poonia
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Arun Poonia
>> >> >
>> >
>> >
>> >
>> > --
>> > Arun Poonia
>> >
>>
>
>
> --
> Arun Poonia
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
