Hi Paul,

thanks! I know, I read the whole thread about size 2 some months ago. But that was not my decision; I had to set it up like that.

In the meantime I rebooted node1001 and node1002 with the "noout" flag set, and now peering has finished and only 0.0x% of the objects are being rebalanced. IO is flowing again. This happened as soon as the OSD was down (not out). That looks very much like a bug to me, doesn't it? Restarting an OSD to "repair" CRUSH?
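For reference, the procedure was roughly the following (from memory; the OSD ID is just an example, I restarted all OSDs on both nodes):

ceph osd set noout                  # keep the OSDs from being marked out during the restart
systemctl restart ceph-osd@0        # or simply reboot the whole node
ceph -s                             # wait until peering has finished
ceph osd unset noout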
Also, I queried the PG, but it did not show any error. It just lists stats and that the PG has been active since 8:40 this morning. There are rows with "blocked by" but no value; is that supposed to be filled with data?

Kind regards,
Kevin

2018-05-17 16:45 GMT+02:00 Paul Emmerich <paul.emmer...@croit.io>:

> Check ceph pg query, it will (usually) tell you why something is stuck
> inactive.
>
> Also: never do min_size 1.
>
> Paul
>
> 2018-05-17 15:48 GMT+02:00 Kevin Olbrich <k...@sv01.de>:
>
>> I was able to obtain another NVMe to get the HDDs in node1004 into the
>> cluster.
>> The number of disks (all 1 TB) is now balanced between the racks, but
>> there are still some inactive PGs:
>>
>>   data:
>>     pools:   2 pools, 1536 pgs
>>     objects: 639k objects, 2554 GB
>>     usage:   5167 GB used, 14133 GB / 19300 GB avail
>>     pgs:     1.562% pgs not active
>>              1183/1309952 objects degraded (0.090%)
>>              199660/1309952 objects misplaced (15.242%)
>>              1072 active+clean
>>               405 active+remapped+backfill_wait
>>                35 active+remapped+backfilling
>>                21 activating+remapped
>>                 3 activating+undersized+degraded+remapped
>>
>> ID  CLASS WEIGHT   TYPE NAME                    STATUS REWEIGHT PRI-AFF
>>  -1       18.85289 root default
>> -16       18.85289     datacenter dc01
>> -19       18.85289         pod dc01-agg01
>> -10        8.98700             rack dc01-rack02
>>  -4        4.03899                 host node1001
>>   0   hdd  0.90999                     osd.0        up  1.00000 1.00000
>>   1   hdd  0.90999                     osd.1        up  1.00000 1.00000
>>   5   hdd  0.90999                     osd.5        up  1.00000 1.00000
>>   2   ssd  0.43700                     osd.2        up  1.00000 1.00000
>>   3   ssd  0.43700                     osd.3        up  1.00000 1.00000
>>   4   ssd  0.43700                     osd.4        up  1.00000 1.00000
>>  -7        4.94899                 host node1002
>>   9   hdd  0.90999                     osd.9        up  1.00000 1.00000
>>  10   hdd  0.90999                     osd.10       up  1.00000 1.00000
>>  11   hdd  0.90999                     osd.11       up  1.00000 1.00000
>>  12   hdd  0.90999                     osd.12       up  1.00000 1.00000
>>   6   ssd  0.43700                     osd.6        up  1.00000 1.00000
>>   7   ssd  0.43700                     osd.7        up  1.00000 1.00000
>>   8   ssd  0.43700                     osd.8        up  1.00000 1.00000
>> -11        9.86589             rack dc01-rack03
>> -22        5.38794                 host node1003
>>  17   hdd  0.90999                     osd.17       up  1.00000 1.00000
>>  18   hdd  0.90999                     osd.18       up  1.00000 1.00000
>>  24   hdd  0.90999                     osd.24       up  1.00000 1.00000
>>  26   hdd  0.90999                     osd.26       up  1.00000 1.00000
>>  13   ssd  0.43700                     osd.13       up  1.00000 1.00000
>>  14   ssd  0.43700                     osd.14       up  1.00000 1.00000
>>  15   ssd  0.43700                     osd.15       up  1.00000 1.00000
>>  16   ssd  0.43700                     osd.16       up  1.00000 1.00000
>> -25        4.47795                 host node1004
>>  23   hdd  0.90999                     osd.23       up  1.00000 1.00000
>>  25   hdd  0.90999                     osd.25       up  1.00000 1.00000
>>  27   hdd  0.90999                     osd.27       up  1.00000 1.00000
>>  19   ssd  0.43700                     osd.19       up  1.00000 1.00000
>>  20   ssd  0.43700                     osd.20       up  1.00000 1.00000
>>  21   ssd  0.43700                     osd.21       up  1.00000 1.00000
>>  22   ssd  0.43700                     osd.22       up  1.00000 1.00000
>>
>> The pools are size 2, min_size 1 during setup.
>>
>> The count of PGs in the "activating" state is related to the weight of
>> the OSDs, but why do they fail to proceed to active+clean or
>> active+remapped?
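>>
>> For reference, the non-active PGs can be listed and inspected like this
>> (the pgid below is just an example):
>>
>> ceph pg dump_stuck inactive
>> ceph pg 1.23 query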
>>
>> Kind regards,
>> Kevin
>>
>> 2018-05-17 14:05 GMT+02:00 Kevin Olbrich <k...@sv01.de>:
>>
>>> OK, I just waited some time, but I still have some "activating" issues:
>>>
>>>   data:
>>>     pools:   2 pools, 1536 pgs
>>>     objects: 639k objects, 2554 GB
>>>     usage:   5194 GB used, 11312 GB / 16506 GB avail
>>>     pgs:     7.943% pgs not active
>>>              5567/1309948 objects degraded (0.425%)
>>>              195386/1309948 objects misplaced (14.916%)
>>>              1147 active+clean
>>>               235 active+remapped+backfill_wait
>>>              *107 activating+remapped*
>>>                32 active+remapped+backfilling
>>>               *15 activating+undersized+degraded+remapped*
>>>
>>> I set these settings at runtime:
>>> ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
>>> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
>>> ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800'
>>> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
>>>
>>> Sure, mon_max_pg_per_osd is oversized, but that is only temporary. The
>>> calculated number of PGs per OSD is 200.
>>>
>>> I searched the net and the bug tracker; most posts suggest
>>> osd_max_pg_per_osd_hard_ratio = 32 to fix this issue, but this time I
>>> ended up with even more stuck PGs.
>>>
>>> Any more hints?
>>>
>>> Kind regards,
>>> Kevin
>>>
>>> 2018-05-17 13:37 GMT+02:00 Kevin Olbrich <k...@sv01.de>:
>>>
>>>> PS: The cluster is currently size 2. I used PGCalc on the Ceph
>>>> website, which by default will place 200 PGs on each OSD.
>>>> I read about the protection in the docs and later realized that I had
>>>> better placed only 100 PGs per OSD.
>>>>
>>>> 2018-05-17 13:35 GMT+02:00 Kevin Olbrich <k...@sv01.de>:
>>>>
>>>>> Hi!
>>>>>
>>>>> Thanks for your quick reply.
>>>>> Before I read your mail, I applied the following setting to my OSDs:
>>>>> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
>>>>>
>>>>> The status is now:
>>>>>
>>>>>   data:
>>>>>     pools:   2 pools, 1536 pgs
>>>>>     objects: 639k objects, 2554 GB
>>>>>     usage:   5211 GB used, 11295 GB / 16506 GB avail
>>>>>     pgs:     7.943% pgs not active
>>>>>              5567/1309948 objects degraded (0.425%)
>>>>>              252327/1309948 objects misplaced (19.262%)
>>>>>              1030 active+clean
>>>>>               351 active+remapped+backfill_wait
>>>>>               107 activating+remapped
>>>>>                33 active+remapped+backfilling
>>>>>                15 activating+undersized+degraded+remapped
>>>>>
>>>>> A little better, but still some non-active PGs.
>>>>> I will look into your other hints!
>>>>>
>>>>> Thanks
>>>>> Kevin
>>>>>
>>>>> 2018-05-17 13:30 GMT+02:00 Burkhard Linke <
>>>>> burkhard.li...@computational.bio.uni-giessen.de>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> Today I added some new OSDs (nearly doubling the count) to my
>>>>>>> Luminous cluster.
>>>>>>> I then changed pg(p)_num from 256 to 1024 for that pool because it
>>>>>>> was complaining about too few PGs. (I noticed that this should
>>>>>>> better have been done in small steps.)
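>>>>>>>
>>>>>>> In hindsight, something like this would probably have been safer,
>>>>>>> pausing after each step until the cluster is healthy again
>>>>>>> ("mypool" is a placeholder for the pool name):
>>>>>>>
>>>>>>> ceph osd pool set mypool pg_num 512
>>>>>>> ceph osd pool set mypool pgp_num 512
>>>>>>> ceph osd pool set mypool pg_num 1024
>>>>>>> ceph osd pool set mypool pgp_num 1024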
>>>>>>>
>>>>>>> This is the current status:
>>>>>>>
>>>>>>>   health: HEALTH_ERR
>>>>>>>           336568/1307562 objects misplaced (25.740%)
>>>>>>>           Reduced data availability: 128 pgs inactive, 3 pgs
>>>>>>>           peering, 1 pg stale
>>>>>>>           Degraded data redundancy: 6985/1307562 objects degraded
>>>>>>>           (0.534%), 19 pgs degraded, 19 pgs undersized
>>>>>>>           107 slow requests are blocked > 32 sec
>>>>>>>           218 stuck requests are blocked > 4096 sec
>>>>>>>
>>>>>>>   data:
>>>>>>>     pools:   2 pools, 1536 pgs
>>>>>>>     objects: 638k objects, 2549 GB
>>>>>>>     usage:   5210 GB used, 11295 GB / 16506 GB avail
>>>>>>>     pgs:     0.195% pgs unknown
>>>>>>>              8.138% pgs not active
>>>>>>>              6985/1307562 objects degraded (0.534%)
>>>>>>>              336568/1307562 objects misplaced (25.740%)
>>>>>>>              855 active+clean
>>>>>>>              517 active+remapped+backfill_wait
>>>>>>>              107 activating+remapped
>>>>>>>               31 active+remapped+backfilling
>>>>>>>               15 activating+undersized+degraded+remapped
>>>>>>>                4 active+undersized+degraded+remapped+backfilling
>>>>>>>                3 unknown
>>>>>>>                3 peering
>>>>>>>                1 stale+active+clean
>>>>>>
>>>>>> You need to resolve the unknown/peering/activating PGs first. You
>>>>>> have 1536 PGs; assuming replication size 3, this makes 4608 PG
>>>>>> copies. Given 25 OSDs and the heterogeneous host sizes, I assume
>>>>>> that some OSDs hold more than 200 PGs. There is a threshold for the
>>>>>> number of PGs; reaching this threshold keeps the OSDs from accepting
>>>>>> new PGs.
>>>>>>
>>>>>> Try to increase the threshold (mon_max_pg_per_osd /
>>>>>> max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure
>>>>>> about the exact one, consult the documentation) to allow more PGs on
>>>>>> the OSDs. If this is the cause of the problem, the peering and
>>>>>> activating states should be resolved within a short time.
>>>>>>
>>>>>> You can also check the number of PGs per OSD with 'ceph osd df'; the
>>>>>> last column is the current number of PGs.
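>>>>>>
>>>>>> For example (option names from the Luminous docs; the values are
>>>>>> only suggestions and should be tuned to your setup):
>>>>>>
>>>>>> ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 400'
>>>>>> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 4'
>>>>>> ceph osd df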
>>>>>>
>>>>>>> OSD tree:
>>>>>>>
>>>>>>> ID  CLASS WEIGHT   TYPE NAME                    STATUS REWEIGHT PRI-AFF
>>>>>>>  -1       16.12177 root default
>>>>>>> -16       16.12177     datacenter dc01
>>>>>>> -19       16.12177         pod dc01-agg01
>>>>>>> -10        8.98700             rack dc01-rack02
>>>>>>>  -4        4.03899                 host node1001
>>>>>>>   0   hdd  0.90999                     osd.0        up  1.00000 1.00000
>>>>>>>   1   hdd  0.90999                     osd.1        up  1.00000 1.00000
>>>>>>>   5   hdd  0.90999                     osd.5        up  1.00000 1.00000
>>>>>>>   2   ssd  0.43700                     osd.2        up  1.00000 1.00000
>>>>>>>   3   ssd  0.43700                     osd.3        up  1.00000 1.00000
>>>>>>>   4   ssd  0.43700                     osd.4        up  1.00000 1.00000
>>>>>>>  -7        4.94899                 host node1002
>>>>>>>   9   hdd  0.90999                     osd.9        up  1.00000 1.00000
>>>>>>>  10   hdd  0.90999                     osd.10       up  1.00000 1.00000
>>>>>>>  11   hdd  0.90999                     osd.11       up  1.00000 1.00000
>>>>>>>  12   hdd  0.90999                     osd.12       up  1.00000 1.00000
>>>>>>>   6   ssd  0.43700                     osd.6        up  1.00000 1.00000
>>>>>>>   7   ssd  0.43700                     osd.7        up  1.00000 1.00000
>>>>>>>   8   ssd  0.43700                     osd.8        up  1.00000 1.00000
>>>>>>> -11        7.13477             rack dc01-rack03
>>>>>>> -22        5.38678                 host node1003
>>>>>>>  17   hdd  0.90970                     osd.17       up  1.00000 1.00000
>>>>>>>  18   hdd  0.90970                     osd.18       up  1.00000 1.00000
>>>>>>>  24   hdd  0.90970                     osd.24       up  1.00000 1.00000
>>>>>>>  26   hdd  0.90970                     osd.26       up  1.00000 1.00000
>>>>>>>  13   ssd  0.43700                     osd.13       up  1.00000 1.00000
>>>>>>>  14   ssd  0.43700                     osd.14       up  1.00000 1.00000
>>>>>>>  15   ssd  0.43700                     osd.15       up  1.00000 1.00000
>>>>>>>  16   ssd  0.43700                     osd.16       up  1.00000 1.00000
>>>>>>> -25        1.74799                 host node1004
>>>>>>>  19   ssd  0.43700                     osd.19       up  1.00000 1.00000
>>>>>>>  20   ssd  0.43700                     osd.20       up  1.00000 1.00000
>>>>>>>  21   ssd  0.43700                     osd.21       up  1.00000 1.00000
>>>>>>>  22   ssd  0.43700                     osd.22       up  1.00000 1.00000
>>>>>>>
>>>>>>> The CRUSH rule is set to chooseleaf rack and (temporarily!) to
>>>>>>> size 2.
>>>>>>> Why are PGs stuck in peering and activating?
>>>>>>> "ceph df" shows that only 1.5 TB is used on the pool, residing on
>>>>>>> the HDDs, which would perfectly fit the CRUSH rule... (?)
>>>>>>
>>>>>> Size 2 within the CRUSH rule, or size 2 for the two pools?
>>>>>>
>>>>>> Regards,
>>>>>> Burkhard
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com