Unfortunately I don't have much experience recovering struggling EC pools. It looks like the cluster can't find OSDs for 2 out of the 6 shards. Since you run EC 4+2 the data isn't lost, but I'm not 100% sure how to make the PGs healthy again. There was a thread a while back about a similar issue, albeit with possibly different underlying causes, but maybe there is some helpful advice in there: https://www.mail-archive.com/ceph-users@ceph.io/msg06854.html

I suspect your OSDs frequently suiciding is what put these PGs into this state.
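Before doing anything destructive I would at least check which OSDs the PGs still think might hold the missing shards and which objects are actually unfound. Roughly something like this (a sketch rather than a recipe - adjust the PG id, 28.5b is just the one from your pgs_brief output):

    # summary of unfound objects and which PGs are affected
    ceph health detail
    # per-PG detail - look at the recovery_state / might_have_unfound section
    ceph pg 28.5b query
    # list the objects the PG cannot locate
    ceph pg 28.5b list_unfound

If any OSD listed under might_have_unfound is shown as down or "not queried", getting it up even briefly is often enough for the objects to be found again. ceph pg 28.5b mark_unfound_lost delete really is the last resort - as far as I know revert isn't available for EC pools, so that would throw those 4 objects away.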
Maybe someone else on the list has some ideas.

On Fri, 1 Oct 2021 at 18:13, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
>
> Thank you very much Christian, maybe you have an idea how I can take the
> cluster out of this state? Something is blocking the recovery and the
> rebalance, something is stuck somewhere, that's why I can't increase the pg
> count further. I don't have the auto pg scaler or anything, it is just set
> to warn.
>
> If I set the min_size of the pool to 4, will this pg be recovered? Or how
> can I take the cluster out of health error like this?
>
> Marking as lost seems risky based on some mailing list experience - even
> after marking lost you can still have issues - so I'm curious what the way
> is to take the cluster out of this and let it recover:
>
> Example problematic pg:
>
> dumped pgs_brief
> PG_STAT  STATE                                                  UP                 UP_PRIMARY  ACTING                              ACTING_PRIMARY
> 28.5b    active+recovery_unfound+undersized+degraded+remapped  [18,33,10,0,48,1]  18          [2147483647,2147483647,29,21,4,47]  29
>
> Cluster state:
>
>   cluster:
>     id:     5a07ec50-4eee-4336-aa11-46ca76edcc24
>     health: HEALTH_ERR
>             10 OSD(s) experiencing BlueFS spillover
>             4/1055070542 objects unfound (0.000%)
>             noout flag(s) set
>             Possible data damage: 2 pgs recovery_unfound
>             Degraded data redundancy: 64150765/6329079237 objects degraded (1.014%), 10 pgs degraded, 26 pgs undersized
>             4 pgs not deep-scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
>     mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
>     osd: 49 osds: 49 up (since 36m), 49 in (since 4d); 28 remapped pgs
>          flags noout
>     rgw: 3 daemons active (mon-2s01.rgw0, mon-2s02.rgw0, mon-2s03.rgw0)
>
>   task status:
>
>   data:
>     pools:   9 pools, 425 pgs
>     objects: 1.06G objects, 66 TiB
>     usage:   158 TiB used, 465 TiB / 623 TiB avail
>     pgs:     64150765/6329079237 objects degraded (1.014%)
>              38922319/6329079237 objects misplaced (0.615%)
>              4/1055070542 objects unfound (0.000%)
>              393 active+clean
>              13  active+undersized+remapped+backfill_wait
>              8   active+undersized+degraded+remapped+backfill_wait
>              3   active+clean+scrubbing
>              3   active+undersized+remapped+backfilling
>              2   active+recovery_unfound+undersized+degraded+remapped
>              2   active+remapped+backfill_wait
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   181 MiB/s rd, 9.4 MiB/s wr, 5.38k op/s rd, 2.42k op/s wr
>     recovery: 23 MiB/s, 389 objects/s
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---------------------------------------------------
>
> On 2021. Oct 1., at 1:25, Christian Wuerdig <christian.wuer...@gmail.com> wrote:
>
> That is - one thing you could do is to rate limit PUT requests on your
> haproxy down to a level where your cluster is stable. At least that gives
> you a chance to finish the PG scaling without OSDs dying on you constantly.
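To make that concrete, rate limiting in haproxy could look roughly like the sketch below - the frontend/backend names and the threshold are made up, it limits per client IP rather than globally, and it is untested, so treat it as a starting point only:

    frontend rgw_frontend
        bind *:443
        # track per-client request rate over a 10 second window
        stick-table type ip size 100k expire 60s store http_req_rate(10s)
        http-request track-sc0 src
        # reject PUTs from clients exceeding ~50 requests per 10s with HTTP 429
        http-request deny deny_status 429 if { method PUT } { sc_http_req_rate(0) gt 50 }
        default_backend rgw_backend

Clients that retry on 429 will simply slow down; anything that doesn't will see failed uploads, so it's a blunt instrument, but it gives the cluster breathing room.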
> On Fri, 1 Oct 2021 at 11:56, Christian Wuerdig <christian.wuer...@gmail.com> wrote:
> >
> > Ok, so I guess there are several things coming together that end up making
> > your life a bit miserable at the moment:
> > - PG scaling causing increased IO
> > - Ingesting a large number of objects into RGW causing lots of IOPS
> > - Usual client traffic
> > - The NVME that's being used for WAL/DB has only half the listed
> >   performance in terms of random write IOPS compared to your backing
> >   storage SSD, while it should be the other way around - the WAL/DB device
> >   is supposed to be the faster device. You'd probably be better off
> >   replacing your NVMEs with something like a P20096
> > - A small number of large drives - ceph traditionally scales better with
> >   more but smaller OSDs, especially if you plan on hosting truckloads of
> >   RGW blobs
> >
> > I don't have a good solution - maybe you can stop the pg scaling until the
> > big data load has finished, or arrange a schedule - load data at night and
> > pause during the day to continue the PG scaling. Try and get your hands on
> > a couple of faster NVME drives and replace the WAL/DB drives in one node
> > to see how much of a difference it makes.
> >
> > Also I wouldn't lower the osd memory target if you can afford the RAM -
> > you only have 6 OSDs per server, and with a mem target of 32GB that's
> > 192GB of RAM - so if you have at least 256GB in your servers then I would
> > leave it. It won't help with writes but it should help with reducing read
> > iops - you probably don't want to make your existing problems even bigger
> > by chucking more read IO onto the system due to smaller in-memory buffers.
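If you want to verify what a running OSD is actually using as its target before changing anything, something along these lines should do (a sketch - osd.1 is just an example, and ceph daemon has to be run on the host that carries that OSD):

    # effective value on a running OSD, in bytes
    ceph daemon osd.1 config get osd_memory_target
    # value in the mon config database (only relevant if you move the setting out of ceph.conf)
    ceph config get osd osd_memory_target

The osd_memory_target you have in ceph.conf (31490694621 bytes) is roughly 31.5 GB, so 6 OSDs per host works out to about 189 GB for the OSD daemons alone before any other overhead - in line with the 192GB figure above.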
> > On Thu, 30 Sept 2021 at 21:02, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
> >
> > Hi Christian,
> >
> > Yes, I know very well what spillover is, I've read the leveled compaction
> > document on GitHub multiple times a day over the last couple of days.
> > (Answers to your questions are after the cluster background information.)
> >
> > About the cluster:
> > - users are doing continuous put/head/delete operations
> > - cluster iops: 10-50k read, 5000 write iops
> > - throughput: 142 MiB/s write and 662 MiB/s read
> > - not a containerized deployment, 3 clusters in multisite
> > - 3x mon/mgr/rgw (5 rgw on each mon node, altogether 15 behind a haproxy vip)
> >
> > 7 nodes, each node with the following config:
> > - 1x 1.92TB nvme for the index pool
> > - 6x 15.3 TB osd SAS SSD (hpe VO015360JWZJN read intensive ssd, SKU
> >   P19911-B21 in this document:
> >   https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)
> > - 2x 1.92TB nvme block.db for the 6 ssd (model: HPE KCD6XLUL1T92, SKU
> >   P20131-B21 in this document:
> >   https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)
> > - osd deployed with dmcrypt
> > - data pool is on ec 4:2, the other pools are on the ssds with 3 replicas
> >
> > Config file that we have on all nodes (on the mon nodes it also has the
> > rgw definitions):
> >
> > [global]
> > cluster network = 192.168.199.0/24
> > fsid = 5a07ec50-4eee-4336-aa11-46ca76edcc24
> > mon host = [v2:10.118.199.1:3300,v1:10.118.199.1:6789],[v2:10.118.199.2:3300,v1:10.118.199.2:6789],[v2:10.118.199.3:3300,v1:10.118.199.3:6789]
> > mon initial members = mon-2s01,mon-2s02,mon-2s03
> > osd pool default crush rule = -1
> > public network = 10.118.199.0/24
> > rgw_relaxed_s3_bucket_names = true
> > rgw_dynamic_resharding = false
> > rgw_enable_apis = s3, s3website, swift, swift_auth, admin, sts, iam, pubsub, notifications
> > #rgw_bucket_default_quota_max_objects = 1126400
> >
> > [mon]
> > mon_allow_pool_delete = true
> > mon_pg_warn_max_object_skew = 0
> > mon_osd_nearfull_ratio = 70
> >
> > [osd]
> > osd_max_backfills = 1
> > osd_recovery_max_active = 1
> > osd_recovery_op_priority = 1
> > osd_memory_target = 31490694621
> > # due to osd reboots, the configs below have been added to survive the suicide timeout
> > osd_scrub_during_recovery = true
> > osd_op_thread_suicide_timeout = 3000
> > osd_op_thread_timeout = 120
> >
> > The stability issues I mean:
> > - The pg increase from 32 to 128 on the erasure coded data pool is still
> >   in progress, it hasn't finished (103 currently). The degraded objects
> >   always get stuck when it is almost finished, then at the end an osd dies
> >   and the recovery process starts again.
> > - Compaction is happening all the time, so all the nvme drives are
> >   generating iowait continuously because they are 100% utilized (iowait is
> >   around 1-3). If I try to compact with ceph tell osd.x compact it is
> >   impossible, it never finishes, only with ctrl+c.
> > - At the beginning, when we didn't have so many spilled-over disks, I
> >   didn't mind it - actually I was happy about the spillover because the
> >   underlying ssd can take some load off the nvme - but then the osds
> >   started to reboot and, I'd say, collapse one by one. When I monitor
> >   which osds are collapsing, it is always the ones that have spilled over.
> >   The op thread and suicide timeouts can keep the osds up a bit longer.
> > - Now ALL rgws start to die once one specific osd goes down, and this
> >   causes a total outage. There isn't anything about this in the logs,
> >   neither messages nor the rgw log, the connections just seem to time out.
> >   This is unacceptable from the users' perspective - they need to wait 1.5
> >   hours until my manual compaction finishes and I can start the osd.
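On the manual compaction: one thing that may speed it up is compacting the OSD offline rather than through ceph tell, since it then isn't competing with client IO. A rough sketch (osd.27 as the example; this assumes the OSD's data directory is still activated/mounted after the daemon is stopped, which on a dmcrypt/ceph-volume setup may need a ceph-volume lvm activate first):

    # stop the OSD - noout is already set on your cluster
    systemctl stop ceph-osd@27
    # compact its RocksDB while the daemon is down
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-27 compact
    systemctl start ceph-osd@27

The online equivalent via the admin socket is ceph daemon osd.27 compact, but as you've seen that struggles once the OSD is already saturated.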
> > Current cluster state (ceph -s):
> >
> >   health: HEALTH_ERR
> >           12 OSD(s) experiencing BlueFS spillover
> >           4/1055038256 objects unfound (0.000%)
> >           noout flag(s) set
> >           Possible data damage: 2 pgs recovery_unfound
> >           Degraded data redundancy: 12341016/6328900227 objects degraded (0.195%), 16 pgs degraded, 21 pgs undersized
> >           4 pgs not deep-scrubbed in time
> >
> >   services:
> >     mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
> >     mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
> >     osd: 49 osds: 49 up (since 101m), 49 in (since 4d); 23 remapped pgs
> >          flags noout
> >     rgw: 15 daemons active (mon-2s01.rgw0, mon-2s01.rgw1, mon-2s01.rgw2, mon-2s01.rgw3, mon-2s01.rgw4, mon-2s02.rgw0, mon-2s02.rgw1, mon-2s02.rgw2, mon-2s02.rgw3, mon-2s02.rgw4, mon-2s03.rgw0, mon-2s03.rgw1, mon-2s03.rgw2, mon-2s03.rgw3, mon-2s03.rgw4)
> >
> >   task status:
> >
> >   data:
> >     pools:   9 pools, 425 pgs
> >     objects: 1.06G objects, 67 TiB
> >     usage:   159 TiB used, 465 TiB / 623 TiB avail
> >     pgs:     12032346/6328762449 objects degraded (0.190%)
> >              68127707/6328762449 objects misplaced (1.076%)
> >              4/1055015441 objects unfound (0.000%)
> >              397 active+clean
> >              13  active+undersized+degraded+remapped+backfill_wait
> >              4   active+undersized+remapped+backfill_wait
> >              4   active+clean+scrubbing+deep
> >              2   active+recovery_unfound+undersized+degraded+remapped
> >              2   active+remapped+backfill_wait
> >              1   active+clean+scrubbing
> >              1   active+undersized+remapped+backfilling
> >              1   active+undersized+degraded+remapped+backfilling
> >
> >   io:
> >     client:   256 MiB/s rd, 94 MiB/s wr, 17.70k op/s rd, 2.75k op/s wr
> >     recovery: 16 MiB/s, 223 objects/s
> >
> > Ty
> >
> > -----Original Message-----
> > From: Christian Wuerdig <christian.wuer...@gmail.com>
> > Sent: Thursday, September 30, 2021 1:01 PM
> > To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
> > Cc: Ceph Users <ceph-users@ceph.io>
> > Subject: Re: [ceph-users] osd_memory_target=level0 ?
> >
> > Bluestore memory targets have nothing to do with spillover. It's already
> > been said several times: the spillover warning is simply telling you that
> > instead of writing data to your supposedly fast wal/blockdb device it is
> > now hitting your slow device.
> >
> > You've stated previously that your fast device is nvme and your slow
> > device is SSD. So the spill-over is probably less of a problem than you
> > think. It's currently unclear what your actual problem is and why you
> > think it's related to spill-over.
> >
> > What model are your NVMEs and SSDs - what IOPS can each sustain (4k random
> > write direct IO), and what's their current load? What are the actual
> > problems that you are observing, i.e. what does "stability problems"
> > actually mean?
> >
> > On Thu, 30 Sept 2021 at 18:33, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
> > >
> > > Hi,
> > >
> > > Still suffering with the spilled-over disks and stability issues in 3 of
> > > my clusters after uploading 6-900 million objects to them (Octopus
> > > 15.2.10).
> > >
> > > I've set the memory target to around 31-32GB, so could the spillover
> > > issue be coming from that? With a mem target of 31GB the next level
> > > would be 310 and after that it goes to the underlying ssd disk, so level
> > > 4 doesn't have space on the nvme.
> > >
> > > Let's say I set it to the default 4GB, then levels 0-3 would be 444GB,
> > > so it should fit on the 600GB lvm assigned on the nvme for db with wal.
> > >
> > > This is how it looks like, e.g.
> > > osd.27, even after two rounds of manual compaction, is still spilled
> > > over :(
> > >
> > > osd.1 spilled over 198 GiB metadata from 'db' device (303 GiB used of 596 GiB) to slow device
> > > osd.5 spilled over 251 GiB metadata from 'db' device (163 GiB used of 596 GiB) to slow device
> > > osd.8 spilled over 61 GiB metadata from 'db' device (264 GiB used of 596 GiB) to slow device
> > > osd.11 spilled over 260 GiB metadata from 'db' device (242 GiB used of 596 GiB) to slow device
> > > osd.12 spilled over 149 GiB metadata from 'db' device (238 GiB used of 596 GiB) to slow device
> > > osd.15 spilled over 259 GiB metadata from 'db' device (195 GiB used of 596 GiB) to slow device
> > > osd.17 spilled over 10 GiB metadata from 'db' device (314 GiB used of 596 GiB) to slow device
> > > osd.21 spilled over 324 MiB metadata from 'db' device (346 GiB used of 596 GiB) to slow device
> > > osd.27 spilled over 12 GiB metadata from 'db' device (486 GiB used of 596 GiB) to slow device
> > > osd.29 spilled over 61 GiB metadata from 'db' device (306 GiB used of 596 GiB) to slow device
> > > osd.31 spilled over 59 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device
> > > osd.46 spilled over 69 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device
> > >
> > > Also, is there a way to speed up compaction? It takes 1-1.5 hours per
> > > osd to compact.
> > >
> > > Thank you
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io