OK -- here's the tracker for what I mentioned: https://tracker.ceph.com/issues/55303
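In case it helps while you're collecting those numbers, here's roughly the sequence I had in mind -- just a sketch, with "foo" standing in for each real pool name and the pg_num value taken from your actual `ceph osd pool ls detail` output:

    # stop the autoscaler from planning any further splits/merges on this pool
    ceph osd pool set foo pg_autoscale_mode off

    # note pg_num, pgp_num, pg_num_target, pgp_num_target for the pool
    ceph osd pool ls detail

    # pin the pool at its current PG count so no new merges are queued
    # (256 is a made-up example -- use the value you actually see above)
    ceph osd pool set foo pg_num 256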
On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham <ray.cunning...@keepertech.com> wrote:
>
> Thank you Dan! I will definitely disable the autoscaler on the rest of our pools. I can't get the PG numbers today, but I will try to get them tomorrow. We definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -----Original Message-----
> From: Dan van der Ster <dvand...@gmail.com>
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham <ray.cunning...@keepertech.com>
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea, at least until https://tracker.ceph.com/issues/53729 is fixed. (You are likely not susceptible to that -- but better safe than sorry.)
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current value. This will allow the ongoing merges to complete and prevent further merges from starting. From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, pgp_num_target... If you share the current values of those, we can help advise what you need to set pg_num to in order to pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg autoscaler heuristic. IMHO the autoscaler should estimate the time to carry out a split/merge operation and avoid taking one-way decisions without permission from the administrator. The autoscaler is meant to be helpful, not degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham <ray.cunning...@keepertech.com> wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7, and the resulting rebalancing of misplaced objects is overwhelming the cluster and impacting MON DB compaction, deep scrub repairs, and our upgrading of legacy bluestore OSDs. We have to pause the rebalancing of misplaced objects or we're going to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700-ish, which will take us over 100 days to complete at our current recovery speed. We disabled the autoscaler on our biggest pool, but I'm concerned that it's already on the path to the lower PG count and won't stop adding to our misplaced count after we drop below 5%. What can we do to stop the cluster from finding more misplaced objects to rebalance? Should we set the PG num manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing temporarily so we can deep scrub and repair inconsistencies, upgrade legacy bluestore OSDs, and compact our MON DBs (supposedly MON DBs don't compact when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
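PS -- once pg_num is pinned at the current value, a quick way to confirm the merges are actually paused (again just a sketch; "foo" is a stand-in pool name):

    # pg_num_target should now match the value you set
    ceph osd pool ls detail | grep foo

    # the autoscaler should no longer report a pending PG count change
    ceph osd pool autoscale-status

    # and the misplaced object percentage in the cluster status should only decrease
    ceph -s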