[ceph-users] Re: Balancer vs. Autoscaler
If you look at the current pg_num in that pool ls detail command that Dan mentioned you can set the pool pg_num to what that value currently is, which will effectively pause the pg changes. I did this recently when decreasing the number of pg's in a pool, which took several weeks to complete. This let me get some other maintenance done before setting the pg_num back to the target num again. This works well for reduction, but I'm not sure if it works well for increase as I think the pg_num may reach the target much faster and then just the pgp_num changes till they match. Rich On Wed, 22 Sept 2021 at 23:06, Dan van der Ster wrote: > > To get an idea how much work is left, take a look at `ceph osd pool ls > detail`. There should be pg_num_target... The osds will merge or split PGs > until pg_num matches that value. > > .. Dan > > > On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza, wrote: > > > Hi everyone, > > > > I had the autoscale_mode set to "on" and the autoscaler went to work and > > started adjusting the number of PGs in that pool. Since this implies a > > huge shift in data, the reweights that the balancer had carefully > > adjusted (in crush-compat mode) are now rubbish, and more and more OSDs > > become nearful (we sadly have very different sized OSDs). > > > > Now apparently both manager modules, balancer and pg_autoscaler, have > > the same threshold for operation, namely target_max_misplaced_ratio. So > > the balancer won't become active as long as the pg_autoscaler is still > > adjusting the number of PGs. > > > > I already set the autoscale_mode to "warn" on all pools, but apparently > > the autoscaler is determined to finish what it started. > > > > Is there any way to pause the autoscaler so the balancer has a chance of > > fixing the reweights? Because even in manual mode (ceph balancer > > optimize), the balancer won't compute a plan when the misplaced ratio is > > higher than target_max_misplaced_ratio. > > > > I know about "ceph osd reweight-*", but they adjust the reweights > > (visible in "ceph osd tree"), whereas the balancer adjusts the "compat > > weight-set", which I don't know how to convert back to the old-style > > reweights. > > > > Best regards, > > Jan-Philipp > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
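For anyone wanting the concrete commands, the pause trick Rich describes is only a couple of lines (the pool name and pg_num values below are placeholders, not taken from this thread):

    # check where the split/merge currently stands; *_target shows where the autoscaler is headed
    ceph osd pool ls detail | grep mypool
    # pin pg_num to whatever it currently is, which effectively pauses the PG changes
    ceph osd pool set mypool pg_num 89
    # later, resume by pointing it back at the intended target
    ceph osd pool set mypool pg_num 128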
[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-4b2736a28c ^^ if people want to test and provide feedback for a potential merge to EPEL8 stable. David On Wed, Sep 22, 2021 at 11:43 AM David Orman wrote: > > I'm wondering if this was installed using pip/pypi before, and now > switched to using EPEL? That would explain it - 1.2.1 may never have > been pushed to EPEL. > > David > > On Wed, Sep 22, 2021 at 11:26 AM David Orman wrote: > > > > We'd worked on pushing a change to fix > > https://tracker.ceph.com/issues/50526 for a deadlock in remoto here: > > https://github.com/alfredodeza/remoto/pull/63 > > > > A new version, 1.2.1, was built to help with this. With the Ceph > > release 16.2.6 (at least), we see 1.1.4 is again part of the > > containers. Looking at EPEL8, all that is built now is 1.1.4. We're > > not sure what happened, but would it be possible to get 1.2.1 pushed > > to EPEL8 again, and figure out why it was removed? We'd then need a > > rebuild of the 16.2.6 containers to 'fix' this bug. > > > > This is definitely a high urgency bug, as it impacts any deployments > > with medium to large counts of OSDs or split db/wal devices, like many > > modern deployments. > > > > https://koji.fedoraproject.org/koji/packageinfo?packageID=18747 > > https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/ > > > > Respectfully, > > David Orman ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Why set osd flag to noout during upgrade ?
In addition, from my experience: I often set noout, norebalance and nobackfill before doing maintenance. This greatly speeds up peering (when adding new OSDs) and reduces unnecessary load from all daemons. In particular, if there is heavy client IO going on at the same time, the ceph daemons are much more stable with these settings. I had, after shutting down one host, more OSDs crashing under combined peering+backfill load causing a cascade of even more OSDs crashing. The above settings have prevented such things from happening. As mentioned before, it also avoids unnecessary rebuilds of objects that are not even modified during the service window. Having an OSD down even for 30 minutes usually requires only a few seconds to minutes to catch up with the latest diffs of modified objects instead of starting a full rebuild of all objects regardless of their modification state. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Etienne Menguy Sent: 22 September 2021 12:17:39 To: ceph-users Subject: [ceph-users] Re: Why set osd flag to noout during upgrade ? Hello, >From my experience, I see three reasons : - You don’t want to recover data if you already have them on a down OSD, rebalancing can have a big impact on performance - If upgrade/maintenance goes wrong you will want to focus on this issue and not have to deal with things done by Ceph meanwhile. - During an upgrade you have an ‘unusual’ cluster with different versions. It’s supposed to work, but you probably want to keep it ‘boring’. - Etienne Menguy etienne.men...@croit.io > On 22 Sep 2021, at 11:51, Francois Legrand wrote: > > Hello everybody, > > I have a "stupid" question. Why is it recommended in the docs to set the osd > flag to noout during an upgrade/maintainance (and especially during an osd > upgrade/maintainance) ? > > In my understanding, if an osd goes down, after a while (600s by default) > it's marked out and the cluster will start to rebuild it's content elsewhere > in the cluster to maintain the redondancy of the datas. This generate some > transfer and load on other osds, but that's not a big deal ! > > As soon as the osd is back, it's marked in again and ceph is able to > determine which data is back and stop the recovery to reuse the unchanged > datas which are back. Generally, the recovery is as fast as with noout flag > (because with noout, the data modified during the down period still have be > copied to the back osd). > > Thus is there an other reason apart from limiting the transfer and others > osds load durind the downtime ? > > F > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
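For reference, the flags Frank mentions are plain cluster-wide flags, so a maintenance window is roughly this sketch (not a full procedure):

    ceph osd set noout
    ceph osd set norebalance
    ceph osd set nobackfill
    # ... do the maintenance / reboot the host ...
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph osd unset noout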
[ceph-users] Re: Why set osd flag to noout during upgrade ?
Indeed. In a large enough cluster, even a few minutes of extra backfill/recovery per OSD adds up. Say you have 100 OSD nodes, and just 3 minutes of unnecessary backfill per. That prolongs your upgrade by 5 hours. > Yeah you don't want to deal with backfilling while the cluster is > upgrading. At best it can delay the upgrade, at worst mixed version > backfilling has (rarely) caused issues in the past. > > We additionally `set noin` and disable the balancer: `ceph balancer off`. > The former prevents broken osds from re-entering the cluster, and both of > these similarly prevent backfilling from starting mid-upgrade. > > > .. Dan > > > On Wed, 22 Sep 2021, 12:18 Etienne Menguy, wrote: > >> Hello, >> >> From my experience, I see three reasons : >> - You don’t want to recover data if you already have them on a down OSD, >> rebalancing can have a big impact on performance >> - If upgrade/maintenance goes wrong you will want to focus on this issue >> and not have to deal with things done by Ceph meanwhile. >> - During an upgrade you have an ‘unusual’ cluster with different versions. >> It’s supposed to work, but you probably want to keep it ‘boring’. >> >> - >> Etienne Menguy >> etienne.men...@croit.io >> >> >> >> >>> On 22 Sep 2021, at 11:51, Francois Legrand wrote: >>> >>> Hello everybody, >>> >>> I have a "stupid" question. Why is it recommended in the docs to set the >> osd flag to noout during an upgrade/maintainance (and especially during an >> osd upgrade/maintainance) ? >>> >>> In my understanding, if an osd goes down, after a while (600s by >> default) it's marked out and the cluster will start to rebuild it's content >> elsewhere in the cluster to maintain the redondancy of the datas. This >> generate some transfer and load on other osds, but that's not a big deal ! >>> >>> As soon as the osd is back, it's marked in again and ceph is able to >> determine which data is back and stop the recovery to reuse the unchanged >> datas which are back. Generally, the recovery is as fast as with noout flag >> (because with noout, the data modified during the down period still have be >> copied to the back osd). >>> >>> Thus is there an other reason apart from limiting the transfer and >> others osds load durind the downtime ? >>> >>> F >>> >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io >>> To unsubscribe send an email to ceph-users-le...@ceph.io >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
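The extra steps Dan describes boil down to the following (again just a sketch, alongside the usual noout):

    ceph osd set noin       # a restarted-but-broken OSD won't be marked back in automatically
    ceph balancer off       # no new optimizations, so no balancer-driven backfill starts mid-upgrade
    # ... upgrade ...
    ceph balancer on
    ceph osd unset noin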
[ceph-users] One PG keeps going inconsistent (stat mismatch)
Hi All, I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like: 2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff scrub starts 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors It always repairs ok when I run ceph pg repair 1.3ff: 2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff repair starts 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed It's happened multiple times and always with the same PG number, no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem but in a bid to make sure I reweighted the primary OSD for this PG to 0 to get it to move to another disk. The backfilling is complete but on manually scrubbing the PG again it showed inconsistent as above. In case it's relevant the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster, prior to this it had been up without issue for well over a year. The primary OSD for this PG was on the first new OSD I added when this issue first presented. The inconsistent PG issue didn't start happening immediately after adding it though, it was some weeks later. Any suggestions as to how I can get rid of this problem? Should I try reweighting the other two OSDs for this PG to 0? Or is this a known bug that requires some specific work or just an upgrade? Thanks, Simon. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
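Not a fix, but for anyone debugging the same symptom, the scrub findings can be inspected before repairing (the PG id is taken from Simon's log):

    # per-object detail from the last deep scrub; for a pure stat mismatch this may well come back empty
    rados list-inconsistent-obj 1.3ff --format=json-pretty
    # which PGs are currently flagged inconsistent
    ceph health detail
    # the repair Simon already runs
    ceph pg repair 1.3ff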
[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers
I'm wondering if this was installed using pip/pypi before, and now switched to using EPEL? That would explain it - 1.2.1 may never have been pushed to EPEL. David On Wed, Sep 22, 2021 at 11:26 AM David Orman wrote: > > We'd worked on pushing a change to fix > https://tracker.ceph.com/issues/50526 for a deadlock in remoto here: > https://github.com/alfredodeza/remoto/pull/63 > > A new version, 1.2.1, was built to help with this. With the Ceph > release 16.2.6 (at least), we see 1.1.4 is again part of the > containers. Looking at EPEL8, all that is built now is 1.1.4. We're > not sure what happened, but would it be possible to get 1.2.1 pushed > to EPEL8 again, and figure out why it was removed? We'd then need a > rebuild of the 16.2.6 containers to 'fix' this bug. > > This is definitely a high urgency bug, as it impacts any deployments > with medium to large counts of OSDs or split db/wal devices, like many > modern deployments. > > https://koji.fedoraproject.org/koji/packageinfo?packageID=18747 > https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/ > > Respectfully, > David Orman ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Remoto 1.1.4 in Ceph 16.2.6 containers
We'd worked on pushing a change to fix https://tracker.ceph.com/issues/50526 for a deadlock in remoto here: https://github.com/alfredodeza/remoto/pull/63 A new version, 1.2.1, was built to help with this. With the Ceph release 16.2.6 (at least), we see 1.1.4 is again part of the containers. Looking at EPEL8, all that is built now is 1.1.4. We're not sure what happened, but would it be possible to get 1.2.1 pushed to EPEL8 again, and figure out why it was removed? We'd then need a rebuild of the 16.2.6 containers to 'fix' this bug. This is definitely a high urgency bug, as it impacts any deployments with medium to large counts of OSDs or split db/wal devices, like many modern deployments. https://koji.fedoraproject.org/koji/packageinfo?packageID=18747 https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/ Respectfully, David Orman ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
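If it helps anyone verify what their image actually ships, something like the following should work (the image path and the EPEL package name python3-remoto are assumptions on my part):

    # query the RPM database inside the 16.2.6 image without starting a daemon
    podman run --rm --entrypoint rpm quay.io/ceph/ceph:v16.2.6 -q python3-remoto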
[ceph-users] "Remaining time" under-estimates by 100x....
Is there a way to re-calibrate the various 'global recovery event' and related 'remaining time' estimators? For the last three days I've been assured that a 19h event will be over in under 3 hours... Previously I think Microsoft held the record for the most incorrect 'please wait' progress indicators. Ceph may take that crown this year, unless... Thanks Harry ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] IO500 SC’21 Call for Submission
Stabilization period: Friday, 17th September - Friday, 1st October Submission deadline: Monday, 1st November 2021 AoE The IO500 [1] is now accepting and encouraging submissions for the upcoming 9th semi-annual IO500 list, in conjunction with SC'21. Once again, we are also accepting submissions to the 10 Node Challenge to encourage the submission of small-scale results. The new ranked lists will be announced via live-stream at a virtual session during "The IO500 and the Virtual Institute of I/O" BoF [3]. We hope to see many new results. What's New Since ISC21, the IO500 follows a two-staged approach. First, there will be a two-week stabilization period during which we encourage the community to verify that the benchmark runs properly on a variety of storage systems. During this period the benchmark may be updated based upon feedback from the community. The final benchmark will then be released. We expect that submissions compliant with the rules made during the stabilization period will be valid as a final submission unless a significant defect is found. We are now creating a more detailed schema to describe the hardware and software of the system under test and provide the first set of tools to ease capturing of this information for inclusion with the submission. Further details will be released on the submission page [2]. We are evaluating the inclusion of optional test phases for additional key workloads - split easy/hard find phases, 4KB and 1MB random read/write phases, and concurrent metadata operations. This is called an extended run. At the moment, we collect the information to verify that additional phases do not significantly impact the results of the standard IO500 run. We encourage every participant to submit results from both a standard run and an extended run to facilitate comparisons between the existing and new benchmark phases. In a future release, we may include some or all of these results as part of the standard benchmark. The extended results are not currently included in the scoring of any ranked list. Background The benchmark suite is designed to be easy to run and the community has multiple active support channels to help with any questions. Please note that submissions of all sizes are welcome; the site has customizable sorting, so it is possible to submit on a small system and still get a very good per-client score, for example. Additionally, the list is about much more than just the raw rank; all submissions help the community by collecting and publishing a wider corpus of data. More details below. Following the success of the Top500 in collecting and analyzing historical trends in supercomputer technology and evolution, the IO500 was created in 2017, published its first list at SC17, and has grown continually since then. The need for such an initiative has long been known within High-Performance Computing; however, defining appropriate benchmarks has long been challenging. Despite this challenge, the community, after long and spirited discussion, finally reached a consensus on a suite of benchmarks and a metric for resolving the scores into a single ranking. The multi-fold goals of the benchmark suite are as follows: Maximizing simplicity in running the benchmark suite Encouraging optimization and documentation of tuning parameters for performance Allowing submitters to highlight their "hero run" performance numbers Forcing submitters to simultaneously report performance for challenging IO patterns. 
Specifically, the benchmark suite includes a hero-run of both IOR and MDTest configured, however, possible to maximize performance and establish an upper-bound for performance. It also includes an IOR and MDTest run with highly constrained parameters forcing a difficult usage pattern in an attempt to determine a lower-bound. Finally, it includes a namespace search as this has been determined to be a highly sought-after feature in HPC storage systems that has historically not been well-measured. Submitters are encouraged to share their tuning insights for publication. The goals of the community are also multi-fold: Gather historical data for the sake of analysis and to aid predictions of storage futures Collect tuning data to share valuable performance optimizations across the community Encourage vendors and designers to optimize for workloads beyond "hero runs" Establish bounded expectations for users, procurers, and administrators 10 Node I/O Challenge The 10 Node Challenge is conducted using the regular IO500 benchmark, however, with the rule that exactly 10 client nodes must be used to run the benchmark. You may use any shared storage with any number of servers. When submitting for the IO500 list, you can opt-in for "Participate in the 10 compute node challenge only", then we will not include the results in the ranked list. Other 10-node node submissions will be included in the full list and in the ranke
[ceph-users] Re: Modify pgp number after pg_num increased
That's been already increased to 4. Istvan Szabo Senior Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com --- -Original Message- From: Eugen Block Sent: Wednesday, September 22, 2021 2:51 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: Modify pgp number after pg_num increased Email received from the internet. If in doubt, don't click any link nor open any attachment ! Hi, IIRC in a different thread you pasted your max-backfill config and it was the lowest possible value (1), right? That's why your backfill is slow. Zitat von "Szabo, Istvan (Agoda)" : > Hi, > > By default in the newer versions of ceph when you increase the pg_num > the cluster will start to increase the pgp_num slowly up to the value > of the pg_num. > I've increased the ec-code data pool from 32 to 128 but 1 node has > been added to the cluster and it's very slow. > > pool 28 'hkg.rgw.buckets.data' erasure profile data-ec size 6 min_size > 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 55 > pgp_num_target 128 autoscale_mode warn last_change 16443 lfor > 0/0/14828 fl > ags hashpspool stripe_width 16384 application rgw > > At the moment there has been done 55 out of the 128 pg. > Is it safe to set the pgp_num at this stage to 64 and wait until the > data will be rebalanced to the newly added node? > > Thank you > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
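For anyone following along, the moving parts in this thread are roughly these (pool name taken from the thread, the numbers are examples only):

    # watch pgp_num step toward pgp_num_target
    ceph osd pool ls detail | grep hkg.rgw.buckets.data
    # allow more concurrent backfills per OSD (the default in recent releases is 1)
    ceph config set osd osd_max_backfills 4
    # the manual bump Istvan is asking about
    ceph osd pool set hkg.rgw.buckets.data pgp_num 64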
[ceph-users] Re: Change max backfills
God damn...you are absolutely right - my bad. Sorry and thanks for that... -Original Message- From: Etienne Menguy Sent: Wednesday, September 22, 2021 15:48 To: ceph-users@ceph.io Subject: [ceph-users] Re: Change max backfills Hi, In the past you had this output if value was not changing, try with another value. I don’t know if things changed with latest Ceph version. - Etienne Menguy etienne.men...@croit.io > On 22 Sep 2021, at 15:34, Pascal Weißhaupt > wrote: > > Hi, > > > > I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max > backfills via > > > > ceph tell 'osd.*' injectargs '--osd-max-backfills 1' > > > > I get no output: > > > > root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1' > osd.0: {} > osd.1: {} > osd.2: {} > osd.3: {} > osd.4: {} > osd.5: {} > osd.6: {} > osd.7: {} > osd.8: {} > osd.9: {} > osd.10: {} > osd.11: {} > osd.12: {} > osd.13: {} > osd.14: {} > osd.15: {} > osd.16: {} > osd.17: {} > osd.18: {} > osd.19: {} > > > > If I remember correctly, with Ceph 15 I got something like "changed max > backfills to 1" or so. > > > > Is that command not supported anymore or is the empty output correct? > > > > Regards, > > Pascal > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
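For what it's worth, on Octopus/Pacific the centralized config store is usually a less ambiguous way to do this than injectargs (a sketch):

    ceph config set osd osd_max_backfills 1        # persists in the mon config database for all OSDs
    ceph config show osd.0 osd_max_backfills       # what one daemon is actually running with
    ceph tell osd.0 config get osd_max_backfills   # or ask the daemon directly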
[ceph-users] Re: Change max backfills
Hi, In the past you had this output if value was not changing, try with another value. I don’t know if things changed with latest Ceph version. - Etienne Menguy etienne.men...@croit.io > On 22 Sep 2021, at 15:34, Pascal Weißhaupt > wrote: > > Hi, > > > > I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max > backfills via > > > > ceph tell 'osd.*' injectargs '--osd-max-backfills 1' > > > > I get no output: > > > > root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1' > osd.0: {} > osd.1: {} > osd.2: {} > osd.3: {} > osd.4: {} > osd.5: {} > osd.6: {} > osd.7: {} > osd.8: {} > osd.9: {} > osd.10: {} > osd.11: {} > osd.12: {} > osd.13: {} > osd.14: {} > osd.15: {} > osd.16: {} > osd.17: {} > osd.18: {} > osd.19: {} > > > > If I remember correctly, with Ceph 15 I got something like "changed max > backfills to 1" or so. > > > > Is that command not supported anymore or is the empty output correct? > > > > Regards, > > Pascal > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Change max backfills
Hi, I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max backfills via ceph tell 'osd.*' injectargs '--osd-max-backfills 1' I get no output: root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1' osd.0: {} osd.1: {} osd.2: {} osd.3: {} osd.4: {} osd.5: {} osd.6: {} osd.7: {} osd.8: {} osd.9: {} osd.10: {} osd.11: {} osd.12: {} osd.13: {} osd.14: {} osd.15: {} osd.16: {} osd.17: {} osd.18: {} osd.19: {} If I remember correctly, with Ceph 15 I got something like "changed max backfills to 1" or so. Is that command not supported anymore or is the empty output correct? Regards, Pascal ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14
Hi Dan, This is excellent to hear - we've also been a bit hesitant to upgrade from Nautilus (which has been working so well for us). One question: did you/would you consider upgrading straight to Pacific from Nautilus? Can you share your thoughts that lead you to Octopus first? Thanks, Andras On 9/21/21 06:09, Dan van der Ster wrote: Dear friends, This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14, successfully, following the procedure at https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus It's a 400TB cluster which is 10% used with 72 osds (block=hdd, block.db=ssd) and 40M objects. * The mons upgraded cleanly as expected. * One minor surprise was that the mgrs respawned themselves moments after the leader restarted into octopus: 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0) 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 in quorum (ranks 0,1,2) 2021-09-21 10:16:39.046 7fae3caf8700 1 mgr handle_mgr_map respawning because set of enabled modules changed! This didn't create any problems AFAICT. * The osds performed the expected fsck after restarting. Their logs are spammed with things like 2021-09-21T11:15:23.233+0200 7f85901bd700 -1 bluestore(/var/lib/ceph/osd/ceph-1) fsck warning: #174:1e024a6e:::10009663a55.:head# has omap that is not per-pool or pgmeta but that is fully expected AFAIU. Each osd took just under 10 minutes to fsck: 2021-09-21T11:22:27.188+0200 7f85a3a2bf00 1 bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0 errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056 seconds For reference, this cluster was created many major releases ago (maybe firefly) but osds were probably re-created in luminous. The memory usage was quite normal, we didn't suffer any OOMs. * The active mds restarted into octopus without incident. In summary it was a very smooth upgrade. After a week of observation we'll proceed with more production clusters. For our largest S3 cluster with slow hdds, we expect huge fsck transactions, so will wait for https://github.com/ceph/ceph/pull/42958 to be merged before upgrading. Best Regards, and thanks to all the devs for their work, Dan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Modify pgp number after pg_num increased
Hi, By default in the newer versions of ceph when you increase the pg_num the cluster will start to increase the pgp_num slowly up to the value of the pg_num. I've increased the ec-code data pool from 32 to 128 but 1 node has been added to the cluster and it's very slow. pool 28 'hkg.rgw.buckets.data' erasure profile data-ec size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 55 pgp_num_target 128 autoscale_mode warn last_change 16443 lfor 0/0/14828 fl ags hashpspool stripe_width 16384 application rgw At the moment there has been done 55 out of the 128 pg. Is it safe to set the pgp_num at this stage to 64 and wait until the data will be rebalanced to the newly added node? Thank you ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancer vs. Autoscaler
To get an idea how much work is left, take a look at `ceph osd pool ls detail`. There should be pg_num_target... The osds will merge or split PGs until pg_num matches that value. .. Dan On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza, wrote: > Hi everyone, > > I had the autoscale_mode set to "on" and the autoscaler went to work and > started adjusting the number of PGs in that pool. Since this implies a > huge shift in data, the reweights that the balancer had carefully > adjusted (in crush-compat mode) are now rubbish, and more and more OSDs > become nearful (we sadly have very different sized OSDs). > > Now apparently both manager modules, balancer and pg_autoscaler, have > the same threshold for operation, namely target_max_misplaced_ratio. So > the balancer won't become active as long as the pg_autoscaler is still > adjusting the number of PGs. > > I already set the autoscale_mode to "warn" on all pools, but apparently > the autoscaler is determined to finish what it started. > > Is there any way to pause the autoscaler so the balancer has a chance of > fixing the reweights? Because even in manual mode (ceph balancer > optimize), the balancer won't compute a plan when the misplaced ratio is > higher than target_max_misplaced_ratio. > > I know about "ceph osd reweight-*", but they adjust the reweights > (visible in "ceph osd tree"), whereas the balancer adjusts the "compat > weight-set", which I don't know how to convert back to the old-style > reweights. > > Best regards, > Jan-Philipp > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?
On 21.09.2021 09:11, Kobi Ginon wrote:
> for sure the balancer affects the status
Of course, but setting several PGs to degraded is something else.
> i doubt that your customers will be writing so many objects at the same rate as the test.
I only need 2 hosts running rados bench to get several PGs into a degraded state.
> maybe you need to play with the balancer configuration a bit.
Maybe, but a balancer should not set the cluster health to warning with several PGs in a degraded state. It should be possible to do this cleanly: copy the data and delete the source when the copy is OK.
> Could start with this: The balancer mode can be changed to crush-compat mode, which is backward compatible with older clients, and will make small changes to the data distribution over time to ensure that OSDs are equally utilized. https://docs.ceph.com/en/latest/rados/operations/balancer/
I will probably just turn it off before I set the cluster in production.
> side note: i'm indeed using an old version of ceph (nautilus) + balancer configured and running rados benchmarks, but did not see such a problem. on the other hand i'm not using pg_autoscaler, i set the pools' PG numbers in advance according to an assumption of the percentage each pool will be using. Could be that you do use this mode and the combination of autoscaler and balancer is what reveals this issue.
If you look at my initial post you will see that the pool is created with --autoscale-mode=off. The cluster is running 16.2.5 and is empty except for one pool with one PG created by Cephadm.
-- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
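For completeness, the balancer commands being discussed are all standard:

    ceph balancer status
    ceph balancer mode crush-compat   # or: ceph balancer mode upmap
    ceph balancer off                 # what Kai is considering before going to production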
[ceph-users] Re: Why set osd flag to noout during upgrade ?
Yeah you don't want to deal with backfilling while the cluster is upgrading. At best it can delay the upgrade, at worst mixed version backfilling has (rarely) caused issues in the past. We additionally `set noin` and disable the balancer: `ceph balancer off`. The former prevents broken osds from re-entering the cluster, and both of these similarly prevent backfilling from starting mid-upgrade. .. Dan On Wed, 22 Sep 2021, 12:18 Etienne Menguy, wrote: > Hello, > > From my experience, I see three reasons : > - You don’t want to recover data if you already have them on a down OSD, > rebalancing can have a big impact on performance > - If upgrade/maintenance goes wrong you will want to focus on this issue > and not have to deal with things done by Ceph meanwhile. > - During an upgrade you have an ‘unusual’ cluster with different versions. > It’s supposed to work, but you probably want to keep it ‘boring’. > > - > Etienne Menguy > etienne.men...@croit.io > > > > > > On 22 Sep 2021, at 11:51, Francois Legrand wrote: > > > > Hello everybody, > > > > I have a "stupid" question. Why is it recommended in the docs to set the > osd flag to noout during an upgrade/maintainance (and especially during an > osd upgrade/maintainance) ? > > > > In my understanding, if an osd goes down, after a while (600s by > default) it's marked out and the cluster will start to rebuild it's content > elsewhere in the cluster to maintain the redondancy of the datas. This > generate some transfer and load on other osds, but that's not a big deal ! > > > > As soon as the osd is back, it's marked in again and ceph is able to > determine which data is back and stop the recovery to reuse the unchanged > datas which are back. Generally, the recovery is as fast as with noout flag > (because with noout, the data modified during the down period still have be > copied to the back osd). > > > > Thus is there an other reason apart from limiting the transfer and > others osds load durind the downtime ? > > > > F > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Why set osd flag to noout during upgrade ?
Hello, From my experience, I see three reasons : - You don’t want to recover data if you already have them on a down OSD, rebalancing can have a big impact on performance - If upgrade/maintenance goes wrong you will want to focus on this issue and not have to deal with things done by Ceph meanwhile. - During an upgrade you have an ‘unusual’ cluster with different versions. It’s supposed to work, but you probably want to keep it ‘boring’. - Etienne Menguy etienne.men...@croit.io > On 22 Sep 2021, at 11:51, Francois Legrand wrote: > > Hello everybody, > > I have a "stupid" question. Why is it recommended in the docs to set the osd > flag to noout during an upgrade/maintainance (and especially during an osd > upgrade/maintainance) ? > > In my understanding, if an osd goes down, after a while (600s by default) > it's marked out and the cluster will start to rebuild it's content elsewhere > in the cluster to maintain the redondancy of the datas. This generate some > transfer and load on other osds, but that's not a big deal ! > > As soon as the osd is back, it's marked in again and ceph is able to > determine which data is back and stop the recovery to reuse the unchanged > datas which are back. Generally, the recovery is as fast as with noout flag > (because with noout, the data modified during the down period still have be > copied to the back osd). > > Thus is there an other reason apart from limiting the transfer and others > osds load durind the downtime ? > > F > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14
I understand, thanks for sharing! Zitat von Dan van der Ster : Hi Eugen, All of our prod clusters are still old school rpm packages managed by our private puppet manifests. Even our newest pacific pre-prod cluster is still managed like that. We have a side project to test and move to cephadm / containers but that is still a WIP. (Our situation is complicated by the fact that we'll need to continue puppet managing things like firewall with cephadm doing the daemon placement). Cheers, Dan On Wed, Sep 22, 2021 at 10:32 AM Eugen Block wrote: Thanks for the summary, Dan! I'm still hesitating upgrading our production environment from N to O, your experience sounds reassuring though. I have one question, did you also switch to cephadm and containerize all daemons? We haven't made a decision yet, but I guess at some point we'll have to switch anyway, so we could also just get over it. :-D We'll need to discuss it with the team... Thanks, Eugen Zitat von Dan van der Ster : > Dear friends, > > This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14, > successfully, following the procedure at > https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus > It's a 400TB cluster which is 10% used with 72 osds (block=hdd, > block.db=ssd) and 40M objects. > > * The mons upgraded cleanly as expected. > * One minor surprise was that the mgrs respawned themselves moments > after the leader restarted into octopus: > > 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0) > 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons > cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 > in quorum (ranks 0,1,2) > > 2021-09-21 10:16:39.046 7fae3caf8700 1 mgr handle_mgr_map respawning > because set of enabled modules changed! > > This didn't create any problems AFAICT. > > * The osds performed the expected fsck after restarting. Their logs > are spammed with things like > > 2021-09-21T11:15:23.233+0200 7f85901bd700 -1 > bluestore(/var/lib/ceph/osd/ceph-1) fsck warning: > #174:1e024a6e:::10009663a55.:head# has omap that is not > per-pool or pgmeta > > but that is fully expected AFAIU. Each osd took just under 10 > minutes to fsck: > > 2021-09-21T11:22:27.188+0200 7f85a3a2bf00 1 > bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0 > errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056 > seconds > > For reference, this cluster was created many major releases ago (maybe > firefly) but osds were probably re-created in luminous. > The memory usage was quite normal, we didn't suffer any OOMs. > > * The active mds restarted into octopus without incident. > > In summary it was a very smooth upgrade. After a week of observation > we'll proceed with more production clusters. > For our largest S3 cluster with slow hdds, we expect huge fsck > transactions, so will wait for https://github.com/ceph/ceph/pull/42958 > to be merged before upgrading. > > Best Regards, and thanks to all the devs for their work, > > Dan > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] High overwrite latency
Hi, We do run several Ceph clusters, but one has a strange problem. It is running Octopus 15.2.14 on 9 servers (HP 360 Gen 8, 64 GB, 10 Gbps), 48 OSDs (all 2 TB Samsung SSDs with Bluestore). Monitoring in Grafana shows these three latency values over 7 days:
ceph_osd_op_r_latency_sum: avg 1.16 ms, max 9.95 ms
ceph_osd_op_w_latency_sum: avg 5.85 ms, max 26.2 ms
ceph_osd_op_rw_latency_sum: avg 110 ms, max 388 ms
Average throughput is around 30 MB/sec read and 40 MB/sec write, both with 2000 IOPS. On another cluster (hardware almost the same, identical software versions) with 25% lower load, the values are:
ceph_osd_op_r_latency_sum: avg 1.09 ms, max 6.55 ms
ceph_osd_op_w_latency_sum: avg 4.46 ms, max 14.4 ms
ceph_osd_op_rw_latency_sum: avg 4.94 ms, max 17.6 ms
I can't find any difference in HBA controller settings, network or kernel tuning. Has anyone got any ideas? Regards, Erwin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
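No answer here, but when comparing two clusters like this it can help to pull the latency counters straight from the daemons (osd.12 is just a placeholder; the daemon command has to run on the host carrying that OSD):

    # quick commit/apply latency overview per OSD
    ceph osd perf
    # avgcount/sum/avgtime for read-modify-write ops on one daemon
    ceph daemon osd.12 perf dump | grep -A 3 op_rw_latency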
[ceph-users] Why set osd flag to noout during upgrade ?
Hello everybody, I have a "stupid" question. Why is it recommended in the docs to set the osd flag to noout during an upgrade/maintenance (and especially during an OSD upgrade/maintenance)? In my understanding, if an OSD goes down, after a while (600s by default) it's marked out and the cluster will start to rebuild its content elsewhere in the cluster to maintain the redundancy of the data. This generates some transfer and load on other OSDs, but that's not a big deal! As soon as the OSD is back, it's marked in again and Ceph is able to determine which data is back and stop the recovery to reuse the unchanged data that is already there. Generally, the recovery is as fast as with the noout flag (because with noout, the data modified during the down period still has to be copied to the returning OSD). So is there any other reason apart from limiting the transfer and the load on other OSDs during the downtime? F ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14
Hi Andras, I'm not aware of any showstoppers to move directly to pacific. Indeed we already run pacific on a new cluster we built for our users to try cephfs snapshots at scale. That cluster was created with octopus a few months ago then upgraded to pacific at 16.2.4 to take advantage of the stray dentry splitting. Why octopus and not pacific directly for the existing bulk of our prod clusters? we're just being conservative, especially for what concerns the fsck omap upgrade on all the osds. Since it went well for this cluster, I expect it will similarly go well for the other rbd and cephfs clusters. We'll tread more carefully for the S3 clusters, but with the PR mentioned earlier I expect it to go well. My expectation is that we'll only run octopus for a short while before we move to pacific in one of the next point releases there. Before octopus we usually haven't moved our most critical clusters to the next major release until around ~.8 -- it's usually by then that all major issues have been flushed out, AFAICT. Cheers, Dan On Wed, Sep 22, 2021 at 11:19 AM Andras Pataki wrote: > > Hi Dan, > > This is excellent to hear - we've also been a bit hesitant to upgrade > from Nautilus (which has been working so well for us). One question: > did you/would you consider upgrading straight to Pacific from Nautilus? > Can you share your thoughts that lead you to Octopus first? > > Thanks, > > Andras > > > On 9/21/21 06:09, Dan van der Ster wrote: > > Dear friends, > > > > This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14, > > successfully, following the procedure at > > https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus > > It's a 400TB cluster which is 10% used with 72 osds (block=hdd, > > block.db=ssd) and 40M objects. > > > > * The mons upgraded cleanly as expected. > > * One minor surprise was that the mgrs respawned themselves moments > > after the leader restarted into octopus: > > > > 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0) > > 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons > > cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 > > in quorum (ranks 0,1,2) > > > > 2021-09-21 10:16:39.046 7fae3caf8700 1 mgr handle_mgr_map respawning > > because set of enabled modules changed! > > > > This didn't create any problems AFAICT. > > > > * The osds performed the expected fsck after restarting. Their logs > > are spammed with things like > > > > 2021-09-21T11:15:23.233+0200 7f85901bd700 -1 > > bluestore(/var/lib/ceph/osd/ceph-1) fsck warning: > > #174:1e024a6e:::10009663a55.:head# has omap that is not > > per-pool or pgmeta > > > > but that is fully expected AFAIU. Each osd took just under 10 minutes to > > fsck: > > > > 2021-09-21T11:22:27.188+0200 7f85a3a2bf00 1 > > bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0 > > errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056 > > seconds > > > > For reference, this cluster was created many major releases ago (maybe > > firefly) but osds were probably re-created in luminous. > > The memory usage was quite normal, we didn't suffer any OOMs. > > > > * The active mds restarted into octopus without incident. > > > > In summary it was a very smooth upgrade. After a week of observation > > we'll proceed with more production clusters. 
> > For our largest S3 cluster with slow hdds, we expect huge fsck > > transactions, so will wait for https://github.com/ceph/ceph/pull/42958 > > to be merged before upgrading. > > > > Best Regards, and thanks to all the devs for their work, > > > > Dan > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
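One hedged aside on the fsck point: the on-mount omap conversion can also be deferred and run per OSD at a quieter time, roughly like this (option and tool behaviour as I understand them for Octopus/Pacific; check the release notes for the version you land on):

    # skip the automatic quick-fix/omap conversion at first start on the new release
    ceph config set osd bluestore_fsck_quick_fix_on_mount false
    # later, repair/convert each OSD offline (NN is a placeholder for the OSD id)
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-NN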
[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14
Hi Eugen, All of our prod clusters are still old school rpm packages managed by our private puppet manifests. Even our newest pacific pre-prod cluster is still managed like that. We have a side project to test and move to cephadm / containers but that is still a WIP. (Our situation is complicated by the fact that we'll need to continue puppet managing things like firewall with cephadm doing the daemon placement). Cheers, Dan On Wed, Sep 22, 2021 at 10:32 AM Eugen Block wrote: > > Thanks for the summary, Dan! > > I'm still hesitating upgrading our production environment from N to O, > your experience sounds reassuring though. I have one question, did you > also switch to cephadm and containerize all daemons? We haven't made a > decision yet, but I guess at some point we'll have to switch anyway, > so we could also just get over it. :-D We'll need to discuss it with > the team... > > Thanks, > Eugen > > > Zitat von Dan van der Ster : > > > Dear friends, > > > > This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14, > > successfully, following the procedure at > > https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus > > It's a 400TB cluster which is 10% used with 72 osds (block=hdd, > > block.db=ssd) and 40M objects. > > > > * The mons upgraded cleanly as expected. > > * One minor surprise was that the mgrs respawned themselves moments > > after the leader restarted into octopus: > > > > 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0) > > 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons > > cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 > > in quorum (ranks 0,1,2) > > > > 2021-09-21 10:16:39.046 7fae3caf8700 1 mgr handle_mgr_map respawning > > because set of enabled modules changed! > > > > This didn't create any problems AFAICT. > > > > * The osds performed the expected fsck after restarting. Their logs > > are spammed with things like > > > > 2021-09-21T11:15:23.233+0200 7f85901bd700 -1 > > bluestore(/var/lib/ceph/osd/ceph-1) fsck warning: > > #174:1e024a6e:::10009663a55.:head# has omap that is not > > per-pool or pgmeta > > > > but that is fully expected AFAIU. Each osd took just under 10 > > minutes to fsck: > > > > 2021-09-21T11:22:27.188+0200 7f85a3a2bf00 1 > > bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0 > > errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056 > > seconds > > > > For reference, this cluster was created many major releases ago (maybe > > firefly) but osds were probably re-created in luminous. > > The memory usage was quite normal, we didn't suffer any OOMs. > > > > * The active mds restarted into octopus without incident. > > > > In summary it was a very smooth upgrade. After a week of observation > > we'll proceed with more production clusters. > > For our largest S3 cluster with slow hdds, we expect huge fsck > > transactions, so will wait for https://github.com/ceph/ceph/pull/42958 > > to be merged before upgrading. > > > > Best Regards, and thanks to all the devs for their work, > > > > Dan > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Balancer vs. Autoscaler
Hi everyone, I had the autoscale_mode set to "on" and the autoscaler went to work and started adjusting the number of PGs in that pool. Since this implies a huge shift in data, the reweights that the balancer had carefully adjusted (in crush-compat mode) are now rubbish, and more and more OSDs become nearful (we sadly have very different sized OSDs). Now apparently both manager modules, balancer and pg_autoscaler, have the same threshold for operation, namely target_max_misplaced_ratio. So the balancer won't become active as long as the pg_autoscaler is still adjusting the number of PGs. I already set the autoscale_mode to "warn" on all pools, but apparently the autoscaler is determined to finish what it started. Is there any way to pause the autoscaler so the balancer has a chance of fixing the reweights? Because even in manual mode (ceph balancer optimize), the balancer won't compute a plan when the misplaced ratio is higher than target_max_misplaced_ratio. I know about "ceph osd reweight-*", but they adjust the reweights (visible in "ceph osd tree"), whereas the balancer adjusts the "compat weight-set", which I don't know how to convert back to the old-style reweights. Best regards, Jan-Philipp ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
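For reference, the shared threshold being hit here is an mgr option that can at least be inspected, and some operators raise it temporarily so a balancer plan can be computed (the new value below is only an example, and raising it also lets the autoscaler queue more misplaced data at once):

    ceph config get mgr target_max_misplaced_ratio       # default is 0.05
    ceph config set mgr target_max_misplaced_ratio 0.10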
[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14
Thanks for the summary, Dan! I'm still hesitating upgrading our production environment from N to O, your experience sounds reassuring though. I have one question, did you also switch to cephadm and containerize all daemons? We haven't made a decision yet, but I guess at some point we'll have to switch anyway, so we could also just get over it. :-D We'll need to discuss it with the team... Thanks, Eugen Zitat von Dan van der Ster : Dear friends, This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14, successfully, following the procedure at https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus It's a 400TB cluster which is 10% used with 72 osds (block=hdd, block.db=ssd) and 40M objects. * The mons upgraded cleanly as expected. * One minor surprise was that the mgrs respawned themselves moments after the leader restarted into octopus: 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0) 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256 in quorum (ranks 0,1,2) 2021-09-21 10:16:39.046 7fae3caf8700 1 mgr handle_mgr_map respawning because set of enabled modules changed! This didn't create any problems AFAICT. * The osds performed the expected fsck after restarting. Their logs are spammed with things like 2021-09-21T11:15:23.233+0200 7f85901bd700 -1 bluestore(/var/lib/ceph/osd/ceph-1) fsck warning: #174:1e024a6e:::10009663a55.:head# has omap that is not per-pool or pgmeta but that is fully expected AFAIU. Each osd took just under 10 minutes to fsck: 2021-09-21T11:22:27.188+0200 7f85a3a2bf00 1 bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0 errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056 seconds For reference, this cluster was created many major releases ago (maybe firefly) but osds were probably re-created in luminous. The memory usage was quite normal, we didn't suffer any OOMs. * The active mds restarted into octopus without incident. In summary it was a very smooth upgrade. After a week of observation we'll proceed with more production clusters. For our largest S3 cluster with slow hdds, we expect huge fsck transactions, so will wait for https://github.com/ceph/ceph/pull/42958 to be merged before upgrading. Best Regards, and thanks to all the devs for their work, Dan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Modify pgp number after pg_num increased
Hi, IIRC in a different thread you pasted your max-backfill config and it was the lowest possible value (1), right? That's why your backfill is slow. Zitat von "Szabo, Istvan (Agoda)" : Hi, By default in the newer versions of ceph when you increase the pg_num the cluster will start to increase the pgp_num slowly up to the value of the pg_num. I've increased the ec-code data pool from 32 to 128 but 1 node has been added to the cluster and it's very slow. pool 28 'hkg.rgw.buckets.data' erasure profile data-ec size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 55 pgp_num_target 128 autoscale_mode warn last_change 16443 lfor 0/0/14828 fl ags hashpspool stripe_width 16384 application rgw At the moment there has been done 55 out of the 128 pg. Is it safe to set the pgp_num at this stage to 64 and wait until the data will be rebalanced to the newly added node? Thank you ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io