[ceph-users] delete s3 bucket too slow?
Hi,

When deleting an S3 bucket, the operation took longer than the dashboard's time-out, so the delete failed. Some of these buckets are really large (100+ TB), and our cluster is also very busy replacing broken OSD disks, but how can a bucket delete take so long that the dashboard times out and gives up? What is the right way to do this?

We are currently on Quincy (17.2.7), using packages for Ubuntu.

Cheers

/Simon
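For reference, the usual way around the dashboard timeout is to do the delete on the command line on one of the RGW hosts, where nothing times out. A minimal sketch (the bucket name is a placeholder; a run like this can take hours, so start it in screen/tmux):

    # delete a large bucket directly with radosgw-admin (no dashboard timeout involved);
    # --purge-objects removes the objects along with the bucket metadata, and
    # --bypass-gc skips the garbage collector, which is usually faster for huge buckets
    radosgw-admin bucket rm --bucket=my-big-bucket --purge-objects --bypass-gc

    # afterwards, check that the bucket is really gone
    radosgw-admin bucket list | grep my-big-bucket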
[ceph-users] Re: Is there a way to find out which client uses which version of ceph?
Hi Wes,

thanks, the `ceph tell mon.* sessions` got me the answer very quickly :-)

Cheers

/Simon

On Thu, 21 Dec 2023 at 18:27, Wesley Dillingham wrote:
> You can ask the monitor to dump its sessions (which should expose the IPs
> and the release / features); you can then track down by IP those with the
> undesirable features/release.
>
>     ceph daemon mon.`hostname -s` sessions
>
> Assuming your mon is named after the short hostname, you may need to do
> this for every mon. Alternatively, use `ceph tell mon.* sessions` to
> hit every mon at once.
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
> On Thu, Dec 21, 2023 at 10:46 AM Anthony D'Atri wrote:
>>
>> [rook@rook-ceph-tools-5ff8d58445-gkl5w .aws]$ ceph features
>> {
>>     "mon": [
>>         { "features": "0x3f01cfbf7ffd", "release": "luminous", "num": 3 }
>>     ],
>>     "osd": [
>>         { "features": "0x3f01cfbf7ffd", "release": "luminous", "num": 600 }
>>     ],
>>     "client": [
>>         { "features": "0x2f018fb87aa4aafe", "release": "luminous", "num": 41 },
>>         { "features": "0x3f01cfbf7ffd", "release": "luminous", "num": 147 }
>>     ],
>>     "mgr": [
>>         { "features": "0x3f01cfbf7ffd", "release": "luminous", "num": 2 }
>>     ]
>> }
>>
>> IIRC there are nuances; there are cases where a client can *look* like
>> Jewel but actually be okay.
>>
>>> On Dec 21, 2023, at 10:41, Simon Oosthoek wrote:
>>>
>>> Hi,
>>>
>>> Our cluster is currently running Quincy, and I want to set the minimal
>>> client version to Luminous to enable the upmap balancer, but when I
>>> tried to, I got this:
>>>
>>>     # ceph osd set-require-min-compat-client luminous
>>>     Error EPERM: cannot set require_min_compat_client to luminous:
>>>     2 connected client(s) look like jewel (missing 0x800);
>>>     add --yes-i-really-mean-it to do it anyway
>>>
>>> I think I know the most likely candidate (and I've asked them), but is
>>> there a way to find out, the way Ceph seems to know?
>>>
>>> tnx
>>>
>>> /Simon
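For reference, a sketch of turning the session dump into a per-client list. The field names used in the jq filter (con_features_release, entity_name, addrs) are assumptions; inspect one raw entry first and adjust them to what your release actually emits:

    # on a mon host, dump the sessions and pick out clients that still report
    # an old release; field names below are assumed -- check one raw entry first
    ceph daemon mon.$(hostname -s) sessions > /tmp/sessions.json

    jq -r '.[] | select(.con_features_release? == "jewel")
               | "\(.entity_name // .name)\t\(.addrs // .addr)"' /tmp/sessions.json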
[ceph-users] Is there a way to find out which client uses which version of ceph?
Hi,

Our cluster is currently running Quincy, and I want to set the minimal client version to Luminous in order to enable the upmap balancer, but when I tried to, I got this:

    # ceph osd set-require-min-compat-client luminous
    Error EPERM: cannot set require_min_compat_client to luminous: 2 connected client(s) look like jewel (missing 0x800); add --yes-i-really-mean-it to do it anyway

I think I know the most likely candidate (and I've asked them), but is there a way to find out, the way Ceph seems to know?

tnx

/Simon
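For completeness, once the old clients are gone (or you decide to force it), enabling the upmap balancer is just these stock commands:

    # allow upmap (requires luminous+ clients), then switch the balancer to upmap mode
    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status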
[ceph-users] planning upgrade from pacific to quincy
Hi All,

(apologies if you get this twice; I suspect mails from my @science.ru.nl account get dropped by most receiving mail servers, due to the strict DMARC policy (p=reject) in place)

After a long while in HEALTH_ERR state (due to an unfound object, which we eventually decided to "forget"), we are now planning to upgrade our cluster, which is running Pacific (at least on the mons/mdss/osds; the gateways are by accident already running Quincy). The installation is via packages from ceph.com, except for Quincy, which comes from Ubuntu.

ceph versions:

    "mon": { "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)": 3 },
    "mgr": { "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)": 3 },
    "osd": { "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)": 252,
             "ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable)": 12 },
    "mds": { "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)": 2 },
    "rgw": { "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 8 },
    "overall": { "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)": 260,
                 "ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable)": 12,
                 "ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)": 8 }

The OS on the mons and mdss is still Ubuntu 18.04; the osds are a mix of Ubuntu 18 and Ubuntu 20. The gateways are Ubuntu 22.04, which is why those are already on Quincy.

The plan is to move to Quincy and eventually to cephadm/containerised Ceph, since that is apparently "the way to go", though I have my doubts. The steps, in what we think is the right order, are:

- reinstall the mons with Ubuntu 22.04 + Quincy
- reinstall the osds (same)
- reinstall the mdss (same)

Once this is up and running, we want to investigate and migrate to cephadm orchestration.

Alternatives appear to be: move to orchestration first and then upgrade Ceph to Quincy (possibly skipping the Ubuntu upgrade?). Another alternative could be to upgrade to Quincy on Ubuntu 18.04 using packages, but I haven't investigated the availability of Quincy packages for Ubuntu 18.04 (which is out of free (LTS) support by Canonical).

Cheers

/Simon
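For reference, a rough sketch of the package-based upgrade flow for one release hop, in the usual order (mons, then mgrs, then OSDs, then MDS/RGW). Treat it as an outline, not a runbook, and adjust the package list to each host's roles:

    # before starting
    ceph osd set noout

    # on each host, in the usual order (mon -> mgr -> osd -> mds -> rgw):
    apt update && apt install -y ceph-common ceph-mon ceph-mgr ceph-osd   # only the roles the host has
    systemctl restart ceph-mon.target ceph-mgr.target                     # or ceph-osd.target / ceph-mds.target / ceph-radosgw.target
    ceph versions                                                         # confirm the restarted daemons report 17.2.x

    # only after every OSD runs quincy:
    ceph osd require-osd-release quincy
    ceph osd unset noout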
[ceph-users] compounded problems interfering with recovery
Hi,

We're still struggling to get our Ceph cluster back to HEALTH_OK. We have compounded issues interfering with recovery, as I understand it.

To summarize: we have a cluster of 22 OSD nodes running Ceph 16.2.x. About a month back one of the OSD nodes broke down (just the OS disk, but we didn't have a cold spare available, so it took a week to get it fixed). Since the failure of the node, Ceph has of course been repairing the situation, but then it became a problem that our OSDs are really unevenly balanced (lowest below 50%, highest around 85%). So whenever a disk fails (and there have been 2 since then), the load spreads over the other OSDs and our fullest OSDs go over the 85% threshold, slowing down recovery, normal use and rebalancing.

We had issues with degraded PGs that weren't being repaired (we had turned on scrubbing during recovery, since we got messages that lots of PGs weren't being scrubbed in time). Now there's still one PG degraded because one object is unfound.

The whole error state is taking far too long now, and while this was going on I was wondering why the balancer wasn't doing its job. It turns out that it depends on the cluster being OK, or at least not having anything degraded in it. The balancer hadn't done its job even though our cluster was OK for a long time before: we added some 8 nodes a few years ago and the newer nodes still have the lowest-used OSDs. Our cluster is at about 70-71% usage overall, but in this unbalanced situation we cannot grow any more.

With the single-node issue (though now resolved) and ongoing disk failures (we are seeing a handful of OSDs with read-repaired messages), it looks like we can't get back to health for a while. I'm trying to mitigate this by reweighting the fullest OSDs, but the fuller OSDs keep going over the threshold, while the emptiest OSDs have plenty of space (just 55% full now).

If you read this far ;-) I'm wondering: can I force-repair a PG around all the restrictions, so it doesn't block auto rebalancing? It seems to me like that would help, but perhaps there are other things I can do as well? (Budget-wise, adding more OSD nodes is a bit difficult at the moment...)

Thanks for reading!

Cheers

/Simon
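For what it's worth, a hedged sketch of nudging the full OSDs down in small steps while recovery runs; the dry-run variant shows what would change before anything moves:

    # see the current utilization spread (the %USE column position may differ per release)
    ceph osd df tree

    # dry run: oload threshold 115%, max weight change 0.05, at most 10 OSDs touched
    ceph osd test-reweight-by-utilization 115 0.05 10

    # apply the same thing for real once the dry run looks sane
    ceph osd reweight-by-utilization 115 0.05 10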
[ceph-users] Re: cannot repair a handful of damaged pg's
Hi Wesley,

On 06/10/2023 17:48, Wesley Dillingham wrote:
> A repair is just a type of scrub and it is also limited by osd_max_scrubs,
> which in pacific is 1.

We've increased that to 4 (and temporarily to 8) since we have so many OSDs and are running behind on scrubbing.

> If another scrub is occurring on any OSD in the PG it won't start.

That explains a lot.

> do "ceph osd set noscrub" and "ceph osd set nodeep-scrub", wait for all
> scrubs to stop (a few seconds probably), then issue the pg repair command
> again. It may start.

The script Kai linked seems like a good idea to fix this when needed.

> You also have pgs in backfilling state. Note that by default OSDs in
> backfill or backfill_wait also won't perform scrubs. You can modify this
> behavior with `ceph config set osd osd_scrub_during_recovery true`

We've set this already.

> I would suggest only setting that after the noscrub flags are set and the
> only scrub you want to get processed is your manual repair. Then rm the
> scrub_during_recovery config item before unsetting the noscrub flags.

Thanks for the suggestion!

Cheers

/Simon

> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
> On Fri, Oct 6, 2023 at 11:02 AM Simon Oosthoek <s.oosth...@science.ru.nl> wrote:
>> [...]
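Put together as one block, the sequence Wes describes looks roughly like this (PG ids taken from the health output earlier in the thread):

    # stop new scrubs so the manual repair is the only scrub-like activity
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph config set osd osd_scrub_during_recovery true   # allow the repair despite backfill

    ceph pg repair 26.337
    ceph pg repair 26.3e2

    # wait for "ceph health detail" to show the PGs clean, then put everything back
    ceph config rm osd osd_scrub_during_recovery
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub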
[ceph-users] Re: cannot repair a handful of damaged pg's
On 06/10/2023 16:09, Simon Oosthoek wrote:
> Hi,
>
> we're still in HEALTH_ERR state with our cluster; this is the top of the
> output of `ceph health detail`:
>
> [...]
>
> For the PG_DAMAGED pgs I try the usual `ceph pg repair 26.323` etc.,
> however it fails to get resolved.
>
> The osd.116 is already marked out and is beginning to get empty. I've
> tried restarting the osd processes of the first osd listed for each PG,
> but that doesn't get it resolved either.
>
> I guess we should have enough redundancy to get the correct data back,
> but how can I tell ceph to fix it in order to get back to a healthy state?

I guess this could be related to the number of scrubs going on; I read somewhere that this may interfere with the repair request. I would expect the repair to have priority over scrubs...

BTW, we're running Pacific for now; we want to update when the cluster is healthy again.

Cheers

/Simon
[ceph-users] cannot repair a handful of damaged pg's
Hi,

we're still in HEALTH_ERR state with our cluster; this is the top of the output of `ceph health detail`:

    HEALTH_ERR 1/846829349 objects unfound (0.000%); 248 scrub errors; Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent; Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg degraded, 1 pg undersized; 63 pgs not deep-scrubbed in time; 657 pgs not scrubbed in time
    [WRN] OBJECT_UNFOUND: 1/846829349 objects unfound (0.000%)
        pg 26.323 has 1 unfound objects
    [ERR] OSD_SCRUB_ERRORS: 248 scrub errors
    [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent
        pg 26.323 is active+recovery_unfound+degraded+remapped, acting [92,109,116,70,158,128,243,189,256], 1 unfound
        pg 26.337 is active+clean+inconsistent, acting [139,137,48,126,165,89,237,199,189]
        pg 26.3e2 is active+clean+inconsistent, acting [12,27,24,234,195,173,98,32,35]
    [WRN] PG_DEGRADED: Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
        pg 13.3a5 is stuck undersized for 4m, current state active+undersized+remapped+backfilling, last acting [2,45,32,62,2147483647,55,116,25,225,202,240]
        pg 26.323 is active+recovery_unfound+degraded+remapped, acting [92,109,116,70,158,128,243,189,256], 1 unfound

For the PG_DAMAGED pgs I try the usual `ceph pg repair 26.323` etc., however it fails to get resolved.

The osd.116 is already marked out and is beginning to get empty. I've tried restarting the osd processes of the first osd listed for each PG, but that doesn't get it resolved either.

I guess we should have enough redundancy to get the correct data back, but how can I tell Ceph to fix it, in order to get back to a healthy state?

Cheers

/Simon
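Two things that usually help at this point: listing what the scrub actually flagged, and (only as a last resort, since it gives up on the object) dealing with the unfound object explicitly. I believe only "delete" is available for the unfound case on EC pools, so treat this as a hedged sketch:

    # show which objects/shards the scrub flagged, per PG
    rados list-inconsistent-obj 26.337 --format=json-pretty
    rados list-inconsistent-obj 26.3e2 --format=json-pretty

    # inspect the unfound object, and -- last resort, data loss for that object --
    # tell ceph to stop waiting for it
    ceph pg 26.323 list_unfound
    ceph pg 26.323 mark_unfound_lost delete    # "revert" is an option on replicated pools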
[ceph-users] Re: ceph osd down doesn't seem to work
Hi Josh,

thanks for the explanation; I want to mark it out, not down :-)

Most use of our cluster is in EC 8+3 or 5+4 pools, so one missing OSD isn't bad, but if some of the blocks can still be read it may help to move them to safety. (This is how I imagine things anyway ;-)

I'll have to look into manually correcting those inconsistent PGs if they don't recover by ceph-magic alone...

Cheers

/Simon

On 03/10/2023 18:21, Josh Baergen wrote:
> Hi Simon,
>
> If the OSD is actually up, using 'ceph osd down' will cause it to flap
> but come back immediately. To prevent this, you would want to 'ceph osd
> set noup'. However, I don't think this is what you actually want:
>
>> I'm thinking (but perhaps incorrectly?) that it would be good to keep
>> the OSD down+in, to try to read from it as long as possible
>
> In this case, you actually want it up+out ('ceph osd out XXX'), though if
> it's replicated then marking it out will switch primaries around so that
> it's not actually read from anymore.
>
> It doesn't look like you have that much recovery backfill left, so
> hopefully you'll be in a clean state soon, though you'll have to deal
> with those 'inconsistent' and 'recovery_unfound' PGs.
>
> Josh
>
> On Tue, Oct 3, 2023 at 10:14 AM Simon Oosthoek wrote:
>> [...]
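A small sketch of the up+out approach Josh describes, with the extra step of dropping primary affinity so replicated pools stop reading from it (the OSD id is just an example):

    # drain the failing OSD but leave the daemon running so its shards stay readable
    ceph osd out 116

    # for replicated pools, stop it being picked as primary while it drains
    ceph osd primary-affinity osd.116 0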
[ceph-users] ceph osd down doesn't seem to work
Hi,

I'm trying to mark one OSD as down, so we can clean it out and replace it. It keeps getting medium read errors, so it's bound to fail sooner rather than later.

When I command Ceph from the mon to mark the osd down, it doesn't actually do it. When the service on the osd stops, it is also marked out, and I'm thinking (but perhaps incorrectly?) that it would be good to keep the OSD down+in, to try to read from it as long as possible.

Why doesn't it get marked down and stay that way when I command it?

Context: our cluster is in a bit of a less than optimal state (see below). This is after one of our OSD nodes failed and took a week to get back up (long story). Due to a seriously unbalanced filling of our OSDs we keep having to reweight OSDs to stay below the 85% threshold. Several disks are starting to fail now (they're 4+ years old and failures are expected to occur more frequently).

I'm open to suggestions to help get us back to HEALTH_OK more quickly, but I think we'll get there eventually anyway...

Cheers

/Simon

    # ceph -s
      cluster:
        health: HEALTH_ERR
                1 clients failing to respond to cache pressure
                1/843763422 objects unfound (0.000%)
                noout flag(s) set
                14 scrub errors
                Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent
                Degraded data redundancy: 13795525/7095598195 objects degraded (0.194%), 13 pgs degraded, 12 pgs undersized
                70 pgs not deep-scrubbed in time
                65 pgs not scrubbed in time

      services:
        mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 11h)
        mgr: cephmon3(active, since 35h), standbys: cephmon1
        mds: 1/1 daemons up, 1 standby
        osd: 264 osds: 264 up (since 2m), 264 in (since 75m); 227 remapped pgs
             flags noout
        rgw: 8 daemons active (4 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   15 pools, 3681 pgs
        objects: 843.76M objects, 1.2 PiB
        usage:   2.0 PiB used, 847 TiB / 2.8 PiB avail
        pgs:     13795525/7095598195 objects degraded (0.194%)
                 54839263/7095598195 objects misplaced (0.773%)
                 1/843763422 objects unfound (0.000%)
                 3374 active+clean
                 195  active+remapped+backfill_wait
                 65   active+clean+scrubbing+deep
                 20   active+remapped+backfilling
                 11   active+clean+snaptrim
                 10   active+undersized+degraded+remapped+backfill_wait
                 2    active+undersized+degraded+remapped+backfilling
                 2    active+clean+scrubbing
                 1    active+recovery_unfound+degraded
                 1    active+clean+inconsistent

      progress:
        Global Recovery Event (8h)
          [==..] (remaining: 2h)
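On the "why doesn't ceph osd down stick" part: a healthy daemon simply reports itself up again right away. A hedged sketch of making it stick without stopping the daemon (per-OSD noup flags have existed since Luminous; the id is a placeholder):

    # prevent this one OSD from marking itself up again, then mark it down
    ceph osd add-noup 123
    ceph osd down 123

    # later, when you want it to rejoin
    ceph osd rm-noup 123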
[ceph-users] Re: v16.2.12 Pacific (hot-fix) released
Dear list,

We upgraded to 16.2.12 on April 17th. Since then we've seen some unexplained downed OSD services in our cluster (264 OSDs). Is there any risk of data loss? If so, would it be possible to downgrade, or is a fix expected soon, and if so, when? ;-)

FYI, we are running a cluster without cephadm, installed from packages.

Cheers

/Simon

On 23/04/2023 03:03, Yuri Weinstein wrote:
> We are writing to inform you that Pacific v16.2.12, released on April
> 14th, has many more unintended commits in the changelog than are listed
> in the release notes [1].
>
> As these extra commits are not fully tested, we request that all users
> please refrain from upgrading to v16.2.12 at this time. The current
> v16.2.12 will be QE validated and released as soon as possible.
>
> v16.2.12 was a hotfix release meant to resolve several performance flaws
> in ceph-volume, particularly during osd activation. The extra commits
> target v16.2.13.
>
> We apologize for the inconvenience. Please reach out to the mailing list
> with any questions.
>
> [1] https://ceph.io/en/news/blog/2023/v16-2-12-pacific-released/
>
> On Fri, Apr 14, 2023 at 9:42 AM Yuri Weinstein wrote:
>> We're happy to announce the 12th hot-fix release in the Pacific series.
>>
>> https://ceph.io/en/news/blog/2023/v16-2-12-pacific-released/
>>
>> Notable Changes
>> ---------------
>> This is a hotfix release that resolves several performance flaws in
>> ceph-volume, particularly during osd activation
>> (https://tracker.ceph.com/issues/57627)
>>
>> Getting Ceph
>> * Git at git://github.com/ceph/ceph.git
>> * Tarball at https://download.ceph.com/tarballs/ceph-16.2.12.tar.gz
>> * Containers at https://quay.io/repository/ceph/ceph
>> * For packages, see https://docs.ceph.com/en/latest/install/get-packages/
>> * Release git sha1: 5a2d516ce4b134bfafc80c4274532ac0d56fc1e2
[ceph-users] dashboard version of ceph versions shows N/A
Dear list,

Yesterday we updated our Ceph cluster from 15.2.17 to 16.2.10 using packages. Our cluster is a mix of Ubuntu 18 and Ubuntu 20, with Ceph coming from packages in the ceph.com repo. All went well and we now have all nodes running Pacific.

However, there's something odd in the dashboard: when I look at '/#/hosts', the dashboard shows 'N/A' for every column starting from "Model", and the columns "Labels" and "Status" are empty.

We didn't change anything on the prometheus/grafana node, so that could be an issue, but I don't know if it's causing this particular problem.

NB, Ceph itself seems happy enough; it's just the dashboard not showing the versions anymore.

Does this ring a bell for anyone?

Cheers

/Simon
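Not an answer, but a couple of hedged checks: in Pacific the Model/Status columns on the Hosts page are filled in by the orchestrator module rather than by the mons, so on a package-installed cluster without cephadm they can legitimately come up as N/A. To see what the active mgr thinks is available:

    # is any orchestrator backend configured at all?
    ceph orch status

    # are the dashboard and prometheus modules enabled on the active mgr?
    ceph mgr module ls
    ceph mgr services      # URLs the active mgr exposes (dashboard, prometheus)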
[ceph-users] Re: how to upgrade host os under ceph
Hi Anthony,

On 27/10/2022 21:44, Anthony D'Atri wrote:
> Another factor is "Do I *really* need to upgrade the OS?"

That's a good question; opinions vary on this, I've noticed ;-)

> If you have org-wide management/configuration that requires you to
> upgrade, that's one thing, but presumably your hosts are not accessible
> from the outside, so do you have a compelling reason? The "immutable
> infrastructure" folks may be on to something. Upgrades always take a lot
> of engineer time and are when things tend to go wrong.

Obviously the Ceph nodes are not publicly accessible, but we do like to keep the cluster as maintainable as possible by keeping things simple. Having an older, unsupported Ubuntu version around is kind of a red flag, even though it could be fine to leave it.

And of course there's the problem that we want to keep Ceph not too far behind supported releases, and at some point (before the hardware expires) no new version of Ceph will be available for the older, unsupported Ubuntu. Furthermore, waiting until this happens is a recipe for having to re-invent the wheel; I believe we should get practice and become comfortable doing this, so it's not such a looming issue. It's also useful to have this in our fingers for when e.g. an OS disk fails for some reason.

So that would be my reason to still want to upgrade, even though there may not be an urgent reason...

Cheers

/Simon

> On Oct 27, 2022, at 03:16, Simon Oosthoek wrote:
> [...]
[ceph-users] Re: how to upgrade host os under ceph
Dear list,

thanks for the answers, it looks like we have worried about this far too much ;-)

Cheers

/Simon

On 26/10/2022 22:21, shubjero wrote:
> We've done 14.04 -> 16.04 -> 18.04 -> 20.04 all at various stages of our
> ceph cluster life. The latest 18.04 to 20.04 was painless and we ran:
>
>     apt update && apt dist-upgrade -y -o Dpkg::Options::=\"--force-confdef\" -o Dpkg::Options::=\"--force-confold\"
>     do-release-upgrade --allow-third-party -f DistUpgradeViewNonInteractive
>
> On Wed, Oct 26, 2022 at 11:17 AM Reed Dier wrote:
>> You should be able to `do-release-upgrade` from bionic/18 to focal/20.
>> Octopus/15 is shipped for both dists from ceph.
>>
>> It's been a while since I did this; the release upgrader might disable
>> the ceph repo and uninstall the ceph* packages. However, the OSDs should
>> still be there. Re-enable the ceph repo, install ceph-osd, and then
>> `ceph-volume lvm activate --all` should find and start all of the OSDs.
>>
>> Caveat: if you're using cephadm, I'm sure the process is different.
>>
>> And also, if you're trying to go to jammy/22, that's a different story,
>> because ceph isn't shipping packages for jammy yet for any version of
>> ceph. I assume that they are going to ship quincy for jammy at some
>> point, which will give a stepping stone from focal to jammy with the
>> quincy release, because I don't imagine that there will be a reef
>> release for focal.
>>
>> Reed
>>
>>> On Oct 26, 2022, at 9:14 AM, Simon Oosthoek wrote:
>>> [...]
[ceph-users] how to upgrade host os under ceph
Dear list,

I'm looking for a guide or some pointers on how people upgrade the underlying host OS in a Ceph cluster (if that is even the right way to proceed, I don't know...).

Our cluster is nearing 4.5 years of age and our Ubuntu 18.04 is nearing its end-of-support date. We have a mixed cluster of u18 and u20 nodes, all running Octopus at the moment.

We would like to upgrade the OS on the nodes without changing the Ceph version for now (or per se).

Is it as easy as installing a new OS version, installing the ceph-osd package and a correct ceph.conf file, and restoring the host key?

Or is more needed regarding the specifics of the OSD disks/WAL/journal?

Or is it necessary to drain a node of all data and re-add the OSDs as new units? (That would be too much work, so I doubt it ;-)

The problem with searching for information about this is that it seems undocumented in the Ceph documentation, and search results are flooded with Ceph version upgrades.

Cheers

/Simon
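For what it's worth, the per-node flow described elsewhere in this thread boils down to roughly this sketch (it assumes the OSD data/DB/WAL devices are left untouched by the reinstall):

    # on any mon, before taking the node down
    ceph osd set noout

    # ...reinstall the OS, restore networking, re-add the ceph apt repo...
    apt install -y ceph-osd

    # restore /etc/ceph/ceph.conf and the OSD bootstrap keyring
    # (/var/lib/ceph/bootstrap-osd/ceph.keyring), then bring the OSDs back up
    ceph-volume lvm activate --all

    # once the node's OSDs are up and in again
    ceph osd unset noout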
[ceph-users] Re: post-mortem of a ceph disruption
On 26/10/2022 10:57, Stefan Kooman wrote:
> On 10/25/22 17:08, Simon Oosthoek wrote:
>> At this point, one of us noticed that a strange IP address was
>> mentioned: 169.254.0.2. It turns out that a recently added package
>> (openmanage) and some configuration had added this interface and
>> address to hardware nodes from Dell.
>>
>> For us, our single-interface assumption is now out the window and
>> 0.0.0.0/0 is a bad idea in /etc/ceph/ceph.conf for public and cluster
>> network (though it's the same network for us). Our 3 datacenters are on
>> three different subnets, so it becomes a bit difficult to make it more
>> specific. The nodes are all under the same /16, so we can choose that,
>> but it is starting to look like a weird network setup.
>>
>> I've always thought that this configuration was kind of non-intuitive
>> and I still do. And now it has bitten us :-(
>>
>> Thanks for reading and if you have any suggestions on how to
>> fix/prevent this kind of error, we'll be glad to hear it!
>
> We don't have the public_network specified in our cluster(s). AFAIK it's
> not needed (anymore). There is no default network address range
> configured. So I would just get rid of it. Same for cluster_network if
> you have that configured. There, I fixed it! ;-)

Hi Stefan,

thanks for the suggestions! I've removed the cluster_network definition, but retained the public_network definition in a more specific way (a list of the subnets that we are using for Ceph nodes). In the code it isn't entirely clear to us what happens when public_network is undefined...

> If you don't use IPv6, I would explicitly turn it off:
>
>     ms_bind_ipv6 = false

I just added this, it seems like a no-brainer.

Cheers

/Simon
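If you prefer the mon config database over editing ceph.conf on every node, the same changes look roughly like this (the subnets are placeholders for the three datacenter ranges):

    # drop the cluster_network, keep public_network but make it specific
    ceph config rm global cluster_network
    ceph config set global public_network "192.0.2.0/24,198.51.100.0/24,203.0.113.0/24"

    # no IPv6 in use, so bind IPv4 only
    ceph config set global ms_bind_ipv6 false

    # confirm what is actually stored
    ceph config dump | grep -E 'public_network|cluster_network|ms_bind'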
[ceph-users] post-mortem of a ceph disruption
Dear list,

Recently we experienced a short outage of our Ceph storage. It was a surprising cause, and probably indicates a subtle misconfiguration on our part; I'm hoping for a useful suggestion ;-)

We are running a 3 PB cluster with 21 OSD nodes (spread across 3 datacenters), 3 mon/mgrs and 2 mds nodes. Currently we are on Octopus 15.2.16 (we will upgrade to .17 soon). The cluster has a single network interface per node (most are a bond) at 25 Gbit/s. The physical nodes are all Dell AMD EPYC hardware. The "cluster network" and "public network" settings in /etc/ceph/ceph.conf were all set to 0.0.0.0/0, since we only have a single interface on all Ceph nodes (or so we thought...).

Our nodes are managed using cfengine3 (community), though we avoid package upgrades during normal operation. New packages are installed, though, if commanded by cfengine.

Last Sunday around 23:05 (local time) we experienced a short network glitch (an MLAG link lost one sublink for 4 seconds). Our logs show that it should have been relatively painless, since the peer-link took over and after 4 seconds the MLAG went back to FULL mode. However, it seems a lot of ceph-osd services restarted or re-connected to the network and failed to find the other Ceph OSDs. They consequently shut themselves down. Shortly after this happened, the Ceph services became unavailable due to not enough OSD nodes, so services of ours depending on Ceph became unavailable as well. At this point I was able to start trying to fix it: I tried rebooting a Ceph OSD machine and also tried restarting just the OSD services on the nodes. Both seemed to work, and I could soon turn in when all was well again.

When trying to understand what had happened, we obviously suspected all kinds of unrelated things (the Ceph logs are way too noisy to quickly get to the point), but one message,

    osd.54 662927 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory

turned out to be more important than we first thought, after some googling (https://forum.proxmox.com/threads/ceph-set_numa_affinity-unable-to-identify-public-interface.58239/).

We couldn't understand why the network glitch could cause such a massive die-off of ceph-osd services. Assuming that sooner or later we were going to need some help with this, it seemed a good idea to first get busy updating the nodes to the latest and then to supported releases of Ceph, so we started the upgrade to 15.2.17 today. The upgrade of the 2 virtual and 1 physical mon went OK, and so did the first OSD node. But on the second OSD node, the OSD services would not keep running after the upgrade+reboot. Again we noticed this numa message, but now 6 times in a row, and then the nice:

    _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down

and "received signal: Interrupt from Kernel".

At this point, one of us noticed that a strange IP address was mentioned: 169.254.0.2. It turns out that a recently added package (openmanage) and some configuration had added this interface and address to hardware nodes from Dell.

For us, our single-interface assumption is now out the window and 0.0.0.0/0 is a bad idea in /etc/ceph/ceph.conf for public and cluster network (though it's the same network for us). Our 3 datacenters are on three different subnets, so it becomes a bit difficult to make it more specific. The nodes are all under the same /16, so we could choose that, but it is starting to look like a weird network setup.

I've always thought that this configuration was kind of non-intuitive and I still do. And now it has bitten us :-(

Thanks for reading, and if you have any suggestions on how to fix/prevent this kind of error, we'll be glad to hear it!

Cheers

/Simon
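A quick way to verify, after the fact, that no daemon is still bound to the stray link-local address (a hedged check, nothing more):

    # addresses registered in the OSD map; should contain no 169.254.x.x entries
    ceph osd dump | grep -c '169.254.'

    # on an OSD node: what the daemons are actually listening on
    ss -tlnp | grep ceph-osd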
[ceph-users] crush rule for 4 copy over 3 failure domains?
Dear ceph users,

Since recently we have 3 locations with Ceph OSD nodes. For 3-copy pools it is trivial to create a crush rule that uses all 3 datacenters for each object, but 4-copy is harder. Our current "replicated" rule is this:

    rule replicated_rule {
            id 0
            type replicated
            min_size 2
            max_size 4
            step take default
            step choose firstn 2 type datacenter
            step chooseleaf firstn 2 type host
            step emit
    }

For 3-copy, the rule would be:

    rule replicated_rule_3copy {
            id 5
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 3 type datacenter
            step chooseleaf firstn 1 type host
            step emit
    }

But 4-copy requires an additional OSD, so how do I tell the CRUSH algorithm to first take one from each datacenter and then take one more from any datacenter? I'd be interested to know if this is possible and if so, how...

Having said that, I don't think there's much additional value in a 4-copy pool compared to a 3-copy pool across 3 separate locations. Or is there (apart from the one extra copy in general)?

Cheers

/Simon
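One pattern that comes up for "one copy in each of the 3 datacenters plus one more anywhere" is a rule with two take/emit passes, as sketched below. This is untested here and has a real caveat: the second pass chooses independently, so it can land on a host (or even an OSD) already picked in the first pass. Test any candidate with crushtool --show-mappings and check for duplicates before trusting it.

    rule replicated_rule_4copy {
            id 6
            type replicated
            min_size 2
            max_size 4
            step take default
            step choose firstn 3 type datacenter
            step chooseleaf firstn 1 type host
            step emit
            step take default
            step chooseleaf firstn -3 type host
            step emit
    }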
[ceph-users] Re: crushtool -i; more info from output?
In case anyone is interested: I hacked up some more perl code to parse the tree output of crushtool, so it uses the actual info from the new crushmap instead of the production info from Ceph itself.

See: https://gist.github.com/pooh22/53960df4744efd9d7e0261ff92e7e8f4

Cheers

/Simon

On 02/12/2021 13:23, Simon Oosthoek wrote:
> On 02/12/2021 10:20, Simon Oosthoek wrote:
>> [...]
>
> I created a very rudimentary parser; just pipe the output of the
> crushtool -i command to this script. In the script you can either
> uncomment the full location tree info, or just the top-level location.
>
> The script is here:
> https://gist.github.com/pooh22/5065d7c8777e6f07b0801d0b30c027d2
>
> Please use as you like, I welcome comments and improvements of course...
>
> /Simon
[ceph-users] Re: crushtool -i; more info from output?
On 02/12/2021 10:20, Simon Oosthoek wrote:
> Dear ceph-users,
>
> We want to optimise our crush rules further, and to test adjustments
> without impact to the cluster we use crushtool to show the mappings.
>
> [...]
>
> Since the info is already known in the crushmap, I was wondering if
> someone has already hacked up some wrapper script that looks up the
> locations of the osds, or if work is ongoing to add an option to
> crushtool to output the locations with the osd numbers? If not, I might
> write a wrapper myself...

Dear list,

I created a very rudimentary parser; just pipe the output of the crushtool -i command to this script. In the script you can either uncomment the full location tree info, or just the top-level location.

The script is here: https://gist.github.com/pooh22/5065d7c8777e6f07b0801d0b30c027d2

Please use as you like, I welcome comments and improvements of course...

/Simon
[ceph-users] crushtool -i; more info from output?
Dear ceph-users,

We want to optimise our crush rules further, and to test adjustments without impact to the cluster we use crushtool to show the mappings, e.g.:

    crushtool -i crushmap.16 --test --num-rep 4 --show-mappings --rule 0 | tail -n 10
    CRUSH rule 0 x 1014 [121,125,195,197]
    CRUSH rule 0 x 1015 [20,1,40,151]
    CRUSH rule 0 x 1016 [194,244,158,3]
    CRUSH rule 0 x 1017 [39,113,242,179]
    CRUSH rule 0 x 1018 [131,113,199,179]
    CRUSH rule 0 x 1019 [64,63,221,181]
    CRUSH rule 0 x 1020 [26,111,188,179]
    CRUSH rule 0 x 1021 [125,78,247,214]
    CRUSH rule 0 x 1022 [48,125,246,258]
    CRUSH rule 0 x 1023 [0,88,237,211]

The OSD numbers in brackets are not the full story, of course... It would be nice to see more info about the location hierarchy that is in the crushmap, because we want to make sure the redundancy is spread optimally across our datacenters and racks/hosts. In the current output, this requires lookups to find out the locations of the OSDs before we can be sure.

Since the info is already known in the crushmap, I was wondering if someone has already hacked up some wrapper script that looks up the locations of the osds, or if work is ongoing to add an option to crushtool to output the locations with the osd numbers? If not, I might write a wrapper myself...

Cheers

/Simon
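In the meantime, a quick one-off lookup can be done against the tree output of the candidate map itself: remember the most recent datacenter/rack/host bucket while scanning, and print them when the target OSD shows up. This assumes the bucket name is the last column, as in `ceph osd tree`; adjust if your crushtool --tree output differs.

    crushtool -i crushmap.16 --tree | awk -v target='osd.121' '
        /datacenter/      { dc   = $NF }     # most recent datacenter bucket seen
        /rack/            { rack = $NF }     # most recent rack bucket seen
        /host/            { host = $NF }     # most recent host bucket seen
        index($0, target) { print target, "->", dc "/" rack "/" host; exit }'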
[ceph-users] Re: ceph-ansible and crush location
On 03/11/2021 16:03, Simon Oosthoek wrote:
> On 03/11/2021 15:48, Stefan Kooman wrote:
>> On 11/3/21 15:35, Simon Oosthoek wrote:
>>> Dear list,
>>>
>>> I've recently found it is possible to supply ceph-ansible with
>>> information about a crush location, however I fail to understand how
>>> this is actually used. It doesn't seem to have any effect when creating
>>> a cluster from scratch (I'm testing on a bunch of VMs generated by
>>> vagrant and cloud-init and some custom ansible playbooks).

It turns out (I think) that to be able to use this, you need both "crush_rule_config: true" and "create_crush_tree: true". Then it works as expected.

The unknown for now is what happens with existing crush rules, whether they would be removed, and how we could define them in the osds.yml for ceph-ansible...

Eventually I hope this can be a useful thing to enable, but perhaps not for now.

Cheers

/Simon
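For reference, the pieces end up looking roughly like the sketch below. The exact variable names and formats (the crush_rules entries, osd_crush_location) differ between ceph-ansible branches, so treat every name here as an assumption to be checked against the branch you run:

    # group_vars/all.yml (sketch, names assumed)
    crush_rule_config: true
    create_crush_tree: true
    crush_rules:
      - name: replicated_dc
        root: default
        type: datacenter
        default: true

    # host_vars/cephosd01.yml (sketch, variable name and format assumed)
    osd_crush_location: "\"root=default datacenter=d1 rack=r1 host=cephosd01\""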
[ceph-users] Re: ceph-ansible and crush location
On 03/11/2021 15:48, Stefan Kooman wrote:
> On 11/3/21 15:35, Simon Oosthoek wrote:
>> Dear list,
>>
>> I've recently found it is possible to supply ceph-ansible with
>> information about a crush location, however I fail to understand how
>> this is actually used. It doesn't seem to have any effect when creating
>> a cluster from scratch (I'm testing on a bunch of VMs generated by
>> vagrant and cloud-init and some custom ansible playbooks).
>>
>> Then I thought I may need to add the locations to the crushmap by hand
>> and then rerun the site.yml, but this also doesn't update the crushmap.
>>
>> Then I was looking at the documentation here:
>> https://docs.ceph.com/en/octopus/rados/operations/crush-map/#crush-location
>> And it seems ceph is able to update the osd location upon startup, if
>> configured to do so... I don't think this is being used in a cluster
>> generated by ceph-ansible though...
>
> osd_crush_update_on_start is true by default. So you would have to
> disable it explicitly.

OK, so this isn't happening, because there's no configuration for it in our nodes' /etc/ceph/ceph.conf files...

>> Would it be possible/wise to modify ceph-ansible to e.g. generate files
>> like /etc/ceph/crushlocation and fill that with information from the
>> inventory, like
>
> Possible: yes. Wise: not sure. If you mess this up for whatever reason,
> and buckets / OSDs get reshuffled, this might lead to massive data
> movement and possibly even worse, availability issues, i.e. when all
> your OSDs are moved to buckets that are not matching any CRUSH rule.

Indeed, getting this wrong is a major PITA, but not having the OSDs in the correct location is also undesirable. I prefer to document/configure everything in one place, so there isn't any contradicting data. In this light, I would say that ceph-ansible is the right way to set this up. (Now to figure out how and where ;-)

And of course, it's bothersome to maintain a patch on top of the stock ceph-ansible, so it would be really nice if this kind of change could be added upstream to ceph-ansible.

Cheers

/Simon
[ceph-users] ceph-ansible and crush location
Dear list,

I've recently found it is possible to supply ceph-ansible with information about a crush location, however I fail to understand how this is actually used. It doesn't seem to have any effect when creating a cluster from scratch (I'm testing on a bunch of VMs generated by vagrant and cloud-init and some custom ansible playbooks).

Then I thought I may need to add the locations to the crushmap by hand and then rerun the site.yml, but this also doesn't update the crushmap.

Then I was looking at the documentation here:
https://docs.ceph.com/en/octopus/rados/operations/crush-map/#crush-location
It seems Ceph is able to update the OSD location upon startup, if configured to do so... I don't think this is being used in a cluster generated by ceph-ansible, though...

Would it be possible/wise to modify ceph-ansible to e.g. generate files like /etc/ceph/crushlocation and fill them with information from the inventory, like

    ---
    root=default
    datacenter=d1
    rack=r1
    ---

and place a small shell script to interpret this file and return the output, like

    ---
    #!/bin/sh
    . /etc/ceph/crushlocation
    echo "host=$(hostname -s) datacenter=$datacenter rack=$rack root=$root"
    ---

and configure /etc/ceph/ceph.conf to contain a configuration line (under which heading?) to do this?

    ---
    crush location hook = /path/to/customized-ceph-crush-location
    ---

And finally, are locations automatically defined in the crushmap when they are invoked?

Cheers

/Simon
[ceph-users] ceph-ansible stable-5.0 repository must be quincy?
Hi we're trying to get ceph-ansible working again for our current version of ceph (octopus), in order to be able to add some osd nodes to our cluster. (Obviously there's a longer story here, but just a quick question for now...) When we add in all.yml ceph_origin: repository ceph_repository: community # Enabled when ceph_repository == 'community' # ceph_mirror: https://eu.ceph.com ceph_stable_key: https://eu.ceph.com/keys/release.asc ceph_stable_release: octopus ceph_stable_repo: "{{ ceph_mirror }}/debian-{{ ceph_stable_release }}" This fails with a message originating from - name: validate ceph_repository_community fail: msg: "ceph_stable_release must be 'quincy'" when: - ceph_origin == 'repository' - ceph_repository == 'community' - ceph_stable_release not in ['quincy'] in: ceph-ansible/roles/ceph-validate/tasks/main.yml This is from the "Stable-5.0" branch of ceph-ansible, which is specifically for Octopus, as I understand it... Is this a bug in ceph-ansible in the stable-5.0 branch, or is this our problem in understanding what to put in all.yml to get the octopus repository for ubuntu 20.04? Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
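In case someone else runs into the same validation error: the message complaining about 'quincy' suggests the checked-out ceph-ansible may not actually be the stable-5.0 branch, so it is worth verifying that first. A quick sketch (paths assumed):
---
cd ceph-ansible
git rev-parse --abbrev-ref HEAD      # should print stable-5.0 for octopus
git fetch origin
git checkout stable-5.0
# see what the validate role on this branch really expects
grep -n "ceph_stable_release" roles/ceph-validate/tasks/main.yml
---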
[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)
On 25/03/2021 21:08, Simon Oosthoek wrote: > > I'll wait a bit before upgrading the remaining nodes. I hope 14.2.19 > will be available quickly. > Hi Dan, Just FYI, I upgraded the cluster this week to 14.2.19 and all systems are good now. I've removed the workaround configuration in the /etc/ceph/ceph.conf again. Thanks for the quick help at the time! Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)
On 25/03/2021 20:56, Stefan Kooman wrote: On 3/25/21 8:47 PM, Simon Oosthoek wrote: On 25/03/2021 20:42, Dan van der Ster wrote: netstat -anp | grep LISTEN | grep mgr # netstat -anp | grep LISTEN | grep mgr tcp 0 0 127.0.0.1:6801 0.0.0.0:* LISTEN 1310/ceph-mgr tcp 0 0 127.0.0.1:6800 0.0.0.0:* LISTEN 1310/ceph-mgr tcp6 0 0 :::8443 :::* LISTEN 1310/ceph-mgr tcp6 0 0 :::9283 :::* LISTEN 1310/ceph-mgr unix 2 [ ACC ] STREAM LISTENING 26205 1564/master private/tlsmgr unix 2 [ ACC ] STREAM LISTENING 26410 1310/ceph-mgr /var/run/ceph/ceph-mgr.cephmon1.asok Looks like :-( Ok, but that is easily fixable: ceph config set osd.$id public_addr your_ip_here Or you can put that in the ceph.conf for the OSDs on each storage server. Do you have a cluster network as well? If so you should set that IP too. Only when you run IPv6 only and have not yet set ms_bind_ipv4=false you should not do this. In that case you first have to make sure you set ms_bind_ipv4=false. As soon as your OSDs are bound to their correct IP again they can peer with each other and it will fix itself. @Ceph devs: a 14.2.19 with a fix for this issue would avoid other people running into this issue. Gr. Stefan Hoi Stefan tnx, I only have one network (25Gbit should be enough), after fixing the mon/mgr nodes and the one OSD node that I upgraded, the cluster seems to be recovering. At first I understood Dan's fix to put the mgr's address in all nodes' configs, but after watching the errors, I changed it to the node's own address on each node... I'll wait a bit before upgrading the remaining nodes. I hope 14.2.19 will be available quickly. Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
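For the archives, a rough sketch of applying the per-OSD fix on one storage node (the IP is a placeholder, and this assumes the usual /var/lib/ceph/osd/ceph-<id> layout):
---
#!/bin/sh
# run on each storage node with that node's own public IP
NODE_IP=192.0.2.10   # placeholder, use the node's real address
for dir in /var/lib/ceph/osd/ceph-*; do
    id=${dir##*-}
    ceph config set osd.$id public_addr "$NODE_IP"
done
# restart the local OSDs so they rebind to the correct address
systemctl restart ceph-osd.target
---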
[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)
On 25/03/2021 20:42, Dan van der Ster wrote: netstat -anp | grep LISTEN | grep mgr has it bound to 127.0.0.1 ? (also check the other daemons). If so this is another case of https://tracker.ceph.com/issues/49938 Do you have any idea for a workaround (or should I downgrade?). I'm running ceph on ubuntu 18.04 LTS this seems to be happening on the mons/mgrs and osds Cheers /Simon -- dan On Thu, Mar 25, 2021 at 8:34 PM Simon Oosthoek wrote: Hi I'm in a bit of a panic :-( Recently we started attempting to configure a radosgw to our ceph cluster, which was until now only doing cephfs (and rbd wss working as well). We were messing about with ceph-ansible, as this was how we originally installed the cluster. Anyway, it installed nautilus 14.2.18 on the radosgw and I though it would be good to pull up the rest of the cluster to that level as well using our tried and tested ceph upgrade script (it basically does an update of all ceph nodes one by one and checks whether ceph is ok again before doing the next) After the 3rd mon/mgr was done, all pg's were unavailable :-( obviously, the script is not continuing, but ceph is also broken now... The message deceptively is: HEALTH_WARN Reduced data availability: 5568 pgs inactive That's all PGs! I tried as a desperate measure to upgrade one ceph OSD node, but that broke as well, the osd service on that node gets an interrupt from the kernel the versions are now like: 20:29 [root@cephmon1 ~]# ceph versions { "mon": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 }, "mgr": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 }, "osd": { "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 156 }, "mds": { "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 2 }, "overall": { "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 158, "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 6 } } 12 OSDs are down # ceph -s cluster: id: b489547c-ba50-4745-a914-23eb78e0e5dc health: HEALTH_WARN Reduced data availability: 5568 pgs inactive services: mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 50m) mgr: cephmon1(active, since 53m), standbys: cephmon3, cephmon2 mds: cephfs:1 {0=cephmds2=up:active} 1 up:standby osd: 168 osds: 156 up (since 28m), 156 in (since 18m); 1722 remapped pgs data: pools: 12 pools, 5568 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 5568 unknown progress: Rebalancing after osd.103 marked in [..] ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)
On 25/03/2021 20:42, Dan van der Ster wrote: netstat -anp | grep LISTEN | grep mgr # netstat -anp | grep LISTEN | grep mgr tcp 0 0 127.0.0.1:6801 0.0.0.0:* LISTEN 1310/ceph-mgr tcp 0 0 127.0.0.1:6800 0.0.0.0:* LISTEN 1310/ceph-mgr tcp6 0 0 :::8443 :::* LISTEN 1310/ceph-mgr tcp6 0 0 :::9283 :::* LISTEN 1310/ceph-mgr unix 2 [ ACC ] STREAM LISTENING 26205 1564/master private/tlsmgr unix 2 [ ACC ] STREAM LISTENING 26410 1310/ceph-mgr /var/run/ceph/ceph-mgr.cephmon1.asok Looks like :-( /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
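For completeness, the workaround on the mon/mgr nodes was presumably along these lines (the address is an example; each node gets its own public IP):
---
# added to /etc/ceph/ceph.conf on e.g. cephmon1 as a temporary workaround
# [global]
# public_addr = 192.0.2.1    # this node's own IP

systemctl restart ceph-mon.target ceph-mgr.target
# the mgr should now bind to the public IP instead of 127.0.0.1
netstat -anp | grep LISTEN | grep mgr
---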
[ceph-users] upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)
Hi, I'm in a bit of a panic :-( Recently we started attempting to configure a radosgw to our ceph cluster, which was until now only doing cephfs (and rbd was working as well). We were messing about with ceph-ansible, as this was how we originally installed the cluster. Anyway, it installed nautilus 14.2.18 on the radosgw and I thought it would be good to pull up the rest of the cluster to that level as well using our tried and tested ceph upgrade script (it basically does an update of all ceph nodes one by one and checks whether ceph is ok again before doing the next). After the 3rd mon/mgr was done, all PGs were unavailable :-( Obviously, the script is not continuing, but ceph is also broken now... The message deceptively is: HEALTH_WARN Reduced data availability: 5568 pgs inactive That's all PGs! I tried as a desperate measure to upgrade one ceph OSD node, but that broke as well; the osd service on that node gets an interrupt from the kernel. The versions are now like: 20:29 [root@cephmon1 ~]# ceph versions { "mon": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 }, "mgr": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 }, "osd": { "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 156 }, "mds": { "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 2 }, "overall": { "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 158, "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 6 } } 12 OSDs are down # ceph -s cluster: id: b489547c-ba50-4745-a914-23eb78e0e5dc health: HEALTH_WARN Reduced data availability: 5568 pgs inactive services: mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 50m) mgr: cephmon1(active, since 53m), standbys: cephmon3, cephmon2 mds: cephfs:1 {0=cephmds2=up:active} 1 up:standby osd: 168 osds: 156 up (since 28m), 156 in (since 18m); 1722 remapped pgs data: pools: 12 pools, 5568 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 5568 unknown progress: Rebalancing after osd.103 marked in [..] ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph slow at 80% full, mds nodes lots of unused memory
On 25/02/2021 11:19, Dylan McCulloch wrote: > Simon Oosthoek wrote: >> On 24/02/2021 22:28, Patrick Donnelly wrote: >> > Hello Simon, >> > >> > On Wed, Feb 24, 2021 at 7:43 AM Simon Oosthoek > <s.oosthoek(a)science.ru.nl> wrote: >> > >> > On 24/02/2021 12:40, Simon Oosthoek wrote: >> > Hi >> > >> > we've been running our Ceph cluster for nearly 2 years now (Nautilus) >> > and recently, due to a temporary situation the cluster is at 80% full. >> > >> > We are only using CephFS on the cluster. >> > >> > Normally, I realize we should be adding OSD nodes, but this is a >> > temporary situation, and I expect the cluster to go to <60% full > quite soon. >> > >> > Anyway, we are noticing some really problematic slowdowns. There are >> > some things that could be related but we are unsure... >> > >> > - Our 2 MDS nodes (1 active, 1 standby) are configured with 128GB RAM, >> > but are not using more than 2GB, this looks either very inefficient, or >> > wrong ;-) >> > After looking at our monitoring history, it seems the mds cache is >> > actually used more fully, but most of our servers are getting a weekly >> > reboot by default. This clears the mds cache obviously. I wonder if >> > that's a smart idea for an MDS node...? ;-) >> > No, it's not. Can you also check that you do not have mds_cache_size >> > configured, perhaps on the MDS local ceph.conf? >> > >> Hi Patrick, >> >> I've already changed the reboot period to 1 month. >> >> The mds_cache_size is not configured locally in the /etc/ceph/ceph.conf >> file, so I guess it's just the weekly reboot that cleared the memory of >> cache data... >> >> I'm starting to think that a full ceph cluster could probably be the >> only cause of performance problems. Though I don't know why that would be. > > Did the performance issue only arise when OSDs in the cluster reached > 80% usage? What is your osd nearfull_ratio? > $ ceph osd dump | grep ratio full_ratio 0.95 backfillfull_ratio 0.9 nearfull_ratio 0.85 > Is the cluster in HEALTH_WARN with nearfull OSDs? ]# ceph -s cluster: id: b489547c-ba50-4745-a914-23eb78e0e5dc health: HEALTH_WARN 2 pgs not deep-scrubbed in time 957 pgs not scrubbed in time services: mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 7d) mgr: cephmon3(active, since 2M), standbys: cephmon1, cephmon2 mds: cephfs:1 {0=cephmds2=up:active} 1 up:standby osd: 168 osds: 168 up (since 11w), 168 in (since 9M); 43 remapped pgs task status: scrub status: mds.cephmds2: idle data: pools: 10 pools, 5280 pgs objects: 587.71M objects, 804 TiB usage: 1.4 PiB used, 396 TiB / 1.8 PiB avail pgs: 9634168/5101965463 objects misplaced (0.189%) 5232 active+clean 29 active+remapped+backfill_wait 14 active+remapped+backfilling 5active+clean+scrubbing+deep+repair io: client: 136 MiB/s rd, 600 MiB/s wr, 386 op/s rd, 359 op/s wr recovery: 328 MiB/s, 169 objects/s > We noticed recently when one of our clusters had nearfull OSDs that > cephfs client performance was heavily impacted. > Our cluster is nautilus 14.2.15. Clients are kernel 4.19.154. > We determined that it was most likely due to the ceph client forcing > sync file writes when nearfull flag is present. > https://github.com/ceph/ceph-client/commit/7614209736fbc4927584d4387faade4f31444fce > Increasing and decreasing the nearfull ratio confirmed that performance > was only impacted while the nearfull flag was present. > Not sure if that's relevant for your case. I think this could be very similar in our cluster, thanks for sharing your insights! 
Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
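For anyone else debugging this: the nearfull theory Dylan describes is easy to test (and revert) by nudging the ratio. A sketch, using the ratios from the osd dump above:
---
ceph osd dump | grep ratio           # nearfull_ratio is 0.85 here
ceph health detail | grep -i nearfull
# if OSDs are flagged nearfull, raise the ratio just enough to clear the
# flag and watch whether client latency recovers
ceph osd set-nearfull-ratio 0.87
# and put it back afterwards
ceph osd set-nearfull-ratio 0.85
---
Obviously this only buys headroom for a test; the real fix is still getting the cluster back under the nearfull threshold.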
[ceph-users] Re: ceph slow at 80% full, mds nodes lots of unused memory
On 24/02/2021 22:28, Patrick Donnelly wrote: > Hello Simon, > > On Wed, Feb 24, 2021 at 7:43 AM Simon Oosthoek > wrote: >> >> On 24/02/2021 12:40, Simon Oosthoek wrote: >>> Hi >>> >>> we've been running our Ceph cluster for nearly 2 years now (Nautilus) >>> and recently, due to a temporary situation the cluster is at 80% full. >>> >>> We are only using CephFS on the cluster. >>> >>> Normally, I realize we should be adding OSD nodes, but this is a >>> temporary situation, and I expect the cluster to go to <60% full quite soon. >>> >>> Anyway, we are noticing some really problematic slowdowns. There are >>> some things that could be related but we are unsure... >>> >>> - Our 2 MDS nodes (1 active, 1 standby) are configured with 128GB RAM, >>> but are not using more than 2GB, this looks either very inefficient, or >>> wrong ;-) >> >> After looking at our monitoring history, it seems the mds cache is >> actually used more fully, but most of our servers are getting a weekly >> reboot by default. This clears the mds cache obviously. I wonder if >> that's a smart idea for an MDS node...? ;-) > > No, it's not. Can you also check that you do not have mds_cache_size > configured, perhaps on the MDS local ceph.conf? > Hi Patrick, I've already changed the reboot period to 1 month. The mds_cache_size is not configured locally in the /etc/ceph/ceph.conf file, so I guess it's just the weekly reboot that cleared the memory of cache data... I'm starting to think that a full ceph cluster could probably be the only cause of performance problems. Though I don't know why that would be. Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
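A quick way to check whether the cache is really being used now that the weekly reboot is gone (the daemon name matches our active MDS; adjust as needed):
---
# on the active MDS host
ceph daemon mds.cephmds2 cache status
ceph daemon mds.cephmds2 config get mds_cache_memory_limit
# resident memory of the MDS process
ceph daemon mds.cephmds2 perf dump | grep -A 5 '"mds_mem"'
---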
[ceph-users] Re: ceph slow at 80% full, mds nodes lots of unused memory
On 24/02/2021 12:40, Simon Oosthoek wrote: > Hi > > we've been running our Ceph cluster for nearly 2 years now (Nautilus) > and recently, due to a temporary situation the cluster is at 80% full. > > We are only using CephFS on the cluster. > > Normally, I realize we should be adding OSD nodes, but this is a > temporary situation, and I expect the cluster to go to <60% full quite soon. > > Anyway, we are noticing some really problematic slowdowns. There are > some things that could be related but we are unsure... > > - Our 2 MDS nodes (1 active, 1 standby) are configured with 128GB RAM, > but are not using more than 2GB, this looks either very inefficient, or > wrong ;-) After looking at our monitoring history, it seems the mds cache is actually used more fully, but most of our servers are getting a weekly reboot by default. This clears the mds cache obviously. I wonder if that's a smart idea for an MDS node...? ;-) > > "ceph config dump |grep mds": > mdsbasicmds_cache_memory_limit > 107374182400 > mdsadvanced mds_max_scrub_ops_in_progress 10 > > Perhaps we require more or different settings to properly use the MDS > memory? > > - On all our OSD nodes, the memory line is red in "atop", though no swap > is in use, it seems the memory on the OSD nodes is taking quite a > beating, is this normal, or can we tweak settings to make it less stressed? > > This is the first time we are having performance issues like this, I > think, I'd like to learn some commands to help me analyse this... > > I hope this will ring a bell with someone... > > Cheers > > /Simon > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] ceph slow at 80% full, mds nodes lots of unused memory
Hi we've been running our Ceph cluster for nearly 2 years now (Nautilus) and recently, due to a temporary situation the cluster is at 80% full. We are only using CephFS on the cluster. Normally, I realize we should be adding OSD nodes, but this is a temporary situation, and I expect the cluster to go to <60% full quite soon. Anyway, we are noticing some really problematic slowdowns. There are some things that could be related but we are unsure... - Our 2 MDS nodes (1 active, 1 standby) are configured with 128GB RAM, but are not using more than 2GB, this looks either very inefficient, or wrong ;-) "ceph config dump |grep mds": mds basic mds_cache_memory_limit 107374182400 mds advanced mds_max_scrub_ops_in_progress 10 Perhaps we require more or different settings to properly use the MDS memory? - On all our OSD nodes, the memory line is red in "atop", though no swap is in use, it seems the memory on the OSD nodes is taking quite a beating, is this normal, or can we tweak settings to make it less stressed? This is the first time we are having performance issues like this, I think, I'd like to learn some commands to help me analyse this... I hope this will ring a bell with someone... Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
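On the OSD memory pressure: on Nautilus with BlueStore, the knob that largely determines per-OSD memory use is osd_memory_target. A hedged sketch of checking it, and of lowering it on one RAM-constrained host (the hostname and the 3 GiB value are just examples):
---
ceph config get osd osd_memory_target       # default is 4294967296 (4 GiB)
# to reduce pressure on one node, something like:
# ceph config set osd/host:cephosd1 osd_memory_target 3221225472
---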
[ceph-users] Re: BlueFS spillover detected, why, what?
Hi Michael, thanks for the pointers! This is our first production ceph cluster and we have to learn as we go... Small files is always a problem for all (networked) filesystems, usually it just trashes performance, but in this case it has another unfortunate side effect with the rocksdb :-( Cheers /Simon On 20/08/2020 11:06, Michael Bisig wrote: Hi Simon Unfortunately, the other NVME space is wasted or at least, this is the information we gathered during our research. This fact is due to the RocksDB level management which is explained here (https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think it's a hard limit but it will be something above these values. Also consult this thread (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html). It's probably better to go a bit over these limits to be on the safe side. Exactly, reality is always different. We also struggle with small files which lead to further problems. Accordingly, the right initial setting is pretty important and depends on your individual usecase. Regards, Michael On 20.08.20, 10:40, "Simon Oosthoek" wrote: Hi Michael, thanks for the explanation! So if I understand correctly, we waste 93 GB per OSD on unused NVME space, because only 30GB is actually used...? And to improve the space for rocksdb, we need to plan for 300GB per rocksdb partition in order to benefit from this advantage Reducing the number of small files is something we always ask of our users, but reality is what it is ;-) I'll have to look into how I can get an informative view on these metrics... It's pretty overwhelming the amount of information coming out of the ceph cluster, even when you look only superficially... Cheers, /Simon On 20/08/2020 10:16, Michael Bisig wrote: > Hi Simon > > As far as I know, RocksDB only uses "leveled" space on the NVME partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a limit will automatically end up on slow devices. > In your setup where you have 123GB per OSD that means you only use 30GB of fast device. The DB which spills over this limit will be offloaded to the HDD and accordingly, it slows down requests and compactions. > > You can proof what your OSD currently consumes with: >ceph daemon osd.X perf dump > > Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. This changes regularly because of the ongoing compactions but Prometheus mgr module exports these values such that you can track it. > > Small files generally leads to bigger RocksDB, especially when you use EC, but this depends on the actual amount and file sizes. > > I hope this helps. > Regards, > Michael > > On 20.08.20, 09:10, "Simon Oosthoek" wrote: > > Hi > > Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, > just 2 osd's and I disabled the warning for these osds. > (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false) > > I'm wondering what causes this and how this can be prevented. > > As I understand it the rocksdb for the OSD needs to store more than fits > on the NVME logical volume (123G for 12T OSD). A way to fix it could be > to increase the logical volume on the nvme (if there was space on the > nvme, which there isn't at the moment). 
> > This is the current size of the cluster and how much is free: > > [root@cephmon1 ~]# ceph df > RAW STORAGE: > CLASS SIZEAVAIL USEDRAW USED %RAW USED > hdd 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 > TOTAL 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 > > POOLS: > POOLID STORED OBJECTS USED > %USED MAX AVAIL > cephfs_data 1 572 MiB 121.26M 2.4 GiB > 0 167 TiB > cephfs_metadata 2 56 GiB 5.15M 57 GiB > 0 167 TiB > cephfs_data_3copy8 201 GiB 51.68k 602 GiB > 0.09 222 TiB > cephfs_data_ec8313 643 TiB 279.75M 953 TiB > 58.86 485 TiB > rbd 14 21 GiB 5.66k 64 GiB > 0 222 TiB > .rgw.root 15 1.2 KiB 4
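A small sketch of pulling the values Michael mentions out of a single OSD (osd.125 is the one whose spillover warning was silenced); run it on the node hosting that OSD:
---
ceph daemon osd.125 perf dump | grep -E '"db_total_bytes"|"db_used_bytes"|"slow_used_bytes"'
---
The same counters are exported by the prometheus mgr module, so they can also be graphed over time.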
[ceph-users] Re: BlueFS spillover detected, why, what?
Hi Michael, thanks for the explanation! So if I understand correctly, we waste 93 GB per OSD on unused NVME space, because only 30GB is actually used...? And to improve the space for rocksdb, we need to plan for 300GB per rocksdb partition in order to benefit from this advantage Reducing the number of small files is something we always ask of our users, but reality is what it is ;-) I'll have to look into how I can get an informative view on these metrics... It's pretty overwhelming the amount of information coming out of the ceph cluster, even when you look only superficially... Cheers, /Simon On 20/08/2020 10:16, Michael Bisig wrote: Hi Simon As far as I know, RocksDB only uses "leveled" space on the NVME partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a limit will automatically end up on slow devices. In your setup where you have 123GB per OSD that means you only use 30GB of fast device. The DB which spills over this limit will be offloaded to the HDD and accordingly, it slows down requests and compactions. You can proof what your OSD currently consumes with: ceph daemon osd.X perf dump Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. This changes regularly because of the ongoing compactions but Prometheus mgr module exports these values such that you can track it. Small files generally leads to bigger RocksDB, especially when you use EC, but this depends on the actual amount and file sizes. I hope this helps. Regards, Michael On 20.08.20, 09:10, "Simon Oosthoek" wrote: Hi Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, just 2 osd's and I disabled the warning for these osds. (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false) I'm wondering what causes this and how this can be prevented. As I understand it the rocksdb for the OSD needs to store more than fits on the NVME logical volume (123G for 12T OSD). A way to fix it could be to increase the logical volume on the nvme (if there was space on the nvme, which there isn't at the moment). This is the current size of the cluster and how much is free: [root@cephmon1 ~]# ceph df RAW STORAGE: CLASS SIZEAVAIL USEDRAW USED %RAW USED hdd 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 TOTAL 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 POOLS: POOLID STORED OBJECTS USED %USED MAX AVAIL cephfs_data 1 572 MiB 121.26M 2.4 GiB 0 167 TiB cephfs_metadata 2 56 GiB 5.15M 57 GiB 0 167 TiB cephfs_data_3copy8 201 GiB 51.68k 602 GiB 0.09 222 TiB cephfs_data_ec8313 643 TiB 279.75M 953 TiB 58.86 485 TiB rbd 14 21 GiB 5.66k 64 GiB 0 222 TiB .rgw.root 15 1.2 KiB 4 1 MiB 0 167 TiB default.rgw.control 16 0 B 8 0 B 0 167 TiB default.rgw.meta17 765 B 4 1 MiB 0 167 TiB default.rgw.log 18 0 B 207 0 B 0 167 TiB cephfs_data_ec5720 433 MiB 230 1.2 GiB 0 278 TiB The amount used can still grow a bit before we need to add nodes, but apparently we are running into the limits of our rocskdb partitions. Did we choose a parameter (e.g. minimal object size) too small, so we have too much objects on these spillover OSDs? Or is it that too many small files are stored on the cephfs filesystems? When we expand the cluster, we can choose larger nvme devices to allow larger rocksdb partitions, but is that the right way to deal with this, or should we adjust some parameters on the cluster that will reduce the rocksdb size? 
Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] BlueFS spillover detected, why, what?
Hi Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, just 2 osd's and I disabled the warning for these osds. (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false) I'm wondering what causes this and how this can be prevented. As I understand it the rocksdb for the OSD needs to store more than fits on the NVME logical volume (123G for 12T OSD). A way to fix it could be to increase the logical volume on the nvme (if there was space on the nvme, which there isn't at the moment). This is the current size of the cluster and how much is free: [root@cephmon1 ~]# ceph df RAW STORAGE: CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 TOTAL 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 POOLS: POOL ID STORED OBJECTS USED %USED MAX AVAIL cephfs_data 1 572 MiB 121.26M 2.4 GiB 0 167 TiB cephfs_metadata 2 56 GiB 5.15M 57 GiB 0 167 TiB cephfs_data_3copy 8 201 GiB 51.68k 602 GiB 0.09 222 TiB cephfs_data_ec83 13 643 TiB 279.75M 953 TiB 58.86 485 TiB rbd 14 21 GiB 5.66k 64 GiB 0 222 TiB .rgw.root 15 1.2 KiB 4 1 MiB 0 167 TiB default.rgw.control 16 0 B 8 0 B 0 167 TiB default.rgw.meta 17 765 B 4 1 MiB 0 167 TiB default.rgw.log 18 0 B 207 0 B 0 167 TiB cephfs_data_ec57 20 433 MiB 230 1.2 GiB 0 278 TiB The amount used can still grow a bit before we need to add nodes, but apparently we are running into the limits of our rocksdb partitions. Did we choose a parameter (e.g. minimal object size) too small, so we have too many objects on these spillover OSDs? Or is it that too many small files are stored on the cephfs filesystems? When we expand the cluster, we can choose larger nvme devices to allow larger rocksdb partitions, but is that the right way to deal with this, or should we adjust some parameters on the cluster that will reduce the rocksdb size? Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
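On growing the DB volumes once NVMe space is available again: given the RocksDB level sizes (300 MB / 3 GB / 30 GB / 300 GB) mentioned elsewhere in this thread, the next useful step up from ~123 GB is around 300 GB, and an existing OSD's DB device can be enlarged in place. A rough sketch (the VG/LV names are made up):
---
# stop the OSD, grow its DB logical volume, then let BlueFS pick up the new size
systemctl stop ceph-osd@125
lvextend -L 300G /dev/ceph-db-vg/osd-125-db     # example VG/LV names
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125
systemctl start ceph-osd@125
---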
[ceph-users] Re: Combining erasure coding and replication?
On 27/03/2020 09:56, Eugen Block wrote: > Hi, > >> I guess what you are suggesting is something like k+m with m>=k+2, for >> example k=4, m=6. Then, one can distribute 5 shards per DC and sustain >> the loss of an entire DC while still having full access to redundant >> storage. > > that's exactly what I mean, yes. We have an EC pool of 5+7, which works that way. Currently we have no demand for it, but it should do the job. Cheers /Simon > >> Now, a long time ago I was in a lecture about error-correcting codes >> (Reed-Solomon codes). From what I remember, the computational >> complexity of these codes explodes at least exponentially with m. Out >> of curiosity, how does m>3 perform in practice? What's the CPU >> requirement per OSD? > > Such a setup usually would be considered for archiving purposes so the > performance requirements aren't very high, but so far we haven't heard > any complaints performance-wise. > I don't have details on CPU requirements at hand right now. > > Regards, > Eugen > > > Zitat von Frank Schilder : > >> Dear Eugen, >> >> I guess what you are suggesting is something like k+m with m>=k+2, for >> example k=4, m=6. Then, one can distribute 5 shards per DC and sustain >> the loss of an entire DC while still having full access to redundant >> storage. >> >> Now, a long time ago I was in a lecture about error-correcting codes >> (Reed-Solomon codes). From what I remember, the computational >> complexity of these codes explodes at least exponentially with m. Out >> of curiosity, how does m>3 perform in practice? What's the CPU >> requirement per OSD? >> >> Best regards, >> >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Eugen Block >> Sent: 27 March 2020 08:33:45 >> To: ceph-users@ceph.io >> Subject: [ceph-users] Re: Combining erasure coding and replication? >> >> Hi Brett, >> >>> Our concern with Ceph is the cost of having three replicas. Storage >>> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica >>> if there are ways to do this more efficiently. Site-level redundancy >>> is important to us so we can’t simply create an erasure-coded volume >>> across two buildings – if we lose power to a building, the entire >>> array would become unavailable. >> >> can you elaborate on that? Why is EC not an option? We have installed >> several clusters with two datacenters resilient to losing a whole dc >> (and additional disks if required). So it's basically the choice of >> the right EC profile. Or did I misunderstand something? >> >> >> Zitat von Brett Randall : >> >>> Hi all >>> >>> Had a fun time trying to join this list, hopefully you don’t get >>> this message 3 times! >>> >>> On to Ceph… We are looking at setting up our first ever Ceph cluster >>> to replace Gluster as our media asset storage and production system. >>> The Ceph cluster will have 5pb of usable storage. Whether we use it >>> as object-storage, or put CephFS in front of it, is still TBD. >>> >>> Obviously we’re keen to protect this data well. Our current Gluster >>> setup utilises RAID-6 on each of the nodes and then we have a single >>> replica of each brick. The Gluster bricks are split between >>> buildings so that the replica is guaranteed to be in another >>> premises. By doing it this way, we guarantee that we can have a >>> decent number of disk or node failures (even an entire building) >>> before we lose both connectivity and data. >>> >>> Our concern with Ceph is the cost of having three replicas. 
Storage >>> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica >>> if there are ways to do this more efficiently. Site-level redundancy >>> is important to us so we can’t simply create an erasure-coded volume >>> across two buildings – if we lose power to a building, the entire >>> array would become unavailable. Likewise, we can’t simply have a >>> single replica – our fault tolerance would drop way down on what it >>> is right now. >>> >>> Is there a way to use both erasure coding AND replication at the >>> same time in Ceph to mimic the architecture we currently have in >>> Gluster? I know we COULD just create RAID6 volumes on each node and >>> use the entire volume as a single OSD, but that this is not the >>> recommended way to use Ceph. So is there some other way? >>> >>> Apologies if this is a nonsensical question, I’m still trying to >>> wrap my head around Ceph, CRUSH maps, placement rules, volume types, >>> etc etc! >>> >>> TIA >>> >>> Brett >>> >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io >>> To unsubscribe send an email to ceph-users-le...@ceph.io >> >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To u
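For the archive, a rough sketch of what such a profile can look like (the profile and pool names and the pg counts are illustrative). Note that the rule generated from the profile only spreads shards across hosts; to actually survive losing a whole building, the pool still needs a custom CRUSH rule that places at most 6 of the 12 shards per datacenter, so a full building outage leaves at least k=5 readable shards:
---
# 5+7 erasure-code profile, as used for the pool mentioned above
ceph osd erasure-code-profile set ec-5-7 k=5 m=7 crush-failure-domain=host crush-root=default
ceph osd erasure-code-profile get ec-5-7
ceph osd pool create archive_ec 1024 1024 erasure ec-5-7
---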
[ceph-users] Re: v15.2.0 Octopus released
On 25/03/2020 10:10, konstantin.ilya...@mediascope.net wrote: > That is why i am asking that question about upgrade instruction. > I really don't understand, how to upgrade/reinstall CentOS 7 to 8 without > affecting the work of cluster. > As i know, this process is easier on Debian, but we deployed our cluster > Nautilus on CentOS because there weren't any packages for 14.x for Debian > Stretch (9) or Buster(10). > P.s.: if this is even possible, i would like to know how to upgrade servers > with CentOs7 + ceph 14.2.8 to Debian 10 with ceph 15.2.0 (we have servers > with OSD only and 3 servers with Mon/Mgr/Mds) > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > I guess you could upgrade each node one by one. So upgrade/reinstall the OS, install Ceph 15 and re-initialise the OSDs if necessary. Though it would be nice if there was a way to re-integrate the OSDs from the previous installation... Personally, I'm planning to wait for a while to upgrade to Ceph 15, not least because it's not convenient to do stuff like OS upgrades from home ;-) Currently we're running ubuntu 18.04 on the ceph nodes, I'd like to upgrade to ubuntu 20.04 and then to ceph 15. Cheers /Simon ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
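On re-integrating OSDs after an OS reinstall: as long as the data disks (and their DB/WAL devices) are left untouched, ceph-volume can rediscover the existing LVM-based OSDs. A hedged sketch of the per-node steps:
---
# after reinstalling the OS and installing the matching ceph packages:
# 1. restore /etc/ceph/ceph.conf and the OSD bootstrap keyring from backup
# 2. let ceph-volume find and start the existing OSDs
ceph-volume lvm list
ceph-volume lvm activate --all
systemctl enable --now ceph-osd.target
---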