[ceph-users] Re: Write issues on CephFS mounted with root_squash
Hi Nicola, Yes, this issue is already fixed in main [1] and the Quincy backport [2] is still pending to be merged. Hopefully it will be available in the next Quincy release. [1] https://github.com/ceph/ceph/pull/48027 [2] https://github.com/ceph/ceph/pull/54469 Thanks and Regards, Kotresh H R On Wed, May 15, 2024 at 7:51 PM Fabien Sirjean wrote: > Hi, > > We have the same issue. It seems to come from this bug: > https://access.redhat.com/solutions/6982902 > > We had to disable root_squash, which of course is a huge issue... > > Cheers, > > Fabien > > > On 5/15/24 12:54, Nicola Mori wrote: > > Dear Ceph users, > > > > I'm trying to export a CephFS with the root_squash option. This is the > > client configuration: > > > > client.wizardfs_rootsquash > > key: > > caps: [mds] allow rw fsname=wizardfs root_squash > > caps: [mon] allow r fsname=wizardfs > > caps: [osd] allow rw tag cephfs data=wizardfs > > > > I can mount it flawlessly on several machines using the kernel driver, > > but when a machine writes to it, the content seems fine from the > > writing machine, but it's not actually written to disk, since other > > machines just see an empty file: > > > > [12:43 mori@stryke ~]$ echo test > /wizard/ceph/software/el9/test > > [12:43 mori@stryke ~]$ ll /wizard/ceph/software/el9/test > > -rw-r--r-- 1 mori wizard 5 mag 15 12:43 /wizard/ceph/software/el9/test > > [12:43 mori@stryke ~]$ cat /wizard/ceph/software/el9/test > > test > > [12:43 mori@stryke ~]$ > > > > [mori@fili ~]$ ll /wizard/ceph/software/el9/test > > -rw-r--r--. 1 mori 1014 0 May 15 06:43 /wizard/ceph/software/el9/test > > [mori@fili ~]$ cat /wizard/ceph/software/el9/test > > [mori@fili ~]$ > > > > After unmounting and then remounting on "stryke", the file is seen as empty, > > so I guess that the content shown just after the write is only a cache > > effect and nothing is effectively written to disk. I checked the POSIX > > permissions on the folder and I have rw rights from both machines. 
> > > > All of the above using Ceph 18.2.2 on the cluster (deployed with > > cephadm) and on both machines. Machine "fili" has kernel 5.14.0 while > > "stryke" has 6.8.9. The same issue happens consistently also in the > > reverse direction (writing from "fili" and reading from "stryke"), and > > also with other machines. > > > > Removing the root_squash option, the problem vanishes. > > > > I don't know what might > > > > > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
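For readers reproducing the setup above: the caps Nicola lists are the kind generated by `ceph fs authorize`. A minimal sketch (entity and filesystem names taken from the post; `root_squash` in the authorize grammar is available from Quincy onwards, so treat the exact syntax as illustrative for your release):

```shell
# Create a squashed client for the 'wizardfs' filesystem:
ceph fs authorize wizardfs client.wizardfs_rootsquash / rw root_squash

# Verify the caps the cluster actually stored:
ceph auth get client.wizardfs_rootsquash

# Workaround reported in the thread until the backport lands:
# re-grant the same caps without root_squash.
ceph auth caps client.wizardfs_rootsquash \
  mds 'allow rw fsname=wizardfs' \
  mon 'allow r fsname=wizardfs' \
  osd 'allow rw tag cephfs data=wizardfs'
```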
[ceph-users] MDS behind on trimming every 4-5 weeks causing issue for ceph filesystem
Hi, We are using rook-ceph with operator 1.10.8 and Ceph 17.2.5. We are using a Ceph filesystem with 4 MDS, i.e. 2 active & 2 standby. Every 3-4 weeks the filesystem has an issue, i.e. in ceph status we can see the below warnings: 2 MDS report slow requests 2 MDS behind on trimming mds.myfs-a(mds.1): behind on trimming (6378/128) max_segments: 128, num_segments: 6378 mds.myfs-c(mds.1): behind on trimming (6560/128) max_segments: 128, num_segments: 6560 To fix it, we have to restart all MDS pods one by one. This is happening every 4-5 weeks. We have seen many Ceph issues related to this on the Ceph tracker, and many people suggest increasing mds_cache_memory_limit. Currently for our cluster *mds_cache_memory_limit* is set to the default 4GB and *mds_log_max_segments* is set to the default 128. Should we increase *mds_cache_memory_limit* to 8GB from the default 4GB, or is there any solution to fix this issue permanently? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
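If you want to try the cache-limit route first, the knobs mentioned above can be changed at runtime without restarting anything; a hedged sketch (run from the rook toolbox pod; the 8 GiB value is the suggestion from the post, not a verified fix):

```shell
# Current values (defaults: 4 GiB cache, 128 log segments):
ceph config get mds mds_cache_memory_limit
ceph config get mds mds_log_max_segments

# Raise the MDS cache limit to 8 GiB (value is in bytes: 8 * 1024^3):
ceph config set mds mds_cache_memory_limit 8589934592

# Then watch whether num_segments actually drains back under the limit,
# instead of restarting the MDS pods:
ceph health detail | grep -i trim
```

Note the cache limit mainly helps if the MDS is stalling on cache pressure; if trimming falls behind for other reasons (e.g. slow metadata pool OSDs), raising it only delays the warning.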
[ceph-users] Re: cephfs-data-scan orphan objects while mds active?
It's unfortunately more complicated than that. I don't think that forward scrub tag gets persisted to the raw objects; it's just a notation for you. And even if it was, it would only be on the first object in every file — larger files would have many more objects forward scrub doesn't touch. This isn't a case anybody has really built tooling for. Your best bet is probably to live with the data leakage, or else find a time to turn it off and run the data-scan tools. -Greg On Tue, May 14, 2024 at 10:26 AM Olli Rajala wrote: > > Tnx Gregory, > > Doesn't sound too safe then. > > Only reason to discover these orphans via scanning would be to delete the > files again and I know all these files were at least one year old... so, I > wonder if I could somehow do something like: > 1) do forward scrub with a custom tag > 2) iterate over all the objects in the pool and delete all objects without > the tag and older than one year > > Is there any tooling to do such an operation? Any risks or flawed logic > there? > > ...or any other ways to discover and get rid of these objects? > > Cheers! > --- > Olli Rajala - Lead TD > Anima Vitae Ltd. > www.anima.fi > --- > > > On Tue, May 14, 2024 at 9:41 AM Gregory Farnum wrote: > > > The cephfs-data-scan tools are built with the expectation that they'll > > be run offline. Some portion of them could be run without damaging the > > live filesystem (NOT all, and I'd have to dig in to check which is > > which), but they will detect inconsistencies that don't really exist > > (due to updates that are committed to the journal but not fully > > flushed out to backing objects) and so I don't think it would do any > > good. > > -Greg > > > > On Mon, May 13, 2024 at 4:33 AM Olli Rajala wrote: > > > > > > Hi, > > > > > > I suspect that I have some orphan objects on a data pool after quite > > > haphazardly evicting and removing a cache pool after deleting 17TB of > > files > > > from cephfs. 
I have forward scrubbed the mds and the filesystem is in > > clean > > > state. > > > > > > This is a production system and I'm curious if it would be safe to > > > run cephfs-data-scan scan_extents and scan_inodes while the fs is online? > > > Does it help if I give a custom tag while forward scrubbing and then > > > use --filter-tag on the backward scans? > > > > > > ...or is there some other way to check and cleanup orphans? > > > > > > tnx, > > > --- > > > Olli Rajala - Lead TD > > > Anima Vitae Ltd. > > > www.anima.fi > > > --- > > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
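For reference, the offline path Greg points to ("find a time to turn it off and run the data-scan tools") is documented under the CephFS disaster-recovery-experts docs; a rough outline only (pool and fs names are placeholders, the full procedure has more steps, and it should not be improvised on a production filesystem):

```shell
# Take the filesystem down cleanly so no journal updates are in flight:
ceph fs set <fs_name> down true

# Scan the raw data pool; both steps can be parallelized across hosts
# with --worker_n/--worker_m:
cephfs-data-scan scan_extents <data_pool>
cephfs-data-scan scan_inodes <data_pool>

# Bring the filesystem back up:
ceph fs set <fs_name> down false
```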
[ceph-users] Re: Please discuss about Slow Peering
If using jumbo frames, also ensure that they're consistently enabled on all OS instances and network devices. > On May 16, 2024, at 09:30, Frank Schilder wrote: > > This is a long shot: if you are using octopus, you might be hit by this > pglog-dup problem: > https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't > mention slow peering explicitly in the blog, but its also a consequence > because the up+acting OSDs need to go through the PG_log during peering. > > We are also using octopus and I'm not sure if we have ever seen slow ops > caused by peering alone. It usually happens when a disk cannot handle load > under peering. We have, unfortunately, disks that show random latency spikes > (firmware update pending). You can try to monitor OPS latencies for your > drives when peering and look for something that sticks out. People on this > list were reporting quite bad results for certain infamous NVMe brands. If > you state your model numbers, someone else might recognize it. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: 서민우 > Sent: Thursday, May 16, 2024 7:39 AM > To: ceph-users@ceph.io > Subject: [ceph-users] Please discuss about Slow Peering > > Env: > - OS: Ubuntu 20.04 > - Ceph Version: Octopus 15.0.0.1 > - OSD Disk: 2.9TB NVMe > - BlockStorage (Replication 3) > > Symptom: > - Peering when OSD's node up is very slow. Peering speed varies from PG to > PG, and some PG may even take 10 seconds. But, there is no log for 10 > seconds. > - I checked the effect of client VM's. Actually, Slow queries of mysql > occur at the same time. > > There are Ceph OSD logs of both Best and Worst. 
> > Best Peering Case (0.5 Seconds) > 2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8] > state: transitioning to Primary > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] > state: Peering, affected_by_map, going to Reset > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] > start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11], > acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] > state: transitioning to Primary > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] > start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11], > acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting > > Worst Peering Case (11.6 Seconds) > 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] > state: transitioning to Stray > 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] > start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1], > acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting > 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] > state: transitioning to Stray > 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] > start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1], > acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting > 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] > state: transitioning to Stray > 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] > start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1], > acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting > > *I wish to know about* > - Why some PG's take 10 seconds until Peering finishes. > - Why Ceph log is quiet during peering. 
> - Is this symptom intended in Ceph. > > *And please give some advice,* > - Is there any way to improve peering speed? > - Or, Is there a way to not affect the client when peering occurs? > > P.S > - I checked the symptoms in the following environments. > -> Octopus Version, Reef Version, Cephadm, Ceph-Ansible > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
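To act on Frank's suggestion of watching drive latencies during peering, and to check for the pglog-dup condition from the linked Clyso post, something like the following can be a starting point (the OSD id and PG id are taken from the logs above; the objectstore-tool command requires the OSD to be stopped first):

```shell
# Cluster-wide view of per-OSD commit/apply latencies; look for outliers
# that spike while PGs are peering:
ceph osd perf

# Ops currently stuck on a suspect OSD:
ceph daemon osd.7 dump_ops_in_flight

# Count pg_log dups offline (the Octopus pglog-dup check; stop osd.7 first):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
    --op log --pgid 30.20 | jq '.pg_log_t.dups | length'
```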
[ceph-users] Re: cephadm basic questions: image config, OS reimages
At least for the current up-to-date reef branch (not sure what reef version you're on), when --image is not provided to the shell, it should try to infer the image in this order:

1. from the CEPHADM_IMAGE env variable
2. if you pass --name with a daemon name to the shell command, it will try to get the image that daemon uses
3. next it tries to find the image being used by any ceph container on the host
4. the most recently built ceph image it can find on the host (by the CreatedAt metadata field for the image)

There is a `ceph cephadm osd activate <host>` command that is meant to do something similar in terms of OSD activation. If I'm being honest, I haven't looked at it in some time, but it does have some CI test coverage via https://github.com/ceph/ceph/blob/main/qa/suites/orch/cephadm/osds/2-ops/rmdir-reactivate.yaml

On Thu, May 16, 2024 at 11:45 AM Matthew Vernon wrote: > Hi, > > I've some experience with Ceph, but haven't used cephadm much before, > and am trying to configure a pair of reef clusters with cephadm. A > couple of newbie questions, if I may: > > * cephadm shell image > > I'm in an isolated environment, so pulling from a local repository. I > bootstrapped OK with > cephadm --image docker-registry.wikimedia.org/ceph bootstrap ... > > And that worked nicely, but if I want to run cephadm shell (to do any > sort of admin), then I have to specify > cephadm --image docker-registry.wikimedia.org/ceph shell > > (otherwise it just hangs failing to talk to quay.io). > > I found the docs, which refer to setting lots of other images, but not > the one that cephadm uses: > > https://docs.ceph.com/en/reef/cephadm/install/#deployment-in-an-isolated-environment > > I found an old tracker in this area: https://tracker.ceph.com/issues/47274 > > ...but is there a good way to arrange for cephadm to use the > already-downloaded image without having to remember to specify --image > each time? 
> > * OS reimages > > We do OS upgrades by reimaging the server (which doesn't touch the > storage disks); on an old-style deployment you could then use > ceph-volume to re-start the OSDs and away you went; how does one do this > in a cephadm cluster? > [I presume involves telling cephadm to download a new image for podman > to use and suchlike] > > Would the process be smoother if we arranged to leave /var/lib/ceph > intact between reimages? > > Thanks, > > Matthew > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
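Putting the inference order from the reply to use, the least-typing options are probably the first two; a sketch, assuming the registry path from Matthew's post:

```shell
# Option 1: pin the image via the environment (e.g. in /etc/profile.d),
# so plain 'cephadm shell' never falls back to quay.io:
export CEPHADM_IMAGE=docker-registry.wikimedia.org/ceph
cephadm shell

# Option 2: name a running daemon and let cephadm reuse that daemon's image:
cephadm shell --name mon.$(hostname -s)
```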
[ceph-users] Re: cephadm basic questions: image config, OS reimages
On 5/16/24 17:50, Robert Sander wrote: cephadm osd activate HOST would re-activate the OSDs. Small but important typo: It's ceph cephadm osd activate HOST Regards -- Robert Sander Heinlein Consulting GmbH Schwedter Str. 8/9b, 10119 Berlin https://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Amtsgericht Berlin-Charlottenburg - HRB 220009 B Geschäftsführer: Peer Heinlein - Sitz: Berlin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm basic questions: image config, OS reimages
Hi, On 5/16/24 17:44, Matthew Vernon wrote: cephadm --image docker-registry.wikimedia.org/ceph shell ...but is there a good way to arrange for cephadm to use the already-downloaded image without having to remember to specify --image each time? You could create a shell alias: alias cephshell="cephadm --image docker-registry.wikimedia.org/ceph shell" * OS reimages We do OS upgrades by reimaging the server (which doesn't touch the storage disks); on an old-style deployment you could then use ceph-volume to re-start the OSDs and away you went; how does one do this in a cephadm cluster? [I presume involves telling cephadm to download a new image for podman to use and suchlike] cephadm osd activate HOST would re-activate the OSDs. Before doing maintenance on a host run ceph orch host maintenance enter HOST and the orchestrator will stop the OSDs and set them to noout and will try to move other services away from the host if possible. Regards -- Robert Sander Heinlein Consulting GmbH Schwedter Str. 8/9b, 10119 Berlin https://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Amtsgericht Berlin-Charlottenburg - HRB 220009 B Geschäftsführer: Peer Heinlein - Sitz: Berlin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
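Putting the two replies together, a reimage cycle would look roughly like this (a sketch assuming the host keeps its hostname and SSH access so the orchestrator can reach it again, and that the OSD data disks are untouched):

```shell
# Before taking the host down: stop daemons, set noout, drain services:
ceph orch host maintenance enter HOST

# ... reimage the OS, reinstall podman/cephadm, leave the data disks alone ...

# After the host is back:
ceph orch host maintenance exit HOST
ceph cephadm osd activate HOST   # re-create containers for the existing OSDs
```

Preserving /var/lib/ceph across the reimage keeps daemon configs and keyrings in place, which should make the reactivation step less work, but activation is still needed to recreate the systemd units and containers.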
[ceph-users] cephadm basic questions: image config, OS reimages
Hi, I've some experience with Ceph, but haven't used cephadm much before, and am trying to configure a pair of reef clusters with cephadm. A couple of newbie questions, if I may: * cephadm shell image I'm in an isolated environment, so pulling from a local repository. I bootstrapped OK with cephadm --image docker-registry.wikimedia.org/ceph bootstrap ... And that worked nicely, but if I want to run cephadm shell (to do any sort of admin), then I have to specify cephadm --image docker-registry.wikimedia.org/ceph shell (otherwise it just hangs failing to talk to quay.io). I found the docs, which refer to setting lots of other images, but not the one that cephadm uses: https://docs.ceph.com/en/reef/cephadm/install/#deployment-in-an-isolated-environment I found an old tracker in this area: https://tracker.ceph.com/issues/47274 ...but is there a good way to arrange for cephadm to use the already-downloaded image without having to remember to specify --image each time? * OS reimages We do OS upgrades by reimaging the server (which doesn't touch the storage disks); on an old-style deployment you could then use ceph-volume to re-start the OSDs and away you went; how does one do this in a cephadm cluster? [I presume involves telling cephadm to download a new image for podman to use and suchlike] Would the process be smoother if we arranged to leave /var/lib/ceph intact between reimages? Thanks, Matthew ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Please discuss about Slow Peering
This is a long shot: if you are using octopus, you might be hit by this pglog-dup problem: https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't mention slow peering explicitly in the blog, but it's also a consequence because the up+acting OSDs need to go through the PG_log during peering. We are also using octopus and I'm not sure if we have ever seen slow ops caused by peering alone. It usually happens when a disk cannot handle load under peering. We have, unfortunately, disks that show random latency spikes (firmware update pending). You can try to monitor OPS latencies for your drives when peering and look for something that sticks out. People on this list were reporting quite bad results for certain infamous NVMe brands. If you state your model numbers, someone else might recognize it. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: 서민우 Sent: Thursday, May 16, 2024 7:39 AM To: ceph-users@ceph.io Subject: [ceph-users] Please discuss about Slow Peering Env: - OS: Ubuntu 20.04 - Ceph Version: Octopus 15.0.0.1 - OSD Disk: 2.9TB NVMe - BlockStorage (Replication 3) Symptom: - Peering when an OSD's node comes up is very slow. Peering speed varies from PG to PG, and some PGs may even take 10 seconds. But there is no log for those 10 seconds. - I checked the effect on client VMs. Actually, slow queries in MySQL occur at the same time. There are Ceph OSD logs of both Best and Worst. 
Best Peering Case (0.5 Seconds) 2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8] state: transitioning to Primary 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] state: Peering, affected_by_map, going to Reset 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11], acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] state: transitioning to Primary 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11], acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting Worst Peering Case (11.6 Seconds) 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1], acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting *I wish to know about* - Why some PG's take 10 seconds until Peering finishes. - Why Ceph log is quiet during peering. - Is this symptom intended in Ceph. 
*And please give some advice,* - Is there any way to improve peering speed? - Or, Is there a way to not affect the client when peering occurs? P.S - I checked the symptoms in the following environments. -> Octopus Version, Reef Version, Cephadm, Ceph-Ansible ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Reef: RGW Multisite object fetch limits
Den tors 16 maj 2024 kl 07:47 skrev Jayanth Reddy : > > Hello Community, > In addition, we've 3+ Gbps links and the average object size is 200 > kilobytes. So the utilization is about 300 Mbps to ~ 1.8 Gbps and not more > than that. > We seem to saturate the link when the secondary zone fetches bigger objects > sometimes but the objects per second always seem to be 1k to 1.5k per > second. Is it possible that the small object sizes makes it impossible for the replication to get any decent speed? If it makes a new tcp connection for every S3 object, then round-trip-times and the small sizes would make it impossible to get up to decent speed over the network before the object is finished, and then it restarts again with a new object with a new slow start and so on. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
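If per-object round trips are the bottleneck, the multisite sync concurrency knobs are worth checking before blaming the link itself; a hedged sketch (the defaults noted are from recent releases -- verify against your version, change one option at a time, and watch `radosgw-admin sync status` while you do):

```shell
# How many object fetches a data-sync shard keeps in flight:
ceph config get client.rgw rgw_data_sync_spawn_window     # default ~200
ceph config get client.rgw rgw_bucket_sync_spawn_window   # default ~20

# Example experiment: double the data-sync window on the secondary
# zone's RGWs, then observe whether objects/s improves:
ceph config set client.rgw rgw_data_sync_spawn_window 400
```

With ~200 KB objects, throughput is governed far more by per-object latency times concurrency than by link bandwidth, which matches the observed 1k-1.5k objects/s ceiling.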