[ceph-users] Cannot get backfill speed up
Hi. Fresh cluster - but despite setting:

jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
osd_recovery_max_active_ssd    50     mon    default[20]
jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
osd_max_backfills              100    mon    default[10]

I still get:

jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
    id:     5c384430-da91-11ed-af9c-c780a5227aff
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 16h)
    mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
    mds: 2/2 daemons up, 1 standby
    osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 495 pgs
    objects: 24.85M objects, 60 TiB
    usage:   117 TiB used, 159 TiB / 276 TiB avail
    pgs:     10655690/145764002 objects misplaced (7.310%)
             474 active+clean
             15  active+remapped+backfilling
             6   active+remapped+backfill_wait

  io:
    client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
    recovery: 328 MiB/s, 108 objects/s

  progress:
    Global Recovery Event (9h)
      [==..] (remaining: 25m)

With these numbers for the settings I would expect to get more than 15 PGs actively backfilling (and based on the SSDs and the 2x25Gbit network, I can also spend more resources on recovery than 328 MiB/s).

Thanks.

--
Jesper Krogh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
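One way to frame the expectation: each backfilling PG holds a backfill reservation on its primary and on each backfill target, so the cluster-wide number of simultaneously backfilling PGs is bounded by the per-OSD osd_max_backfills multiplied by the OSD count, divided by the reservations each PG consumes. The sketch below is a deliberate simplification of Ceph's actual reservation logic, for rough intuition only:

```python
# Back-of-the-envelope bound on concurrently backfilling PGs.
# Simplification: each backfilling PG consumes one reservation on its
# primary plus one per backfill target; Ceph's real scheduler is more
# involved than this.

def max_concurrent_backfills(num_osds: int, osd_max_backfills: int,
                             slots_per_pg: int) -> int:
    """Upper bound on simultaneously backfilling PGs.

    slots_per_pg: reservations one PG consumes
                  (e.g. 1 primary + 1 target = 2).
    """
    total_slots = num_osds * osd_max_backfills
    return total_slots // slots_per_pg

# 40 OSDs with the *default* osd_max_backfills of 1, 2 slots per PG:
print(max_concurrent_backfills(40, 1, 2))    # bound of 20
# Same cluster with osd_max_backfills actually in effect at 100:
print(max_concurrent_backfills(40, 100, 2))  # bound of 2000
```

Note that with the default osd_max_backfills of 1 the bound (20) is close to what the status output shows, which would suggest the raised value isn't actually taking effect on the OSDs. If I recall correctly, on Quincy the mClock op scheduler is the default and can govern recovery/backfill limits independently of these options, so `ceph config show osd.0 osd_op_queue` may be worth checking too.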
[ceph-users] Re: Rook on bare-metal?
Morning,

we are running some ceph clusters with rook on bare metal and can very much recommend it. You should have proper k8s knowledge, knowing how to change objects such as configmaps or deployments, in case things go wrong.

In regards to stability, the rook operator is written rather defensively, not changing monitors or the cluster if quorum is not met, and checking the osd status on removal/adding of osds.

So TL;DR: very much usable and rather k8s native.

BR,

Nico

zs...@tuta.io writes:
> Hello!
>
> I am looking to simplify ceph management on bare-metal by deploying
> Rook onto kubernetes that has been deployed on bare metal (rke). I
> have used rook in a cloud environment but I have not used it on
> bare-metal. I am wondering if anyone here runs rook on bare-metal?
> Would you recommend it over cephadm or would you steer clear of it?

--
Sustainable and modern Infrastructures by ungleich.ch
[ceph-users] pg_num != pgp_num - and unable to change.
Hi. Fresh cluster - after a dance where the autoscaler did not work (returned blank output) as described in the doc, I now seemingly have it working. It has bumped the target to something reasonable and is slowly incrementing pg_num and pgp_num by 2 over time (hope this is correct?) But:

jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application cephfs

pg_num = 150, pgp_num = 22, and setting pgp_num seemingly has zero effect on the system - not even with autoscaling set to off:

jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_autoscale_mode off
set pool 22 pg_autoscale_mode to off
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pgp_num 150
set pool 22 pgp_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_num_min 128
set pool 22 pg_num_min to 128
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_num 150
set pool 22 pg_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data pg_autoscale_mode on
set pool 22 pg_autoscale_mode to on
jskr@dkcphhpcmgt028:/$ sudo ceph progress
PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
    []
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application cephfs

pgp_num != pg_num ?
In earlier versions of ceph (without the autoscaler) I have only experienced that setting pg_num and pgp_num took immediate effect?

Jesper

jskr@dkcphhpcmgt028:/$ sudo ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
jskr@dkcphhpcmgt028:/$ sudo ceph health
HEALTH_OK
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
    id:     5c384430-da91-11ed-af9c-c780a5227aff
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 15h)
    mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
    mds: 2/2 daemons up, 1 standby
    osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 495 pgs
    objects: 24.85M objects, 60 TiB
    usage:   117 TiB used, 158 TiB / 276 TiB avail
    pgs:     13494029/145763897 objects misplaced (9.257%)
             462 active+clean
             23  active+remapped+backfilling
             10  active+remapped+backfill_wait

  io:
    client:   0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
    recovery: 705 MiB/s, 208 objects/s

  progress:

--
Jesper Krogh
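The lag between pg_num and pgp_num is expected on recent releases: since Nautilus, `ceph osd pool set ... pgp_num` only adjusts pgp_num_target, and the mons walk the actual pgp_num toward that target in small steps, throttled by how much data is currently misplaced (target_max_misplaced_ratio, default 5% if I recall correctly) - and the status above shows ~9% misplaced, which would stall further steps until backfill catches up. A toy sketch of that stepwise convergence; the fixed step of 2 is an illustrative assumption taken from the observed behaviour, not Ceph's actual step computation:

```python
def converge(pgp_num: int, target: int, step: int = 2) -> list:
    """Toy model of the mgr walking pgp_num toward pgp_num_target in
    small increments ("incrementing pg_num and pgp_num by 2 over time").
    Ceph actually sizes each step from target_max_misplaced_ratio and
    current misplacement, pausing when too much data is misplaced."""
    history = [pgp_num]
    while pgp_num != target:
        pgp_num = min(target, pgp_num + step)
        history.append(pgp_num)
    return history

print(converge(22, 30))  # [22, 24, 26, 28, 30]
```

So "pgp_num 22 / pgp_num_target 512" likely just means the walk is paused, not that the set command was ignored.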
[ceph-users] CLT Meeting minutes 2023-07-05
Hello!

Releasing Reef
--------------
* RC2 is out but we still have several PRs to go, including blockers.
* RC3 might be worth doing, but Reef shall go out before the end of the month.

Misc
----
* For the sake of unit testing of dencoder interoperability, we're going to impose some extra work (like registering types within ceph-dencoder) on developers writing encodable structs. This will be discussed further in a CDM.
* A lab issue got fixed.

Regards
Radek
[ceph-users] Rook on bare-metal?
Hello!

I am looking to simplify ceph management on bare-metal by deploying Rook onto kubernetes that has been deployed on bare metal (rke). I have used rook in a cloud environment but I have not used it on bare-metal. I am wondering if anyone here runs rook on bare-metal? Would you recommend it over cephadm or would you steer clear of it?
[ceph-users] ceph quota qustion
Hi,

I contact you with a question about quota. The situation is as follows:

1. I set the user quota to 10M.
2. Using s3 browser, I upload one 12M file.
3. The upload failed as I wished, but some objects remain in the pool (almost 10M) and the s3 browser doesn't show the failed file.

I expected nothing to be left in Ceph. My question is: can the user or admin remove the remaining objects?
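If the leftover space comes from the failed upload's parts (S3 clients typically use multipart uploads for a 12M file, and an interrupted multipart upload leaves its parts stored but invisible in object listings), the user can usually free it over plain S3 by aborting incomplete multipart uploads; the admin side can additionally run RGW garbage collection. A hedged boto3-style sketch of the user-side cleanup - endpoint, credentials and bucket name are placeholders, not values from this thread:

```python
# Sketch: abort every unfinished multipart upload in a bucket, freeing
# parts that still count against the user's quota. The client object only
# needs the standard S3 API calls list_multipart_uploads and
# abort_multipart_upload (as provided by e.g. a boto3 S3 client).

def abort_incomplete_uploads(client, bucket: str) -> int:
    """Abort all unfinished multipart uploads in `bucket`.
    Returns how many uploads were aborted."""
    aborted = 0
    resp = client.list_multipart_uploads(Bucket=bucket)
    for upload in resp.get("Uploads", []):
        client.abort_multipart_upload(Bucket=bucket, Key=upload["Key"],
                                      UploadId=upload["UploadId"])
        aborted += 1
    return aborted

# Wiring it to a real RGW endpoint would look like (placeholders throughout):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080",
#                     aws_access_key_id="ACCESS", aws_secret_access_key="SECRET")
#   abort_incomplete_uploads(s3, "my-bucket")
```

Note the space may not be returned immediately even after the abort, since RGW reclaims it asynchronously via garbage collection.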
[ceph-users] Erasure coding and backfilling speed
Hi. I have a Ceph (NVMe) based cluster with 12 hosts and 40 OSDs. Currently it is backfilling pgs, but I cannot get it to run more than 20 backfilling pgs at the same time (6+2 profile).

osd_max_backfills = 100 and osd_recovery_max_active_ssd = 50 (non-sane), but it still stops at 20 with 40+ in backfill_wait.

Any idea about how to speed it up?

Thanks.
[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device
Hi Matthew,

I see "rbd with pwl cache: 5210112 ns". This latency is beyond my expectations and I believe it is unlikely to occur. In theory, this value should be around a few hundred microseconds. But I'm not sure what went wrong in your steps. Can you use perf for latency analysis?

Hi @Ilya Dryomov, do you have any suggestions?

Perf, some commands:

admin_socket = /mnt/pmem/cache.asok
ceph --admin-daemon /mnt/pmem/cache.asok perf reset all
ceph --admin-daemon /mnt/pmem/cache.asok perf dump

-----Original Message-----
From: Matthew Booth
Sent: Monday, July 3, 2023 6:09 PM
To: Yin, Congmin
Cc: Ilya Dryomov ; Giulio Fidente ; Tang, Guifeng ; Vikhyat Umrao ; Jdurgin ; John Fulton ; Francesco Pantano ; ceph-users@ceph.io
Subject: Re: [ceph-users] RBD with PWL cache shows poor performance compared to cache device

On Fri, 30 Jun 2023 at 08:50, Yin, Congmin wrote:
>
> Hi Matthew,
>
> Due to the latency of rbd layers, the write latency of the pwl cache is more
> than ten times that of the raw device.
> I replied directly below the 2 questions.
>
> Best regards.
> Congmin Yin
>
> -----Original Message-----
> From: Matthew Booth
> Sent: Thursday, June 29, 2023 7:23 PM
> To: Ilya Dryomov
> Cc: Giulio Fidente ; Yin, Congmin ; Tang, Guifeng ;
> Vikhyat Umrao ; Jdurgin ; John Fulton ; Francesco Pantano ;
> ceph-users@ceph.io
> Subject: Re: [ceph-users] RBD with PWL cache shows poor performance
> compared to cache device
>
> On Wed, 28 Jun 2023 at 22:44, Ilya Dryomov wrote:
> >> ** TL;DR
> >>
> >> In testing, the write latency performance of a PWL-cache backed RBD
> >> disk was 2 orders of magnitude worse than the disk holding the PWL
> >> cache.
> >
> > PWL cache can use pmem or SSD as cache devices.
Using PMEM, based on
> my test environment at that time, I can give specific data as follows:
> the write latency of the pmem raw device is about 10+us, the write
> latency of the pwl cache is about 100+us (from the latency of the rbd
> layers), and the write latency of the ceph cluster is about
> 1000+us (from messengers and network). But for SSDs, there are many
> types, and I cannot provide a specific value, but it will definitely
> be worse than pmem. So, for a phenomenon that is 2 orders of magnitude
> lower, it is worse than expected. Can you provide detailed values of
> the three for analysis? (SSD, pwl cache, ceph cluster)

I'm not entirely sure what you're asking for. Which values are you looking for? I did provide 3 sets of test results below, is that what you mean?

* rbd no cache: 1417216 ns
* pwl cache device: 44288 ns
* rbd with pwl cache: 5210112 ns

These are all outputs from the benchmarking test. The first is executing in the VM writing to a ceph RBD disk *without* PWL. The second is executing on the host writing directly to the SSD which is being used for the PWL cache. The third is executing in the VM writing to the same ceph RBD disk, but this time *with* PWL.

Incidentally, the client and server machines are identical, and the SSD used by the client for PWL is the same model used on the server as the OSDs. The SSDs are SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC H730P Mini (Embedded).

> ==
>
> >> ** Summary
> >>
> >> I was hoping that PWL cache might be a good solution to the problem
> >> of write latency requirements of etcd when running a kubernetes
> >> control plane on ceph. Etcd is extremely write latency sensitive
> >> and becomes unstable if write latency is too high. The etcd
> >> workload can be characterised by very small (~4k) writes with a queue
> >> depth of 1. Throughput, even on a busy system, is normally very low.
As etcd is
> >> distributed and can safely handle the loss of un-flushed data from
> >> a single node, a local ssd PWL cache for etcd looked like an ideal
> >> solution.
> >
> > Right, this is exactly the use case that the PWL cache is supposed to
> > address.
>
> Good to know!
>
> >> My expectation was that adding a PWL cache on a local SSD to an
> >> RBD-backed VM would improve write latency to something approaching the
> >> write latency performance of the local SSD. However, in my testing,
> >> adding a PWL cache to an rbd-backed VM increased write latency by
> >> approximately 4x over not using a PWL cache. This was over 100x
> >> more than the write latency performance of the underlying SSD.
>
> When using an image as the VM's disk, you may have used commands like the
> following. In many cases, using parameters such as writeback will force the
> start of rbd cache, which is a memory cache. It is normal for pwl cache to be
> several times slower than it. Please confirm.
> There is currently no parameter support for using only pwl cache instead of
> rbd cache. I have tested the latency of using pwl cache (pmem) by modifying
> the code myself, which is about twice as high as using rbd cache.
>
> qemu -m 1024 -drive >
[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR
I've met this issue when trying to upgrade octopus 15.2.17 to 16.2.13 last night. The upgrade process failed at the mgr module phase after the new MGR version became active.

I tried to enable debug logging with `ceph config set mgr mgr/cephadm/log_to_cluster_level debug` and I saw a message like @xadhoom76's about the config_key "registry_credentials".

I guessed the root cause was this line `json.loads(str(self.mgr.get_store('registry_credentials'` and the key_store having a wrong value. I then got an empty value when running "ceph config-key dump | grep 'registry_credentials'" and the same for "ceph config-key get mgr/cephadm/registry_credentials".

By checking the `cephadm` source I see the value should be in a JSON format like this:

ceph config-key set mgr/cephadm/registry_credentials '{"url": "registry.local:5000", "username": "user-deployer", "password": "xxxzz"}'

After setting this key and running `ceph mgr fail` to reload, my cluster issue was gone.
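The failure mode described above is the mgr calling json.loads() on an empty or malformed config-key value. A small sketch for sanity-checking the value before setting it; the key name is taken from the message, and the required fields shown are an assumption based on the example value above, not a verified cephadm schema:

```python
import json

# Fields assumed from the example value in this thread; verify against
# your cephadm version's source before relying on this.
REQUIRED = ("url", "username", "password")

def check_registry_credentials(raw: str) -> list:
    """Return a list of problems with a mgr/cephadm/registry_credentials
    value; an empty list means it should at least parse cleanly."""
    problems = []
    if not raw or not raw.strip():
        return ["value is empty - a bare json.loads() on it will fail"]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return ["not valid JSON: %s" % e]
    if not isinstance(data, dict):
        return ["expected a JSON object"]
    problems += ["missing field: %s" % f for f in REQUIRED if f not in data]
    return problems

good = ('{"url": "registry.local:5000", "username": "user-deployer", '
        '"password": "xxxzz"}')
print(check_registry_credentials(good))  # []
print(check_registry_credentials(""))    # one problem reported
```

Running the good value through `python3 -c ...` before `ceph config-key set` would have caught the empty-value case early.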
[ceph-users] Re: [multisite] The purpose of zonegroup
thanks Yixin,

On Tue, Jul 4, 2023 at 1:20 PM Yixin Jin wrote:
>
> Hi Casey,
> Thanks a lot for the clarification. I feel that zonegroup made great sense
> at the beginning when the multisite feature was conceived and (I suspect) zones
> were always syncing from all other zones within a zonegroup. However, once
> "sync_from" was introduced and later the sync policy further enhanced the
> granularity of the control over data sync, it seems not much advantage is
> left with zonegroup.

both sync_from and sync policy do offer finer-grained control over which zones sync from which, but they can't represent a bucket's 'residency' the way that the zonegroup-based LocationConstraint does. by redirecting requests to the bucket's resident zonegroup, the goal is to present a single eventually-consistent set of objects per bucket. while features like sync_from and bucket replication policy do complicate this picture, i think this concept of residency and redirects is important to make sense of s3's LocationConstraint. but perhaps the sync policy model could be extended to take over the zonegroup's role here?

> Both "sync_from" and sync policy could be moved up to realm level while the
> isolation of datasets can still be maintained. On the other hand, if some new
> features are introduced to enable some isolation of metadata within the same
> realm, probably at zonegroup level, its usefulness may be more justified.
>
> Regards,
> Yixin
>
> On Friday, June 30, 2023 at 11:29:16 a.m. EDT, Casey Bodley wrote:
>
> you're correct that the distinction is between metadata and data;
> metadata like users and buckets will replicate to all zonegroups,
> while object data only replicates within a single zonegroup. any given
> bucket is 'owned' by the zonegroup that creates it (or overridden by
> the LocationConstraint on creation).
requests for data in that bucket
> sent to other zonegroups should redirect to the zonegroup where it
> resides
>
> the ability to create multiple zonegroups can be useful in cases where
> you want some isolation for the datasets, but a shared namespace of
> users and buckets. you may have several connected sites sharing
> storage, but only require a single backup for purposes of disaster
> recovery. there it could make sense to create several zonegroups with
> only two zones each to avoid replicating all objects to all zones
>
> in other cases, it could make more sense to isolate things in separate
> realms with a single zonegroup each. zonegroups just provide some
> flexibility to control the isolation of data and metadata separately
>
> On Thu, Jun 29, 2023 at 5:48 PM Yixin Jin wrote:
> >
> > Hi folks,
> > In the multisite environment, we can get one realm that contains multiple
> > zonegroups, each in turn can have multiple zones. However, the purpose of
> > zonegroup isn't clear to me. It seems that when a user is created, its
> > metadata is synced to all zones within the same realm, regardless of whether
> > they are in different zonegroups or not. The same happens to buckets.
> > Therefore, what is the purpose of having zonegroups? Wouldn't it be easier
> > to just have realm and zones?
> > Thanks,
> > Yixin
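The metadata-vs-data distinction Casey describes can be summarized as two replication scopes: user/bucket metadata is realm-wide, while object data stays within the bucket's resident zonegroup. A toy model of just that rule, with made-up zonegroup and zone names for illustration:

```python
# Toy model of RGW multisite replication scopes: metadata (users, buckets)
# replicates to every zone in the realm; object data replicates only to
# zones in the bucket's resident zonegroup. Names are illustrative.

class Realm:
    def __init__(self, zonegroups):
        # zonegroups: {zonegroup_name: [zone_name, ...]}
        self.zonegroups = zonegroups

    def zones_seeing_metadata(self):
        """Users and buckets replicate realm-wide."""
        return sorted(z for zones in self.zonegroups.values() for z in zones)

    def zones_seeing_objects(self, bucket_zonegroup):
        """Object data stays within the bucket's resident zonegroup."""
        return sorted(self.zonegroups[bucket_zonegroup])

realm = Realm({"zg-eu": ["eu-1", "eu-2"], "zg-us": ["us-1", "us-2"]})
print(realm.zones_seeing_metadata())        # ['eu-1', 'eu-2', 'us-1', 'us-2']
print(realm.zones_seeing_objects("zg-eu"))  # ['eu-1', 'eu-2']
```

This ignores sync_from and bucket sync policy, which (as discussed above) further restrict which zones actually sync what.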
[ceph-users] Re: [multisite] The purpose of zonegroup
The information provided by Casey has been added to doc/radosgw/multisite.rst in this PR: https://github.com/ceph/ceph/pull/52324

Zac Dover
Upstream Docs
Ceph Foundation

--- Original Message ---
On Saturday, July 1st, 2023 at 1:45 AM, Casey Bodley wrote:

> cc Zac, who has been working on multisite docs in
> https://tracker.ceph.com/issues/58632
>
> On Fri, Jun 30, 2023 at 11:37 AM Alexander E. Patrakov patra...@gmail.com wrote:
>
> > Thanks! This is something that should be copy-pasted at the top of
> > https://docs.ceph.com/en/latest/radosgw/multisite/
> >
> > Actually, I reported a documentation bug for something very similar.
> >
> > On Fri, Jun 30, 2023 at 11:30 PM Casey Bodley cbod...@redhat.com wrote:
> >
> > > you're correct that the distinction is between metadata and data;
> > > metadata like users and buckets will replicate to all zonegroups,
> > > while object data only replicates within a single zonegroup. any given
> > > bucket is 'owned' by the zonegroup that creates it (or overridden by
> > > the LocationConstraint on creation). requests for data in that bucket
> > > sent to other zonegroups should redirect to the zonegroup where it
> > > resides
> > >
> > > the ability to create multiple zonegroups can be useful in cases where
> > > you want some isolation for the datasets, but a shared namespace of
> > > users and buckets. you may have several connected sites sharing
> > > storage, but only require a single backup for purposes of disaster
> > > recovery. there it could make sense to create several zonegroups with
> > > only two zones each to avoid replicating all objects to all zones
> > >
> > > in other cases, it could make more sense to isolate things in separate
> > > realms with a single zonegroup each.
zonegroups just provide some
> > > flexibility to control the isolation of data and metadata separately
> > >
> > > On Thu, Jun 29, 2023 at 5:48 PM Yixin Jin yji...@yahoo.ca wrote:
> > >
> > > > Hi folks,
> > > > In the multisite environment, we can get one realm that contains
> > > > multiple zonegroups, each in turn can have multiple zones. However, the
> > > > purpose of zonegroup isn't clear to me. It seems that when a user is
> > > > created, its metadata is synced to all zones within the same realm,
> > > > regardless of whether they are in different zonegroups or not. The same
> > > > happens to buckets. Therefore, what is the purpose of having
> > > > zonegroups? Wouldn't it be easier to just have realm and zones?
> > > > Thanks,
> > > > Yixin
>
> --
> Alexander E. Patrakov
[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device
On 7/4/23 10:39, Matthew Booth wrote:

On Tue, 4 Jul 2023 at 10:00, Matthew Booth wrote:

On Mon, 3 Jul 2023 at 18:33, Ilya Dryomov wrote:

On Mon, Jul 3, 2023 at 6:58 PM Mark Nelson wrote:

On 7/3/23 04:53, Matthew Booth wrote:

On Thu, 29 Jun 2023 at 14:11, Mark Nelson wrote:

This container runs:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf --output-format=json --runtime=60 --time_based=1

and extracts sync.lat_ns.percentile["99.00"].

Matthew, do you have the rest of the fio output captured? It would be interesting to see if it's just the 99th percentile that is bad or the PWL cache is worse in general.

Sure.

With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false: https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/

Also, how's the CPU usage client side? I would be very curious to see if unwindpmp shows anything useful (especially lock contention): https://github.com/markhpc/uwpmp

Just attach it to the client-side process and start out with something like 100 samples (more are better but take longer). You can run it like:

./unwindpmp -n 100 -p

I've included the output in this gist: https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d

That gist contains 4 runs: 2 with PWL enabled and 2 without, and also a markdown file explaining the collection method.

Matt

Thanks Matt! I looked through the output. Looks like the symbols might have gotten mangled. I'm not an expert on the RBD client, but I don't think we would really be calling into rbd_group_snap_rollback_with_progress from librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl. Was it possible you used the libdw backend for unwindpmp? libdw sometimes gives strange/mangled callgraphs, but I haven't seen it before with libunwind. Hopefully Congmin Yin or Ilya can confirm if it's garbage.
So with that said, assuming we can trust these callgraphs at all, it looks like it might be worth looking at the latency of the AbstractWriteLog, librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl, and possibly usage of librados::v14_2_0::IoCtx::object_list.

Hi Mark,

Both rbd_group_snap_rollback_with_progress and librados::v14_2_0::IoCtx::object_list entries don't make sense to me, so I'd say it's garbage.

Unfortunately I'm not at all familiar with this tool. Do you know how it obtains its symbols? I didn't install any debuginfo packages, so I was a bit surprised to see any symbols at all.

I installed the following debuginfo packages and re-ran the tests:

elfutils-debuginfod-client-0.189-2.fc38.x86_64
elfutils-debuginfod-client-devel-0.189-2.fc38.x86_64
ceph-debuginfo-17.2.6-3.fc38.x86_64
librbd1-debuginfo-17.2.6-3.fc38.x86_64
librados2-debuginfo-17.2.6-3.fc38.x86_64
qemu-debuginfo-7.2.1-2.fc38.x86_64
qemu-system-x86-core-debuginfo-7.2.1-2.fc38.x86_64
boost-debuginfo-1.78.0-11.fc38.x86_64

Note that unwindpmp now runs considerably slower (because it re-reads debug symbols for each sample?), so I had to reduce the number of samples to 500.

It basically just uses libunwind or libdw to unwind the stack over and over, and then unwindpmp turns the resulting samples into a forward or reverse call graph. The libunwind backend code is here: https://github.com/markhpc/uwpmp/blob/master/src/tracer/unwind_tracer.cc

I'm sort of amazed that it gave you symbols without the debuginfo packages installed. I'll need to figure out a way to prevent that. Having said that, your new traces look more accurate to me. The thing that sticks out to me is the (slight?) amount of contention on the PWL m_lock in dispatch_deferred_writes, update_root_scheduled_ops, append_ops, append_sync_point(), etc. I don't know if the contention around the m_lock is enough to cause an increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first thing that jumps out at me.
There appears to be a large number of threads (each tp_pwl thread, the io_context_pool threads, the qemu thread, and the bstore_aio thread) that all appear to have the potential to contend on that lock. You could try dropping the number of tp_pwl threads from 4 to 1 and see if that changes anything.

Mark

I have updated the gist with the new results: https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d

Thanks,
Matt

--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
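For anyone reproducing the comparison in this thread, the headline numbers ("rbd with pwl cache: 5210112 ns", etc.) are fio's 99th-percentile fsync latency. A small sketch of pulling that value out of fio's JSON output; note the thread writes the key as "99.00", while fio versions I've seen emit six decimal places ("99.000000"), so adjust to whatever your fio actually produces. The inline JSON is a stand-in for real fio output:

```python
import json

def p99_sync_latency_ns(fio_json: str, job: str = "etcd_perf") -> int:
    """Extract the 99th-percentile fsync latency (ns) for the named job
    from `fio --output-format=json` output. Key naming ("99.000000")
    may vary between fio versions."""
    data = json.loads(fio_json)
    for j in data["jobs"]:
        if j["jobname"] == job:
            return int(j["sync"]["lat_ns"]["percentile"]["99.000000"])
    raise KeyError(job)

# Stand-in for real fio output, using the number quoted in this thread:
sample = json.dumps({"jobs": [{"jobname": "etcd_perf",
    "sync": {"lat_ns": {"percentile": {"99.000000": 5210112}}}}]})
print(p99_sync_latency_ns(sample))  # 5210112
```

The full percentile map in the same JSON also answers Mark's earlier question of whether only the 99th percentile is bad or the whole distribution shifts with the PWL cache.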
[ceph-users] Slow ACL Changes in Secondary Zone
Dear Ceph community,

I'm facing an issue with ACL changes in the secondary zone of my Ceph cluster after making modifications to the API name in the master zone of my master zonegroup. I would appreciate any insights or suggestions on how to resolve this problem.

Here's the background information on my setup:

- I have two clusters that are part of a single realm.
- Each cluster has a zone within a single zonegroup.
- Initially, all functionality was working perfectly fine.

However, the problem arose when I changed the API name in the master zone of my master zonegroup. Since then, all functionalities appear to be functioning as expected, except for ACL changes, which have become extremely slow specifically in the secondary zone. Whenever I attempt to change the ACL in the secondary zone, it takes approximately 10 seconds or more for a response to be received.

I would like to understand why this delay is occurring and find a solution to improve the performance of ACL changes in the secondary zone. Any suggestions, explanations, or guidance would be greatly appreciated.

Thank you in advance for your help.
[ceph-users] Mishap after disk replacement, db and block split into separate OSD's in ceph-volume
We had a faulty disk which was causing many errors, and replacement took a while, so we had to try to stop ceph from using the OSD during this time. However I think we must have done that wrong, and after the disk replacement our ceph orch seems to have picked up /dev/sdp and added a new osd automatically (588), without a separate DB device (since that was still taken by the old OSD 31 maybe? I'm not sure how to ). This led to issues where osd31 of course wouldn't start, and some actions were attempted to clear this out, which might have just caused more harm.

Long story short, we are currently in an odd position where we still have ceph-volume lvm list osd.31 with only a [db] section:

====== osd.31 ======

  [db]          /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e

      block device              /dev/ceph-48f7dbd8-4a7c-4f7e-8962-104e756ae864/osd-block-33538b36-52b3-421d-bf66-6c729a057707
      block uuid                bykFYi-z8T6-OWXp-i1OB-H7CE-uLDm-Td6QTI
      cephx lockbox secret
      cluster fsid              5406fed0-d52b-11ec-beff-7ed30a54847b
      cluster name              ceph
      crush device class        None
      db device                 /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e
      db uuid                   Vy3aOA-qseQ-RIDT-741e-z7o0-y376-kKTXRE
      encrypted                 0
      osd fsid                  33538b36-52b3-421d-bf66-6c729a057707
      osd id                    31
      osdspec affinity          osd_spec
      type                      db
      vdo                       0
      devices                   /dev/nvme0n1

and a separate extra osd.588 (which is running) which has taken only the [block] device:

====== osd.588 ======

  [block]       /dev/ceph-f63ef837-3b18-47a4-be55-d5c2c0db8927/osd-block-58b33b8f-9623-46b3-a86a-3061602a76b5

      block device              /dev/ceph-f63ef837-3b18-47a4-be55-d5c2c0db8927/osd-block-58b33b8f-9623-46b3-a86a-3061602a76b5
      block uuid                KYHzBq-zgJJ-Nw93-j7Jx-Oz5i-BMuU-ndtTCH
      cephx lockbox secret
      cluster fsid              5406fed0-d52b-11ec-beff-7ed30a54847b
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  58b33b8f-9623-46b3-a86a-3061602a76b5
      osd id                    588
      osdspec affinity          all-available-devices
      type                      block
      vdo                       0
      devices                   /dev/sdp

I figured the best action was
to clear out both of these faulty OSDs via orch ("ceph orch osd rm XX"), but osd 31 isn't recognized:

[ceph: root@mimer-osd01 /]# ceph orch osd rm 31
Unable to find OSDs: ['31']

Deleting 588 is recognized. Should I attempt to clear out the osd.31 from ceph-volume manually? I'd really like to get back to a situation where I have osd.31 with the osd fsid that matches the device names, with /dev/sdp and /dev/nvme0n1, but I'm really afraid of just breaking things even more.

From what I can see from files laying around, the OSD spec we have is simply:

placement:
  host_pattern: "mimer-osd01"
service_id: osd_spec
service_type: osd
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

in case this matters. I appreciate any help or guidance.

Best regards,
Mikael
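Before touching anything, it may help to dump `ceph-volume lvm list --format json` and check programmatically which OSD ids carry only a [db] or only a [block] piece, so you have a definitive list of the split pairs. The sketch assumes the JSON is keyed by osd id with a per-device "type" field, which matches the text output above, but verify against your ceph-volume version; the sample data is a cut-down stand-in with placeholder paths:

```python
import json

def device_types_by_osd(cv_json: str) -> dict:
    """Map osd id -> sorted list of device types ('block', 'db', ...) from
    `ceph-volume lvm list --format json` output. An OSD showing only
    ['db'] (like osd.31 here) or only ['block'] (osd.588) is one half
    of a split pair."""
    data = json.loads(cv_json)
    return {osd: sorted({d.get("type") for d in devs})
            for osd, devs in data.items()}

# Stand-in for the real ceph-volume output on this host (paths elided):
sample = json.dumps({
    "31":  [{"type": "db",    "path": "/dev/ceph-.../osd-db-..."}],
    "588": [{"type": "block", "path": "/dev/ceph-.../osd-block-..."}],
})
print(device_types_by_osd(sample))  # {'31': ['db'], '588': ['block']}
```

A healthy OSD deployed from your spec should show both 'block' and 'db'; anything else is a candidate for cleanup once you've decided how to handle the orphaned DB LV.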